Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task genrator debug api #9058

Merged
merged 23 commits into from
Jul 22, 2022
Merged

Conversation

saurabhd336
Copy link
Contributor

@saurabhd336 saurabhd336 commented Jul 14, 2022

Save task generator info into an inmemory map, and provide a controller API to fetch the details for a given table
design doc: https://docs.google.com/document/d/1bCn13k57WLSaIAQGfATP75VS3p6V_ievlBi_HoOtmZ8/edit?usp=sharing

Release Notes

Added an API GET /tasks/generator/{tableNameWithType}/{taskType}/debug which returns the success time / error message for the 5 most recent task generator runs.

@saurabhd336
Copy link
Contributor Author

@npawar @mcvsubbu I've raised this PR in place of the now closed PR #9043

@mcvsubbu
Copy link
Contributor

Overall, I am a plus 1 on more minion debuggability.

My view is that we should add APIs to minions similar to that of servers, and pull debugging info from there and display it.

@npawar
Copy link
Contributor

npawar commented Jul 14, 2022

Overall, I am a plus 1 on more minion debuggability.

My view is that we should add APIs to minions similar to that of servers, and pull debugging info from there and display it.

We already have minion APIs (look for /debug endpoint in Tasks tab). This is for scheduler errors, and the scope of scheduler remains limited to the controller, hence thought it makes most sense as an API here

} catch (Exception e) {
for (TableConfig tableConfig : enabledTableConfigs) {
try {
StringWriter errors = new StringWriter();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems we will do the same thing for each table config, and then put the same stacktrace for every table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -37,7 +37,13 @@ public FakePropertyStore() {

@Override
public ZNRecord get(String path, Stat stat, int options) {
return _contents.get(path);
ZNRecord znRecord = _contents.get(path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change needed?

/**
* Returns the current information version, it should be consistent with the corresponding {@link ZNRecord} version.
*/
public int getVersion() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prolly don't need this method either

_tableNameWithType = tableNameWithType;
_taskType = taskType;
// sort and keep the most recent several error messages
_mostRecentErrorRunMessage = new TreeMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems this constructor is only ever called from within this same class, with empty list and map. Do we need all this sorting logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If at all we move to persistent storage, we might need to serialize / deserialize instances of this class using an allArgsConstructor

@codecov-commenter
Copy link

codecov-commenter commented Jul 15, 2022

Codecov Report

Merging #9058 (35d8326) into master (f0a4b44) will decrease coverage by 1.31%.
The diff coverage is 26.38%.

❗ Current head 35d8326 differs from pull request most recent head 9fa0078. Consider uploading reports for the commit 9fa0078 to get more accurate results

@@             Coverage Diff              @@
##             master    #9058      +/-   ##
============================================
- Coverage     70.06%   68.74%   -1.32%     
+ Complexity     4743     4656      -87     
============================================
  Files          1832     1838       +6     
  Lines         96971    97224     +253     
  Branches      14620    14640      +20     
============================================
- Hits          67939    66840    -1099     
- Misses        24305    25680    +1375     
+ Partials       4727     4704      -23     
Flag Coverage Δ
integration1 26.25% <20.84%> (-0.14%) ⬇️
integration2 ?
unittests1 66.85% <1.16%> (-0.06%) ⬇️
unittests2 15.30% <10.86%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ache/pinot/broker/broker/AccessControlFactory.java 73.33% <ø> (-6.67%) ⬇️
...broker/broker/ZkBasicAuthAccessControlFactory.java 0.00% <ø> (ø)
...e/pinot/broker/broker/helix/BaseBrokerStarter.java 76.99% <ø> (-2.82%) ⬇️
.../BrokerResourceOnlineOfflineStateModelFactory.java 55.81% <ø> (ø)
...quota/HelixExternalViewBasedQueryQuotaManager.java 67.71% <ø> (-1.35%) ⬇️
...che/pinot/broker/routing/BrokerRoutingManager.java 85.31% <ø> (ø)
...elector/SegmentLineageBasedSegmentPreSelector.java 100.00% <ø> (ø)
.../segmentpreselector/SegmentPreSelectorFactory.java 100.00% <ø> (ø)
...oker/routing/segmentpruner/EmptySegmentPruner.java 87.93% <ø> (ø)
...mentpruner/MultiPartitionColumnsSegmentPruner.java 73.91% <ø> (ø)
... and 225 more

Help us with your feedback. Take ten seconds to tell us how you rate us.



/**
* Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format: move this header to top

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@mcvsubbu
Copy link
Contributor

Overall, I am a plus 1 on more minion debuggability.
My view is that we should add APIs to minions similar to that of servers, and pull debugging info from there and display it.

We already have minion APIs (look for /debug endpoint in Tasks tab). This is for scheduler errors, and the scope of scheduler remains limited to the controller, hence thought it makes most sense as an API here

I don't see a minion debug REST endpoint in the code. Am I missing something?

Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saurabhd336 thanks a lot for your contribution. It has been hard to debug minions, so I totally understand your pain point. However, I have a few general questions:

  1. What is the use case where you think this will help? (we have some use cases where we saw problems, and instead of introducing one new API per use case, I am looking to make sure we do something that works for more than one. @sajjad-moradi can review from our standpoint).
  2. I know @npawar referred to a debug API on the minion, but I could not locate one. Please correct me if I am wrong. I think we should issue an API to the minions as well (liek we do with servers) and get the latest status form them and display it in the controller.

@ApiParam(value = "Table name with type", required = true) @PathParam("tableNameWithType")
String tableNameWithType) {
BaseTaskGeneratorInfo
taskGeneratorMostRecentRunInfo = _taskManagerStatusCache.fetchTaskGeneratorInfo(tableNameWithType, taskType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the table is not mastered on this controller?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saurabhd336 thanks a lot for your contribution. It has been hard to debug minions, so I totally understand your pain point. However, I have a few general questions:

  1. What is the use case where you think this will help? (we have some use cases where we saw problems, and instead of introducing one new API per use case, I am looking to make sure we do something that works for more than one. @sajjad-moradi can review from our standpoint).
  2. I know @npawar referred to a debug API on the minion, but I could not locate one. Please correct me if I am wrong. I think we should issue an API to the minions as well (liek we do with servers) and get the latest status form them and display it in the controller.

Hey Subbu, this is the one for Minion tasks: /tasks/task/{taskName}/debug. In the list of subtasks returned, there's an "info" field which contains the exception (if any) encountered by this task

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good catch about the lead controller! this won't work as is, we'll have to think of how to handle that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 2. I feel it's not very convenient when one auto scales minion workers frequently based on pending tasks. The debug API would be either not reachable when no workers are around or get no past task states to check when new workers are added

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we have helix APIs for that, right? Has that avenue been fully explored? We are migrating to Helix 1.x) very soon (we are getting to production with that), so please check in 1.x pages what is available for task management and debugging, thanks. We will also do the same.
cc: @sajjad-moradi

Copy link
Contributor

@sajjad-moradi sajjad-moradi Jul 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two parts:

  • generating task configurations which happens in pinot code
  • submitting the task via TaskDriver which is in Helix code

Subbu has a valid point that helix should provide some info on failure on the 2nd part - task submission. The first part though is before reaching helix and it's good if we have an API to capture issues in task generation in pinot code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also scheduling tasks can be done by either:

  • periodic task generator job on lead controller, or
  • task REST endpoint which can be on any controller

So this means the approach in this PR doesn't work as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. The status can indeed be stored across different controllers. I was trying the following two approaches
Upon hitting this API,

  1. The controller makes a HTTP call to every other controller to get their respective in-memory debug statuses, waits for the responses and finally responds back with a list of all the responses.
  2. Use the helix messaging framework to broadcast a message to each controller, and via the AsyncCallback, collate all the replies and respond back (https://github.com/saurabhd336/pinot/pull/9/files)

I'll test and evaluate both approaches and update this PR accordingly.

@saurabhd336
Copy link
Contributor Author

@mcvsubbu @sajjad-moradi @npawar @klsince I've tried addressing the issue with the debug data being present across different controllers by letting the main controller query every other controller for the status data.
Please have a look at the updated PR

Copy link
Contributor

@klsince klsince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* @param ts A timestamp
*/
public void addSuccessRunTs(long ts) {
_mostRecentSuccessRunTS.add(ts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: any need to keep _mostRecentSuccessRunTS sorted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplifies the eviction is all. Also the APi response is easier to work with (the most recent timestamp is the one of the top)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this indicate a successful run (completion) of the task, or a successful generation of the task? If the task fails in minion, does your API address it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only indicates the successful generation of a the task. This API is exclusively for task generation

pinotTaskConfigs = taskGenerator.generateTasks(enabledTableConfigs);
for (TableConfig tableConfig : enabledTableConfigs) {
try {
_taskManagerStatusCache.saveTaskGeneratorInfo(tableConfig.getTableName(), taskGenerator.getTaskType(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a saveTaskGeneratorInfoQuietly() to save the try-catch block for brevity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the in-memory storage, I don't see any exceptions that can be thrown. Have removed the try catch block

public class InMemoryTaskManagerStatusCache extends TaskManagerStatusCache<TaskGeneratorMostRecentRunInfo> {

private static class TaskGeneratorCacheKey {
String _tableNameWithType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) make them final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

private final String _taskType;
private final String _tableNameWithType;
// the timestamp to error message map of the most recent several error runs
@Nonnull
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) We don't usually use Nonnull. Everything without annotation are treated non-null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Comment on lines 35 to 36
* several error run
* message.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor)

Suggested change
* several error run
* message.
* several error run messages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Comment on lines 41 to 44
@VisibleForTesting
static final String MOST_RECENT_SUCCESS_RUN_TS = "mostRecentSuccessRunTs";
@VisibleForTesting
static final String MOST_RECENT_ERROR_RUN_MESSAGE = "mostRecentErrorRunMessage";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems not used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was being used when we had ZK based storage. Removed

_tableNameWithType = tableNameWithType;
_taskType = taskType;
// sort and keep the most recent several error messages
_mostRecentErrorRunMessage = new TreeMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we decided to not use ZNode to store the information, suggest simplifying it to only take tableNameWithType and taskType. We may change it when we decide to persist the info in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

private final String _tableNameWithType;
// the timestamp to error message map of the most recent several error runs
@Nonnull
private final TreeMap<Long, String> _mostRecentErrorRunMessage;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private final TreeMap<Long, String> _mostRecentErrorRunMessage;
private final TreeMap<Long, String> _mostRecentErrorRunMessages;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider storing timestamp as java.sql.Timestamp so that the response can show human readable time

try {
result.add(JsonUtils.stringToJsonNode(resp));
} catch (IOException e) {
LOGGER.error("Failed to parse ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

List<PinotTaskConfig> pinotTaskConfigs;
try {
pinotTaskConfigs = taskGenerator.generateTasks(enabledTableConfigs);
for (TableConfig tableConfig : enabledTableConfigs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a TODO here to separate the generated task/exception for each table. Currently even though some table doesn't have anything scheduled (e.g. enough tasks are scheduled from previous tables), we will still put success TS for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't get this. This method is only called for enabledTableConfigs i.e. only for tables for which the particular task is enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a discussion offline. Added the todo

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

public TreeMap<String, String> getMostRecentErrorRunMessages() {
TreeMap<String, String> result = new TreeMap<>();
_mostRecentErrorRunMessages.forEach((timestamp, error) -> result.put(
OffsetDateTime.ofInstant(Instant.ofEpochMilli(timestamp), ZoneId.of("UTC")).toString(), error));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor)

Suggested change
OffsetDateTime.ofInstant(Instant.ofEpochMilli(timestamp), ZoneId.of("UTC")).toString(), error));
OffsetDateTime.ofInstant(Instant.ofEpochMilli(timestamp), ZoneOffset.UTC).toString(), error));

We can also consider putting ZoneId.systemDefault() which might be more readable, but it might also be confusing when the client is in a different timezone from the server. IMO both ways have pros and cons, so your call

Copy link
Contributor Author

@saurabhd336 saurabhd336 Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel keeping everything UTC might be more user friendly especially when pinot clusters are deployed in different geographies. Keeping it UTC for now.

import java.util.function.Consumer;


public abstract class TaskManagerStatusCache<T extends BaseTaskGeneratorInfo> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest changing it to an interface

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@Jackie-Jiang Jackie-Jiang added the release-notes Referenced by PRs that need attention when compiling the next release notes label Jul 22, 2022
@Jackie-Jiang Jackie-Jiang merged commit 463bf97 into apache:master Jul 22, 2022
}

// Call all controllers
List<InstanceConfig> controllers = _pinotHelixResourceManager.getAllControllerInstanceConfigs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to send request to all controllers. Just to the lead controller is enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since POST '/tasks/schedule' API lets one schedule tasks for a given table on any controller (non leader included), we do need to query all controllers here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I was not clear. You can field the request in any controller (like you do). If that is not the lead for the table, then locate the leader controller (see the class PinotLeadControllerRestletResource for how to do that) and then forward the request to just the lead controller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, so the POST actually has any controller schedule the job for a table ? Hmm... maybe we should fix hat too. But I see what you are saying here. My bad.

@mcvsubbu
Copy link
Contributor

I had asked some questions about the use case you are tackling. Can you please answer those? How frequently are you generating tasks (in your use case)? Is a metric bump not enough for your use case? Can log processing help solving this use case?

@saurabhd336
Copy link
Contributor Author

@mcvsubbu

  1. @npawar @Jackie-Jiang ? I might have just very rough, possibly inaccurate numbers.

  2. I feel the need of a control plane level API within pinot to give an overall view into current and past state of minion tasks is of importance to us. Task generator being a key part of the entire minion task flow. While metrics can help to some extent, having details like failure stack traces etc might be difficult. This api avoids having to tally metrics and debug logs from a separate log processing system.

  3. I suppose it could. But integrating the log processing framework into pinot APIs themselves might be a bit of a challenge. Having a system table for these kind of usecases might be the right way forward, such that pinot itself can store and serve debug data and status metrics for each of the components / flows. Essentially move from in-memory storage of logs and metrics into the system table

@npawar @Jackie-Jiang to add more

@mcvsubbu
Copy link
Contributor

@mcvsubbu

  1. @npawar @Jackie-Jiang ? I might have just very rough, possibly inaccurate numbers.
  2. I feel the need of a control plane level API within pinot to give an overall view into current and past state of minion tasks is of importance to us. Task generator being a key part of the entire minion task flow. While metrics can help to some extent, having details like failure stack traces etc might be difficult. This api avoids having to tally metrics and debug logs from a separate log processing system.

I agree there is a need to debug minions. However this API does not provide any new insights other than what Helix already provides. So, let us work off of some concrete examples -- what is the use case in which you think this will help?

  1. I suppose it could. But integrating the log processing framework into pinot APIs themselves might be a bit of a challenge. Having a system table for these kind of usecases might be the right way forward, such that pinot itself can store and serve debug data and status metrics for each of the components / flows. Essentially move from in-memory storage of logs and metrics into the system table

Every installation has its own way of processing logs. Cloud installations typically pipe the logs into a log processing service to get insights into it. On-prem installations have either their own way of piping the logs, or ways to get onto the host and look for the logs.

@npawar @Jackie-Jiang to add more

The reason I am reluctant to add more APIs is that the more we add, the more we need to keep backward compatibility on. My question to @saurabhd336 is : Have you looked at what Helix has to offer? If so, what are the short-comings?

@npawar
Copy link
Contributor

npawar commented Jul 27, 2022

@mcvsubbu

  1. @npawar @Jackie-Jiang ? I might have just very rough, possibly inaccurate numbers.
  2. I feel the need of a control plane level API within pinot to give an overall view into current and past state of minion tasks is of importance to us. Task generator being a key part of the entire minion task flow. While metrics can help to some extent, having details like failure stack traces etc might be difficult. This api avoids having to tally metrics and debug logs from a separate log processing system.
  3. I suppose it could. But integrating the log processing framework into pinot APIs themselves might be a bit of a challenge. Having a system table for these kind of usecases might be the right way forward, such that pinot itself can store and serve debug data and status metrics for each of the components / flows. Essentially move from in-memory storage of logs and metrics into the system table

@npawar @Jackie-Jiang to add more

for 1: typically for users of RealtimeToOfflineTask, tasks get generated hourly. For SegmentGenerationAndPushTask, it can be way more frequently, depending on the the number of times files are generated in the source dir. MergeRollupTasks, are less frequent, but still several a day. There might be others, but this is what we see most commonly setup by users in oss. In a typical setup, all of these would be configured. It becomes quite confusing for users to have to find the exact exception in the logs, especially because some logs are in controller (scheduler related) and some in minion (task execution related). This API will help us make the feedback loop quicker, especially when we add this into the new Minion tab on the Pinot Admin UI

Regarding info already in Helix, these scheduler related exceptions are not present in the Helix generated metadata.

/**
* Base abstract class for task generator info.
*/
public abstract class BaseTaskGeneratorInfo {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to add implements Serializable

* @param ts A timestamp
*/
public void addSuccessRunTs(long ts) {
_mostRecentSuccessRunTS.add(ts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this indicate a successful run (completion) of the task, or a successful generation of the task? If the task fails in minion, does your API address it?

@mcvsubbu
Copy link
Contributor

@mcvsubbu

  1. @npawar @Jackie-Jiang ? I might have just very rough, possibly inaccurate numbers.
  2. I feel the need of a control plane level API within pinot to give an overall view into current and past state of minion tasks is of importance to us. Task generator being a key part of the entire minion task flow. While metrics can help to some extent, having details like failure stack traces etc might be difficult. This api avoids having to tally metrics and debug logs from a separate log processing system.
  3. I suppose it could. But integrating the log processing framework into pinot APIs themselves might be a bit of a challenge. Having a system table for these kind of usecases might be the right way forward, such that pinot itself can store and serve debug data and status metrics for each of the components / flows. Essentially move from in-memory storage of logs and metrics into the system table

@npawar @Jackie-Jiang to add more

for 1: typically for users of RealtimeToOfflineTask, tasks get generated hourly. For SegmentGenerationAndPushTask, it can be way more frequently, depending on the the number of times files are generated in the source dir. MergeRollupTasks, are less frequent, but still several a day. There might be others, but this is what we see most commonly setup by users in oss. In a typical setup, all of these would be configured. It becomes quite confusing for users to have to find the exact exception in the logs, especially because some logs are in controller (scheduler related) and some in minion (task execution related). This API will help us make the feedback loop quicker, especially when we add this into the new Minion tab on the Pinot Admin UI

Regarding info already in Helix, these scheduler related exceptions are not present in the Helix generated metadata.

I still think log processing should be the answer to this (and other similar PRs that may come up in the future). We should not be adding a new API for every error condition we may encounter in the system (and log something).

@siddharthteotia , @npawar , @Jackie-Jiang , @snleee, @kishoreg, @mayankshriv what do you think? If the PMCs don't have any objection to this then I can live with this, but I am willing to bet that more such PRs will come up because logs are difficult to read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants