Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set MDC so that one can route minion task logs to separate files cleanly #9400

Merged
merged 3 commits into from
Sep 15, 2022

Conversation

klsince
Copy link
Contributor

@klsince klsince commented Sep 14, 2022

Allow one to route all logs (including exception stack trace) emit during a minion task execution to a separate file, so that the task logs can be accessed via the download/log restful API added recently. Using MDC, so that we can capture the task logs completely, assuming the task is done by a single thread on minion worker which is the case today.

Added an example log4j2.xml for reference.

Alternatively, we might search logs from the default log file with the taskId, but that wouldn't work cleanly: 1) not all task logs are tagged with taskId; 2) the exception stacktrace would be missed; 3) costly to search log msgs from a large log file every time, and potentially from archived log files.

@klsince
Copy link
Contributor Author

klsince commented Sep 14, 2022

example:

➜  pinot-oss git:(allow_separate_task_logs) ✗ ls -l ../logs
total 16736
-rw-r--r--  1 xiaobing  staff     5966 Sep 14 14:51 minionTaskLogs-Task_RealtimeToOfflineSegmentsTask_10102576-6dd0-416b-80e0-d25cc9ab65f8_1663192300019_0.log
-rw-r--r--  1 xiaobing  staff     5966 Sep 14 14:52 minionTaskLogs-Task_RealtimeToOfflineSegmentsTask_198d2193-89ce-4c4c-a576-8fa5b04d702d_1663192330009_0.log
...
-rw-r--r--  1 xiaobing  staff     6845 Sep 14 14:53 minionTaskLogs-default.log
-rw-r--r--  1 xiaobing  staff  7757824 Sep 14 15:13 pinot-all.log
➜  pinot-oss git:(allow_separate_task_logs) ✗ curl -X GET "http://localhost:9000/loggers/instances/Minion_192.168.0.106_6000/download?filePath=minionTaskLogs-Task_RealtimeToOfflineSegmentsTask_10102576-6dd0-416b-80e0-d25cc9ab65f8_1663192300019_0.log" -H "accept: application/octet-stream" 
...
2022/09/14 14:51:40.308 INFO [TaskFactoryRegistry] [TaskStateModelFactory-task_thread-0] Start running
...
2022/09/14 14:51:40.419 ERROR [TaskFactoryRegistry] [TaskStateModelFactory-task_thread-0] Caught exception while executing task: Task_RealtimeToOfflineSegmentsTask_10102576-6dd0-416b-80e0-d25cc9ab65f8_1663192300019_0
java.lang.IllegalStateException: Failed to find table config for table: githubEvents_OFFLINE
	at com.google.common.base.Preconditions.checkState(Preconditions.java:518) ~[guava-20.0.jar:?]
...
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
2022/09/14 14:51:40.419 INFO [TaskFactoryRegistry] [TaskStateModelFactory-task_thread-0] Task: Task_RealtimeToOfflineSegmentsTask_10102576-6dd0-416b-80e0-d25cc9ab65f8_1663192300019_0 completed in: 111ms

@codecov-commenter
Copy link

codecov-commenter commented Sep 14, 2022

Codecov Report

Merging #9400 (893870f) into master (b777a62) will increase coverage by 0.11%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #9400      +/-   ##
============================================
+ Coverage     69.67%   69.78%   +0.11%     
+ Complexity     4787     4708      -79     
============================================
  Files          1885     1890       +5     
  Lines        100419   100656     +237     
  Branches      15283    15327      +44     
============================================
+ Hits          69966    70245     +279     
+ Misses        25490    25430      -60     
- Partials       4963     4981      +18     
Flag Coverage Δ
integration1 26.02% <100.00%> (-0.02%) ⬇️
integration2 24.78% <100.00%> (+0.03%) ⬆️
unittests1 66.99% <ø> (+0.06%) ⬆️
unittests2 15.33% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../pinot/minion/taskfactory/TaskFactoryRegistry.java 79.10% <100.00%> (+0.64%) ⬆️
...ntroller/helix/core/minion/TaskMetricsEmitter.java 34.88% <0.00%> (-51.17%) ⬇️
.../filter/predicate/InPredicateEvaluatorFactory.java 72.80% <0.00%> (-9.61%) ⬇️
...he/pinot/segment/local/segment/store/IndexKey.java 65.00% <0.00%> (-5.00%) ⬇️
...lter/predicate/NotInPredicateEvaluatorFactory.java 78.35% <0.00%> (-4.48%) ⬇️
...lix/core/realtime/PinotRealtimeSegmentManager.java 76.43% <0.00%> (-4.19%) ⬇️
...lix/core/minion/PinotHelixTaskResourceManager.java 35.15% <0.00%> (-2.73%) ⬇️
...r/validation/RealtimeSegmentValidationManager.java 71.62% <0.00%> (-2.71%) ⬇️
...ders/forward/VarByteChunkMVForwardIndexReader.java 77.57% <0.00%> (-1.87%) ⬇️
...e/pinot/segment/local/io/util/PinotDataBitSet.java 95.62% <0.00%> (-1.46%) ⬇️
... and 39 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Jackie-Jiang Jackie-Jiang added release-notes Referenced by PRs that need attention when compiling the next release notes Configuration Config changes (addition/deletion/change in behavior) feature labels Sep 15, 2022
</Route>
</Routes>
<!-- Created appender TTL -->
<IdlePurgePolicy timeToLive="15" timeUnit="minutes"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understood the rollover if > 19.5MB or rollover if > 50 files.
What's the 15 minutes TTL? will that be enough?
Also is 50 max good enough? We could have a lot more tasks running at once rt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per the doc: 15min TTL is used to close the dynamically created appender, after it's idle for up to TTL.

as to purge old log files, there are multiple options: by max count as in the example here or time based (which may be more suitable for prod use, e.g. to keep log files as long as the task info is kept in helix, which is 24hr by default)

@Jackie-Jiang Jackie-Jiang merged commit 41c4aca into apache:master Sep 15, 2022
@klsince klsince deleted the allow_separate_task_logs branch September 15, 2022 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) feature release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants