
[SEDONA-228] Standardize logging modules #743

Merged: 2 commits merged into apache:master from standardize-logging on Jan 5, 2023

Conversation

Kimahriman (Contributor) commented Jan 2, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

Standardize on using log4j 2 for logging. Currently the logging modules are pulled in from Spark, which uses log4j 1 or 2 depending on the Spark version. The common module and Flink module have no logging available to them. Both Flink and Spark use log4j 2.x in their latest versions.

Main points:

  • Add log4j-1.2-api as a provided dependency for all modules. Currently all the logging uses the log4j 1.x API, so this keeps that code compatible while log4j 2.x runs underneath (which is what actually happens at runtime with recent Flink and Spark). Long term, the logging should probably be switched to SLF4J so the underlying implementation matters less (I think this is also what Flink and Spark actually use)
  • Add log4j-core and log4j-slf4j-impl as test dependencies so that log4j 2.x is used for logging in all tests
  • Add a log4j2.properties file as a test resource to all projects so logs are written to target/unit-tests.log, which reduces the verbosity of tests when running locally and makes it easier to find which tests fail in the CI logs. This was copied from how the Spark repo does it (see the sketch after this list)
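
For reference, a minimal sketch of what such a log4j2.properties test resource typically looks like, loosely modeled on Spark's; the levels, appender name, and pattern here are illustrative assumptions, not the exact file added by this PR:

```properties
# Illustrative sketch only; not the exact file from this PR.
# Send test logging to target/unit-tests.log instead of the console.
rootLogger.level = info
rootLogger.appenderRef.file.ref = File

appender.file.type = File
appender.file.name = File
appender.file.fileName = target/unit-tests.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
```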

How was this patch tested?

Just logging changes, existing tests.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the docs.

jiayuasu (Member) commented Jan 3, 2023

@Kimahriman
I don't think it is a good idea to include logging dependencies in Sedona. Since Spark and Flink already package their own versions of log4j or slf4j, including these dependencies ourselves is likely to create conflicts in the future. This will introduce additional complexity to the project management and could cause bugs.

For the common module, I think we should just use the regular Java exception mechanism and let sedona-spark and sedona-flink handle exceptions using the log4j/slf4j packaged in Spark and Flink.

Please correct me if I am wrong :-)

Kimahriman (Contributor, Author) commented Jan 4, 2023

> @Kimahriman I don't think it is a good idea to include logging dependencies in Sedona. Since Spark and Flink already package their own versions of log4j or slf4j, including these dependencies ourselves is likely to create conflicts in the future. This will introduce additional complexity to the project management and could cause bugs.

Nothing is added as a compile dependency; most of this is to standardize which framework gets used during the tests. For example, several of the test base classes have log4j 1.x code to reduce the logging level for various packages. I don't think these calls do anything for current Flink or Spark versions, because both use log4j 2.x as the logging implementation, but they include log4j-1.2-api as a dependency so the log4j 1.x code still compiles. I'm not sure what the limitations of this bridge API are.
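
For context, the kind of log4j 1.x call meant here looks roughly like this; the class and package names are made up for illustration, and with the log4j-1.2-api bridge on the classpath such a call compiles against the 1.x API but is handled by the log4j 2.x core at runtime:

```java
// Hypothetical illustration; not the actual Sedona test code.
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public abstract class ExampleTestBase {
    static {
        // Quiet a chatty package during tests. Through the log4j-1.2-api
        // bridge this 1.x call is forwarded to the log4j 2.x implementation.
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
    }
}
```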

As I mentioned in the description, the real thing that needs to be done is to migrate all logging to SLF4J. Currently, if Spark or Flink dropped the log4j-1.2-api dependency, I believe Sedona would no longer work. This is a step towards that, with two main parts:

  • The two test dependencies (log4j-core and log4j-slf4j-impl) are added to make sure all tests run with log4j 2.x, so that we can use a log4j2.properties file to configure them
  • log4j-1.2-api is added purely for older versions of Spark, since we exclude the log4j 1.x dependencies to make sure 2.x is used for the tests. Because it provides the log4j 1.x API that the Sedona code currently uses, it has to be a provided dependency rather than a test dependency. If Sedona only used SLF4J for logging this wouldn't be necessary, as the 1.x to 2.x change would be seamless and we wouldn't need to exclude the SLF4J dependency from Spark (the dependency shape is sketched after this list)
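
Roughly, that dependency shape would look like the following in a module pom.xml; the coordinates are the standard Log4j 2 ones, but the version property is an assumption and this is not the literal diff from the PR:

```xml
<!-- Sketch only; the version property and exact placement are assumptions. -->
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-1.2-api</artifactId>
  <version>${log4j.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-core</artifactId>
  <version>${log4j.version}</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-slf4j-impl</artifactId>
  <version>${log4j.version}</version>
  <scope>test</scope>
</dependency>
```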

Also part of this: currently the Flink module doesn't support any form of logging during the tests, because the Flink dependencies that are currently included don't provide any of the logging modules the way the Spark dependencies do.

With this in place, going forward we could easily enable logging in the common and flink modules by adding slf4j-api as a provided dependency. That would let the code compile; with the two test dependencies mentioned above, common module tests could properly output logs, and at runtime slf4j-api would be provided directly by either Flink or Spark. This matters especially if/as more of the non-Spark-related core and sql module code gets ported to the common module.
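
As a minimal sketch, assuming slf4j-api were added as a provided dependency, common module code could log like this (the class name is hypothetical):

```java
// Hypothetical example; not actual Sedona code. Assumes slf4j-api is on the
// compile classpath, with the binding (log4j-slf4j-impl in tests, whatever
// Spark or Flink ships at runtime) deciding where the output goes.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CommonModuleExample {
    private static final Logger LOG = LoggerFactory.getLogger(CommonModuleExample.class);

    public void process(String input) {
        LOG.warn("Falling back to a slower code path for input: {}", input);
    }
}
```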

Alternative options that change fewer things but still work toward that goal:

  • Only include these dependencies for Flink so that the Flink tests can actually produce logs
  • Either don't worry about configuring logging for the Spark tests across all the versions, or also add a log4j.properties file mirroring the log4j2.properties file so both old and new versions of Spark get similar settings applied (either for reducing the verbosity of certain modules, or for doing what I have done here, which is to write the unit test logs to a file instead of the console)

jiayuasu added this to the sedona-1.4.0 milestone on Jan 5, 2023
jiayuasu (Member) commented Jan 5, 2023

@Kimahriman Thank you for your great explanation!

jiayuasu merged commit d2d5336 into apache:master on Jan 5, 2023
jiayuasu deleted the standardize-logging branch on January 5, 2023 at 05:57
jiayuasu (Member) commented Jan 5, 2023

@Kimahriman One thing I just realized: we should probably just print logging to the console (stdout and stderr) directly. Otherwise, users won't be able to see the logging unless they configure log4j explicitly. It would be better if they could see those WARN/ERROR messages directly. What do you think?

Kimahriman (Contributor, Author) commented:
Do you mean for tests or for real use?

For tests, the logs should all go to target/unit-tests.log. I copied this from Spark so the stdout/stderr of the tests is less verbose and mostly just shows passed/failed tests. Some of the CI test logs were so verbose that it was hard to find which test actually failed. I think the Python tests are probably still pretty verbose.

For real use, logging should keep working exactly as it did, automatically using the logging frameworks provided by Flink and Spark. If someone were to use the common module directly without Flink or Spark, they would need to provide their own logging configuration.

jiayuasu (Member) commented Jan 5, 2023

Great explanation! Thank you!
