
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it #14751

Merged
merged 11 commits into rapidsai:branch-24.02 on Jan 22, 2024

Conversation

etseidl
Contributor

@etseidl etseidl commented Jan 12, 2024

Description

JSON_PATH_TEST and ROW_CONVERSION_TEST were not using the CUDF_TEST_PROGRAM_MAIN macro, and thus were not picking up the GTEST_CUDF_RMM_MODE environment variable during nightly testing.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner January 12, 2024 01:02

copy-pr-bot bot commented Jan 12, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 12, 2024
@etseidl
Contributor Author

etseidl commented Jan 12, 2024

CC @davidwendt

@vyasr
Contributor

vyasr commented Jan 12, 2024

/ok to test

@vyasr vyasr added bug Something isn't working non-breaking Non-breaking change labels Jan 12, 2024
@vyasr
Contributor

vyasr commented Jan 12, 2024

It looks like with the change to using a pool you may need to up the GPU percentages used for both Parquet and JSON. Maybe try 50 first and see if that works? I don't know the tests well enough to have any intuition for the data sizes.

@davidwendt
Contributor

It looks like with the change to using a pool you may need to up the GPU percentages used for both Parquet and JSON. Maybe try 50 first and see if that works? I don't know the tests well enough to have any intuition for the data sizes.

That is strange since neither of these two are part of those tests.
Maybe the JSON_PATH_TEST and ROW_CONVERSION_TEST need their own GPU settings instead?

@etseidl
Contributor Author

etseidl commented Jan 12, 2024

That is strange since neither of these two are part of those tests.
Maybe the JSON_PATH_TEST and ROW_CONVERSION_TEST need their own GPU settings instead?

Did some local profiling, and it looks like ROW_CONVERSION_TEST blows up the memory pool right at the end (the last one or two tests it runs). I see that ROW_CONVERSION_TEST ends just after the two tests that have the memory error. If I run ROW_CONVERSION_TEST with the cuda MR, I don't see the memory explosion. I'll have to look at the ManyStrings test...maybe it's fragmenting the pool memory?

@davidwendt
Contributor

That is strange since neither of these two are part of those tests.
Maybe the JSON_PATH_TEST and ROW_CONVERSION_TEST need their own GPU settings instead?

Did some local profiling, and it looks like ROW_CONVERSION_TEST blows up the memory pool right at the end (the last one or two tests it runs). I see that ROW_CONVERSION_TEST ends just after the two tests that have the memory error. If I run ROW_CONVERSION_TEST with the cuda MR, I don't see the memory explosion. I'll have to look at the ManyStrings test...maybe it's fragmenting the pool memory?

The ROW_CONVERSION_TEST was recently moved from the JNI source tree to libcudf. My understanding is the move is temporary until the build team can figure out some CUDA 12 build issues within Spark. Given that, I would be ok with changes here to make it use less resources (and run faster as well).

@etseidl
Contributor Author

etseidl commented Jan 12, 2024

There are a bunch of things going on here. One is that the random number generator isn't seeded, so the order of tests, or selecting a subset, can lead to different results. For instance, on my machine, if I run just the RowToColumnTests suite, I get an OOM error in ManyStrings regardless of which memory resource I use. If I run the full set of tests, or just ManyStrings, no OOM. If I add a srand() call, then no OOM.

But the real culprit is the RowToColumnTests.Biggest test which starts the explosive growth of the pooled MR.

The ROW_CONVERSION_TEST was recently moved from the JNI source tree to libcudf. My understanding is the move is temporary until the build team can figure out some CUDA 12 build issues within Spark. Given that, I would be ok with changes here to make it use less resources (and run faster as well).

Well, if that's the case should we just revert to the default MR for that test? Or modify the tests to use less RAM?

@davidwendt
Contributor

davidwendt commented Jan 12, 2024

Well, if that's the case should we just revert to the default MR for that test? Or modify the tests to use less RAM?

I'd prefer to modify the tests to use less RAM. How feasible is that?

@vyasr
Contributor

vyasr commented Jan 12, 2024

That is strange since neither of these two are part of those tests. Maybe the JSON_PATH_TEST and ROW_CONVERSION_TEST need their own GPU settings instead?

It could be (and sounds like) the case that the tests changed in this PR are now using up enough memory that other tests (the ones I mentioned) are failing. So yes, we could change the setting for those. OTOH if reducing the memory footprint of the ROW_CONVERSION_TEST is feasible that sounds even better.

@etseidl
Contributor Author

etseidl commented Jan 12, 2024

@ttnghia can you take a look and see if it's ok to reduce the number of rows or columns in the big string tests?

@etseidl
Contributor Author

etseidl commented Jan 12, 2024

I reduced the row counts for two tests, and now the entire test stays within 6GB (when starting with a 3GB allocation).

@ttnghia
Contributor

ttnghia commented Jan 12, 2024

@ttnghia can you take a look and see if it's ok to reduce the number of rows or columns in the big string tests?

I'm not sure about this. @hyperbolic2346 what do you think?

@davidwendt
Contributor

/ok to test

@@ -934,6 +935,8 @@ TEST_F(RowToColumnTests, BigStrings)

TEST_F(RowToColumnTests, ManyStrings)
{
// this test is very sensitive, so seed the random number generator
Contributor

Sensitive in what way?

Contributor Author

Prone to blowing up if too many instances of the largest string are used. Seeding the RNG prevents that.

Contributor Author

@hyperbolic2346 did you want a change to the comment here?

Contributor

It would be nice. If I have the question others will as well.

@davidwendt
Contributor

/ok to test

@davidwendt
Contributor

/ok to test

@hyperbolic2346
Contributor

@ttnghia can you take a look and see if it's ok to reduce the number of rows or columns in the big string tests?

I'm not sure about this. @hyperbolic2346 what do you think?

I think it is fine. The test sizes were somewhat arbitrarily chosen; they only need to be large enough to uncover bugs. Given that this is temporary and 3 GB is most likely large enough, I am fine with the change.

Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Thanks!

@davidwendt
Contributor

/ok to test

@davidwendt
Contributor

/merge

@rapids-bot rapids-bot bot merged commit b1468a5 into rapidsai:branch-24.02 Jan 22, 2024
66 of 67 checks passed
@etseidl etseidl deleted the gtest_mains branch January 22, 2024 23:08
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

5 participants