CUDATreeLearner: free GPU memory in destructor if any allocated #4963

Merged Feb 20, 2022 · 6 commits · Showing changes from 3 commits
src/treelearner/cuda_tree_learner.cpp (37 additions, 0 deletions)
@@ -63,6 +63,43 @@ CUDATreeLearner::CUDATreeLearner(const Config* config)
}

CUDATreeLearner::~CUDATreeLearner() {
#pragma omp parallel for schedule(static, num_gpu_)
Collaborator (shiyu1994) commented:

With a static schedule and a chunk size of num_gpu_, I think only one thread will be used. So can we simply remove this line?

Contributor (author) denmoroz replied on Feb 19, 2022:

@shiyu1994 To be honest, I just copy-pasted the code from here, which uses omp parallel, so I assumed it was done intentionally. My expectation was that cleanup for each GPU device would run in parallel for efficiency (when num_gpu_ > 1). Do you think we should actually remove this?

Collaborator (shiyu1994) replied:

After experimenting, I find that schedule(static, num_gpu_) results in only a single thread: with a chunk size of num_gpu_ and num_gpu_ total iterations, the whole loop forms a single chunk, so the for loop is not parallelized. I think it is OK to keep the omp parallel in this PR, and we can fix this in another PR. Thank you!
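
To illustrate the scheduling point above, here is a minimal standalone sketch (an illustration, not code from this PR): with schedule(static, num_gpu) over num_gpu iterations, the first chunk already covers every iteration, so thread 0 runs them all, whereas schedule(static, 1) would hand one iteration to each thread.

```cpp
// Standalone demo of the schedule(static, chunk) behavior discussed above.
// Build with: g++ -fopenmp demo.cpp
#include <cstdio>
#include <omp.h>

int main() {
  const int num_gpu = 4;  // stand-in for num_gpu_
  // Chunk size == total iteration count: the entire loop is one chunk,
  // so every iteration reports thread 0.
  #pragma omp parallel for schedule(static, num_gpu)
  for (int device_id = 0; device_id < num_gpu; ++device_id) {
    std::printf("device %d handled by thread %d\n",
                device_id, omp_get_thread_num());
  }
  return 0;
}
```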


for (int device_id = 0; device_id < num_gpu_; ++device_id) {
CUDASUCCESS_OR_FATAL(cudaSetDevice(device_id));

if (device_features_[device_id] != NULL) {
CUDASUCCESS_OR_FATAL(cudaFree(device_features_[device_id]));
}

if (device_gradients_[device_id] != NULL) {
CUDASUCCESS_OR_FATAL(cudaFree(device_gradients_[device_id]));
}

if (device_hessians_[device_id] != NULL) {
CUDASUCCESS_OR_FATAL(cudaFree(device_hessians_[device_id]));
}

if (device_feature_masks_[device_id] != NULL) {
CUDASUCCESS_OR_FATAL(cudaFree(device_feature_masks_[device_id]));
}

if (device_data_indices_[device_id] != NULL) {
CUDASUCCESS_OR_FATAL(cudaFree(device_data_indices_[device_id]));
}

if (sync_counters_[device_id] != NULL) {
CUDASUCCESS_OR_FATAL(cudaFree(sync_counters_[device_id]));
}

if (device_subhistograms_[device_id] != NULL) {
CUDASUCCESS_OR_FATAL(cudaFree(device_subhistograms_[device_id]));
}

if (device_histogram_outputs_[device_id] != NULL) {
CUDASUCCESS_OR_FATAL(cudaFree(device_histogram_outputs_[device_id]));
}
}
}
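
As a possible follow-up (a sketch, not part of this PR), the repeated check-then-free pattern could be factored into a small helper; FreeIfAllocated is a hypothetical name, while CUDASUCCESS_OR_FATAL is the error-checking macro this file already uses.

```cpp
#include <cuda_runtime.h>

// Hypothetical helper (illustration only): free a device pointer if it was
// allocated, then reset it so a repeated call cannot double-free.
template <typename T>
void FreeIfAllocated(T*& device_ptr) {
  if (device_ptr != nullptr) {
    CUDASUCCESS_OR_FATAL(cudaFree(device_ptr));
    device_ptr = nullptr;
  }
}

// The destructor's loop body would then shrink to calls like:
//   FreeIfAllocated(device_features_[device_id]);
//   FreeIfAllocated(device_gradients_[device_id]);
```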

