deadlock @ multiGPU caffe #3279

Closed
linnanwang opened this issue Nov 4, 2015 · 25 comments

@linnanwang

Hello Guys,

It has come to my attention that the following scenario might happen:

prefetch_full_ might be empty while a solver is trying to retrieve a batch from it, resulting in a wait that blocks forever. What is strange is that there are items in prefetch_free_: from InternalThreadEntry in base_data_layer.cpp, the prefetch thread should be popping a buffer from prefetch_free_ and pushing it into prefetch_full_.

Any ideas why it is happening? Thank you.

@thatguymike
Contributor

The data prefetch layer has an independent thread that takes from the free queue, loads the batch, and pushes to the full queue. So yes, the solver should be waiting until there is a batch to execute.
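
In rough pseudo-C++, the flow looks like this (a simplified sketch with a stand-in BlockingQueue and made-up function names, not the actual Caffe source):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal blocking queue, only to illustrate the handshake.
template <typename T>
class BlockingQueue {
 public:
  void push(const T& t) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(t);
    }
    cond_.notify_one();
  }
  T pop() {  // blocks until an item is available
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return !queue_.empty(); });
    T t = queue_.front();
    queue_.pop();
    return t;
  }

 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable cond_;
};

struct Batch { /* data + labels */ };

BlockingQueue<Batch*> prefetch_free_;  // empty buffers waiting to be filled
BlockingQueue<Batch*> prefetch_full_;  // loaded batches waiting to be consumed

// Producer: the data layer's prefetch thread.
void PrefetchLoop() {
  for (;;) {
    Batch* batch = prefetch_free_.pop();  // wait for a recycled buffer
    // load_batch(batch);                 // IO + decode happen here
    prefetch_full_.push(batch);           // hand the loaded batch to the solver
  }
}

// Consumer: the solver side, once per iteration.
void SolverStep() {
  Batch* batch = prefetch_full_.pop();    // blocks if prefetch is behind
  // forward/backward with the batch ...
  prefetch_free_.push(batch);             // recycle the buffer
}
```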

Are you seeing a deadlock in practice? If so, C++ or Python paths?

@linnanwang
Author

Yes, I saw it in the C++ code inside Caffe; I'm implementing a new solver.
I saw prefetch_full_ decreasing until it reaches 0, and then the wait blocks forever.

My question is: where could prefetch_full_ fail to be populated? Meanwhile, the prefetch_free_ queue keeps getting longer. Any thoughts on this? Thank you.

@linnanwang
Author

I do see a solver pull from prefetch_full_ more than it gives back to prefetch_free_ in an iteration... This is causing the queue to deplete. Any suggestions on how to resolve it?

@thatguymike
Contributor

Queue depletion points to the IO/decode bottleneck. This doesn't mean we don't have a bug somewhere in this code, but I haven't seen a deadlock once the data layer is initialized.

To see if you really have a deadlock, instrument the BasePrefetchingDataLayer::InternalThreadEntry code to check whether it is genuinely waiting on a free prefetch buffer to fill.
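
For example, something along these lines (paraphrased, not the verbatim loop; the size() calls assume the BlockingQueue in your tree exposes size(), drop them otherwise):

```cpp
// Inside BasePrefetchingDataLayer<Dtype>::InternalThreadEntry(), paraphrased:
while (!must_stop()) {
  LOG(INFO) << "prefetch: waiting for a free buffer"
            << " free=" << prefetch_free_.size()
            << " full=" << prefetch_full_.size();
  Batch<Dtype>* batch = prefetch_free_.pop();   // stuck on this pop?
  load_batch(batch);                            // or is IO/decode just slow?
  prefetch_full_.push(batch);
  LOG(INFO) << "prefetch: batch pushed, full=" << prefetch_full_.size();
}
```

If the last message is the "waiting for a free buffer" line while prefetch_free_ still reports items, something is wrong with the queue itself; if the loop stalls after that line, the wait is inside load_batch (i.e. the data reader or IO).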

@thatguymike
Contributor

I managed to trigger a deadlock when messing with the prefetch size, so we do have a bug hiding here.

@linnanwang
Author

Yes, it is because of the blocking-queue implementation. The inter-thread communication inside Caffe via a thread-wait mechanism, in my humble opinion, complicates a simple piece of logic. From an engineering perspective, I prefer the simple but reliable approach, say using a barrier to do the synchronization, as sketched below. (I'm not referring to the blocking queue itself; rather, the NVIDIA multi-GPU Caffe uses waits to sync among multiple GPUs.) Never mind, I still need to figure out why one GPU performs more forward passes than the others.
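
Toy illustration of the barrier idea, using Boost since Caffe already depends on it (not Caffe code; the per-iteration bodies are placeholders):

```cpp
#include <boost/thread.hpp>
#include <boost/thread/barrier.hpp>

int main() {
  const int num_solvers = 4;
  boost::barrier iteration_barrier(num_solvers);

  boost::thread_group solvers;
  for (int i = 0; i < num_solvers; ++i) {
    solvers.create_thread([&iteration_barrier] {
      for (int iter = 0; iter < 10; ++iter) {
        // forward/backward for this GPU's solver would run here
        iteration_barrier.wait();  // nobody proceeds until every solver arrives
        // gradient exchange / parameter update would run here
      }
    });
  }
  solvers.join_all();
  return 0;
}
```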

@ronghanghu ronghanghu self-assigned this Nov 4, 2015
@ronghanghu ronghanghu added the bug label Nov 4, 2015
@ronghanghu
Member

I shall look into this some time next week.

@linnanwang
Author

Thank you. Does this branch of Caffe support multiple GPUs?

@ronghanghu
Member

Apologies, we are going to be quite overwhelmed with the CVPR submission for the next 48 hours. I'll address this issue after that.

@linnanwang
Author

Good luck with your paper.

@thatguymike
Contributor

The plot thickens. The deadlock comes from the interaction between the data reader thread and the solvers initializing. There were changes to the data loader and solver initialization in the final merge compared to the initial work @cypof and I did. I can see things getting stuck on the solvers not interacting with the sync queue.

See the code in DataReader::Body::InternalThreadEntry(); things get stuck for me in the first loop, waiting for the solvers to come up. This code was inserted to address the reproducibility issues and keep the solvers from getting out of sync. Something is busted in the handshake here.

Repro: set PREFETCH_COUNT to 8, 4 GPU solvers, AlexNet training prototxt. It hangs right off the bat.
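
(For reference, PREFETCH_COUNT in the tree I am testing is a compile-time constant on BasePrefetchingDataLayer, so the repro means editing the header and rebuilding; the exact header path and stock value may differ in your checkout.)

```cpp
// In the BasePrefetchingDataLayer declaration (header location varies by checkout):
// static const int PREFETCH_COUNT = 3;   // stock value
static const int PREFETCH_COUNT = 8;      // value that triggers the hang for me
```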

@linnanwang
Author

I don't think a blocking-queue design is a good idea here. The data prefetch thread actually gets blocked by the queue synchronization, and it only fetches data at the start of an iteration, because the data layer needs to pull out a batch and then put it back.

A non-blocking queue is highly recommended. Since you are already using Boost, it has APIs for this.
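
For example, Boost's lock-free single-producer/single-consumer queue; this is only an illustration of the API, not a drop-in patch for Caffe:

```cpp
#include <boost/lockfree/spsc_queue.hpp>
#include <boost/thread/thread.hpp>
#include <iostream>

int main() {
  // Fixed-capacity, lock-free single-producer/single-consumer queue.
  boost::lockfree::spsc_queue<int, boost::lockfree::capacity<8> > batches;

  boost::thread producer([&batches] {
    for (int i = 0; i < 100; ++i) {
      while (!batches.push(i)) { /* queue full: retry or do other work */ }
    }
  });

  boost::thread consumer([&batches] {
    int batch = 0;
    for (int received = 0; received < 100; ) {
      if (batches.pop(batch)) {
        ++received;  // "train" on the batch here
      }              // else: queue empty, spin, yield, or do other work
    }
    std::cout << "consumed 100 batches" << std::endl;
  });

  producer.join();
  consumer.join();
  return 0;
}
```

push() and pop() never block: they return false when the queue is full or empty, so the caller decides whether to spin, yield, or do other work instead of sleeping on a condition variable.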

@bhack
Contributor

bhack commented Nov 7, 2015

@linnanwang See also an old discussion thread starting from #1568 (comment)

@ronghanghu
Member

I am looking into this issue this weekend.

@linnanwang
Author

@ronghanghu Thank you, but I have fixed my particular problem now, so it is okay to leave the current Caffe the way it is.

@thatguymike
Contributor

There is still very much a bug in this code. It looks like it is more or less a cross-merge issue between the original multi-GPU work, how the prefetch worked, and the final committed code. It could also have been caused by a change in how the sub-batches are handled in the final code.

The main issue is that the checked-in code assumes each solver loads a DataReader, and that is not the case: as far as I can tell, the merged code only loads one DataReader per source. Moreover, there is a race in the initialization handshake.

All solver sync should be done at the solver level; nothing should live in the data loaders, or it will break more complex multi-solver setups. E.g. in the current code, all sync is already done at the P2P level.

It will take a little time to rewrite the solver and data_reader setup to fix this and make things clearer.

@ronghanghu
Member

> The main issue is that the checked-in code assumes each solver loads a DataReader, and that is not the case: as far as I can tell, the merged code only loads one DataReader per source. Moreover, there is a race in the initialization handshake.

I see the issue. Thank you!

> All solver sync should be done at the solver level; nothing should live in the data loaders, or it will break more complex multi-solver setups. E.g. in the current code, all sync is already done at the P2P level. It will take a little time to rewrite the solver and data_reader setup to fix this and make things clearer.

I am trying to find a solution and to have discussions with the other developers, which may take a while. Everyone is welcome to provide suggestions.

@linnanwang
Author

But does this branch of Caffe support multi-GPU? NVCaffe is the only multi-GPU Caffe branch I have found so far.

@linnanwang
Author

Hey @thatguymike ,

I saw that you are from NVIDIA. Are you aware of cuBLAS-XT? Do you know anyone from that team?

@ronghanghu
Member

> But does this branch of Caffe support multi-GPU? NVCaffe is the only multi-GPU Caffe branch I have found so far.

Yes, this branch of Caffe supports multi-GPU. Multi-GPU support was merged in #2903.

@linnanwang
Author

@ronghanghu If so, why is there no argument to specify GPUs in the caffe executable, say -gpus?

@ronghanghu
Member

The -gpu argument can take in multiple indices like -gpu 0,1,3 or -gpu all.
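
For illustration, this is roughly how a comma-separated -gpu value maps to a list of device ids (ParseGpuFlag is a made-up helper for this sketch, not the actual code in tools/caffe.cpp):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Turn "-gpu 0,1,3" or "-gpu all" into a list of device ids.
std::vector<int> ParseGpuFlag(const std::string& flag, int device_count) {
  std::vector<int> gpus;
  if (flag == "all") {
    for (int i = 0; i < device_count; ++i) gpus.push_back(i);
  } else if (!flag.empty()) {
    std::stringstream ss(flag);
    std::string id;
    while (std::getline(ss, id, ',')) gpus.push_back(std::stoi(id));
  }
  return gpus;
}

int main() {
  for (int id : ParseGpuFlag("0,1,3", 4)) std::cout << id << " ";
  std::cout << std::endl;  // prints: 0 1 3
  return 0;
}
```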

@linnanwang
Author

I see. Thank you. I would recommend changing it to -gpus, as it is more logical that way.

@thatguymike
Contributor

We had that debate during the original merge. The original multi-GPU work actually had "-gpus".

@shelhamer
Member

Closing now that #4563 is in and this issue was raised with the old parallelism.
