Data source identification breaks the loading of multiple instances of the same net #3108

beniz · 2015-09-22T13:01:03Z

Source identification at https://github.com/BVLC/caffe/blob/master/include/caffe/data_reader.hpp#L68 was introduced by bcc8f50 and triggers a fatal check at https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L98 whenever two nearly identical nets train concurrently (e.g. on a single GPU).

The problem occurs if two nets are trained concurrently and share:

- layer names
- data source

This typically occurs when training several nets with different layer parameters but identical source and layer names.

My current solution to this problem is to enrich the source identification routine with a hash of the running thread, but my understanding is that it might break the original detection of identical sources from within the same net. For this reason, I am holding off on a PR until I have gathered more thoughts on this issue.
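To make the thread-based idea above concrete, here is a minimal Python sketch. It is not Caffe's code and not the actual patch: `source_key` and `thread_source_key` are hypothetical helpers, and the baseline key is assumed to be built from the layer name and the data source only. The sketch shows that two threads asking for the same layer name and source get equal keys in the baseline scheme and distinct keys once the thread id is part of the key.

```python
import threading


def source_key(layer_name, source):
    # Assumed baseline: a reader is identified by layer name and source only.
    return (layer_name, source)


def thread_source_key(layer_name, source):
    # Hypothetical variant: also qualify the key with the calling thread's id.
    return (threading.get_ident(), layer_name, source)


def key_in_new_thread(make_key, layer_name, source):
    # Compute a key from inside a freshly started thread and return it.
    result = []
    worker = threading.Thread(
        target=lambda: result.append(make_key(layer_name, source))
    )
    worker.start()
    worker.join()
    return result[0]


# Same layer name and source, requested from the main thread and a worker.
main_plain = source_key("data", "train_lmdb")
other_plain = key_in_new_thread(source_key, "data", "train_lmdb")

main_threaded = thread_source_key("data", "train_lmdb")
other_threaded = key_in_new_thread(thread_source_key, "data", "train_lmdb")

print(main_plain == other_plain)        # keys are equal across threads
print(main_threaded == other_threaded)  # keys differ across threads
```

In this sketch, two requests made from the same thread still produce equal keys, so only cross-thread requests are separated. Whether that matches the sharing behaviour the original identification was meant to provide is exactly the open question raised above.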


mohomran · 2015-09-22T14:02:38Z

Different symptom but same underlying issue reported here: #3037

longjon · 2015-11-05T09:41:24Z

I can confirm that this is a bug in need of attention.

This breaks multiple nets using the same source (even, e.g., reloading a net from an interactive Python session).
The encoding of sources in strings is bogus and not necessary; e.g., ":" is a valid character in both layer names and file names, so spurious collisions are possible.
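The collision risk can be illustrated with a short Python sketch. It assumes the key is the layer name and the source joined with ":", which is what the comment above implies. `string_key` and `tuple_key` are hypothetical helpers for illustration, not Caffe functions.

```python
def string_key(layer_name, source):
    # Assumed encoding: layer name and source joined by ":".
    return layer_name + ":" + source


def tuple_key(layer_name, source):
    # Alternative: keep the two fields separate instead of encoding them.
    return (layer_name, source)


# Two different (layer name, source) pairs that both contain ":".
first = ("data:train", "lmdb")
second = ("data", "train:lmdb")

collides_as_string = string_key(*first) == string_key(*second)
collides_as_tuple = tuple_key(*first) == tuple_key(*second)

print(collides_as_string)  # True: both encode to "data:train:lmdb"
print(collides_as_tuple)   # False: the pairs stay distinct
```

Keeping the fields as a structured key avoids the ambiguity without having to reserve or escape a separator character.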

beniz · 2015-11-05T10:55:34Z

Then for the sake of discussion, here is my fix from last month: jolibrain@70fd4f7

One of the many reasons you may not want this in vanilla Caffe is that it requires a C++11 compiler.

tarunsharma1 · 2016-07-23T00:23:08Z

So I have the same issue...is there a fix yet?

beniz · 2016-07-23T08:21:21Z

@tarunsharma1 if you use the fork I maintain at https://github.com/beniz/caffe it should work fine. This fork remains up to date with master with a short delay. You'd need a C++11 compiler however.

tarunsharma1 · 2016-07-23T15:30:27Z

This is for other or new users who hit the same issue and want a quick, easy hack around it. It is not a permanent solution.

It turns out that the issue is with opening the same lmdb twice, irrespective of whether you use CPU or GPU. A quick fix is to make a copy of your lmdb under a different name and point each of your two networks at its own lmdb.

Does not work
Net1 -> train_lmdb
Net2 -> train_lmdb

Works
Net1 -> train_lmdb
Net2 -> train_lmdb_copy
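The copy step can be scripted. The following Python sketch assumes the lmdb is a directory on disk and that no process is writing to it while it is copied. `copy_lmdb` is a hypothetical helper name, and the paths are only examples.

```python
import shutil
from pathlib import Path


def copy_lmdb(src, dst):
    """Copy an lmdb directory to a new location and return the new path.

    Assumes `src` is a directory and that nothing is writing to it.
    Raises FileExistsError if `dst` already exists, so an existing
    copy is never silently overwritten.
    """
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"lmdb directory not found: {src}")
    shutil.copytree(src, dst)
    return dst
```

For example, `copy_lmdb("train_lmdb", "train_lmdb_copy")` would create the second database, after which the data layer of Net2 can point its source at `train_lmdb_copy`. The cost of this workaround is a second full copy of the dataset on disk.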

beniz · 2016-07-26T21:13:41Z

So FTR, I've tested my branch against #3037 and it doesn't fix the problem there. I believe this is because my fix uses the thread id, which fixes the issue when training multiple models from the same data source using multiple threads, not from the same inner model.

shelhamer · 2017-04-14T04:12:04Z

Fixed by the parallelism reformation in #4563

longjon added the bug label Nov 5, 2015

longjon mentioned this issue Nov 5, 2015

MNIST Autoencoder Example hangs on data layer lock #3037

Closed

paulfitz mentioned this issue Dec 5, 2015

mnist_autoencoder.prototxt lock bug #3373

Open

shelhamer changed the title ~~Data source identification breaks the training of multiple instances of the same net~~ Data source identification breaks the loading of multiple instances of the same net Mar 31, 2016

shelhamer mentioned this issue Mar 31, 2016

Open two networks in the same python script #3923

Closed

shelhamer assigned cypof Jul 23, 2016

shelhamer closed this as completed Apr 14, 2017
