Segmentation Fault with hybrid Overfeat-Alexnet architecture for large input images #1260

Closed
andreesteva opened this issue Oct 10, 2014 · 22 comments

Comments

@andreesteva

I am trying to implement a Krizhevsky (AlexNet) net with the last 3 fully connected layers converted to OverFeat-style 1x1 convolution layers. I can get it to run with input images from 256x256 up to 315x315. Ultimately I'd like the net to accept 400x640 images, but at the moment anything larger than 315x315 causes a segmentation fault.

Does anyone know what might be happening? Below are the solver.prototxt and train_val_segfault.prototxt that I'm using.

Solver.prototxt

net: "models/bvlc_reference_caffenet/train_val.prototxt"

net: "train_val_segfault.prototxt"
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 450000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000

snapshot_prefix: "models/bvlc_reference_caffenet/caffenet_train"

snapshot_prefix: "randcaffenet"
solver_mode: GPU

train_val_segfault.prototxt:
name: "CaffeNet"
layers {
name: "data"
type: DATA
top: "data"
top: "label"
data_param {
source: "random_train_leveldb"
batch_size: 50
}

transform_param {
crop_size: 227
mean_file: "random_image_mean.binaryproto"
mirror: true
}

include: { phase: TRAIN }
}
layers {
name: "data"
type: DATA
top: "data"
top: "label"
data_param {
source: "random_val_leveldb"
batch_size: 50
}

transform_param {
crop_size: 227
mean_file: "random_image_mean.binaryproto"
mirror: false
}

include: { phase: TEST }
}
layers {
name: "conv1"
type: CONVOLUTION
bottom: "data"
top: "conv1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 96
kernel_size: 11
stride: 4
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
name: "relu1"
type: RELU
bottom: "conv1"
top: "conv1"
}
layers {
name: "pool1"
type: POOLING
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
name: "norm1"
type: LRN
bottom: "pool1"
top: "norm1"
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layers {
name: "conv2"
type: CONVOLUTION
bottom: "norm1"
top: "conv2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 256
pad: 2
kernel_size: 5
group: 2
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu2"
type: RELU
bottom: "conv2"
top: "conv2"
}
layers {
name: "pool2"
type: POOLING
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
name: "norm2"
type: LRN
bottom: "pool2"
top: "norm2"
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layers {
name: "conv3"
type: CONVOLUTION
bottom: "norm2"
top: "conv3"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 384
pad: 1
kernel_size: 3
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
name: "relu3"
type: RELU
bottom: "conv3"
top: "conv3"
}
layers {
name: "conv4"
type: CONVOLUTION
bottom: "conv3"
top: "conv4"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 384
pad: 1
kernel_size: 3
group: 2
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu4"
type: RELU
bottom: "conv4"
top: "conv4"
}
layers {
name: "conv5"
type: CONVOLUTION
bottom: "conv4"
top: "conv5"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 256
pad: 1
kernel_size: 3
group: 2
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu5"
type: RELU
bottom: "conv5"
top: "conv5"
}
layers {
name: "pool5"
type: POOLING
bottom: "conv5"
top: "pool5"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
name: "fc6-conv"
type: CONVOLUTION
bottom: "pool5"
top: "fc6-conv"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 4096
kernel_size: 1
weight_filler {
type: "gaussian"
std: 0.005
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu6"
type: RELU
bottom: "fc6-conv"
top: "fc6-conv"
}
layers {
name: "drop6"
type: DROPOUT
bottom: "fc6-conv"
top: "fc6-conv"
dropout_param {
dropout_ratio: 0.5
}
}
layers {
name: "fc7-conv"
type: CONVOLUTION
bottom: "fc6-conv"
top: "fc7-conv"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 4096
kernel_size: 1
weight_filler {
type: "gaussian"
std: 0.005
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu7"
type: RELU
bottom: "fc7-conv"
top: "fc7-conv"
}
layers {
name: "drop7"
type: DROPOUT
bottom: "fc7-conv"
top: "fc7-conv"
dropout_param {
dropout_ratio: 0.5
}
}
layers {
name: "fc8-conv"
type: CONVOLUTION
bottom: "fc7-conv"
top: "fc8-conv"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 1000
kernel_size: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
name: "accuracy"
type: ACCURACY
bottom: "fc8-conv"
bottom: "label"
top: "accuracy"
include: { phase: TEST }
}
layers {
name: "loss"
type: SOFTMAX_LOSS
bottom: "fc8-conv"
bottom: "label"
top: "loss"
}
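
For reference, a minimal Python sketch (assuming Caffe's usual output-size formulas: floor rounding for convolution, ceil rounding for pooling) that traces the spatial size of each blob through the conv/pool stack above. Note that the data layers above crop to 227x227, so crop_size would also have to change for larger inputs to actually reach conv1:

import math

def conv_out(size, kernel, stride=1, pad=0):
    # floor((size + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride=1, pad=0):
    # same formula with ceil rounding
    return int(math.ceil((size + 2 * pad - kernel) / float(stride))) + 1

def trace(input_size):
    s = conv_out(input_size, 11, stride=4)   # conv1
    s = pool_out(s, 3, stride=2)             # pool1
    s = conv_out(s, 5, pad=2)                # conv2
    s = pool_out(s, 3, stride=2)             # pool2
    s = conv_out(s, 3, pad=1)                # conv3
    s = conv_out(s, 3, pad=1)                # conv4
    s = conv_out(s, 3, pad=1)                # conv5
    s = pool_out(s, 3, stride=2)             # pool5
    return s                                 # fc6/7/8-conv use 1x1 kernels, so size is unchanged

for side in (227, 315, 400, 640):
    print('%d -> pool5 output is %dx%d' % (side, trace(side), trace(side)))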

@sguada
Contributor

sguada commented Oct 11, 2014

You are probably running out of memory; try reducing the batch_size.
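
As a rough way to check the out-of-memory hypothesis, a back-of-the-envelope sketch (assuming 4-byte floats and counting forward activations only; weights, diffs, LRN tops, and workspace are ignored, so the real footprint is noticeably larger):

def out_size(size, kernel, stride=1, pad=0):
    return (size + 2 * pad - kernel) // stride + 1   # close enough for an estimate

# (num_output, kernel, stride, pad) for each conv/pool stage of the net above
STAGES = [(96, 11, 4, 0),    # conv1
          (96, 3, 2, 0),     # pool1
          (256, 5, 1, 2),    # conv2
          (256, 3, 2, 0),    # pool2
          (384, 3, 1, 1),    # conv3
          (384, 3, 1, 1),    # conv4
          (256, 3, 1, 1),    # conv5
          (256, 3, 2, 0),    # pool5
          (4096, 1, 1, 0),   # fc6-conv
          (4096, 1, 1, 0),   # fc7-conv
          (1000, 1, 1, 0)]   # fc8-conv

def activation_gb(side, batch=50):
    floats, s = 0, side
    for channels, kernel, stride, pad in STAGES:
        s = out_size(s, kernel, stride, pad)
        floats += channels * s * s
    return floats * batch * 4 / 1e9

for side in (227, 315, 400):
    print('%dx%d crops, batch 50: ~%.2f GB of forward activations' % (side, side, activation_gb(side)))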

Sergio

@sunbaigui

I'm wondering whether this configuration will help to improve accuracy, speed, or memory usage?

@andreesteva
Author

@sguada @sunbaigui
The purpose of this configuration is to reduce memory overhead; I would like to scale it up to 400x640 input images. At the moment, the segmentation fault is not due to memory: with this config I can pass 315x315 (99,225-pixel) images, but not 256x300 (76,800-pixel) images.

@sguada
Contributor

sguada commented Oct 13, 2014

@andreesteva, please take follow-up questions to the caffe-users mailing list.

I don't think there is a bug here, since it works with 315x315 images; more likely there is a problem with your prototxt definition. For instance, when you change the size of the images, do you change the mean_file accordingly? Otherwise you should use mean_values instead, see #1070.
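
If switching away from mean_file, here is a sketch of how to get per-channel values out of the existing mean file (assuming pycaffe is built and on PYTHONPATH); the printed numbers could then be used as the size-independent per-channel mean_value entries discussed in #1070:

import caffe
from caffe.proto import caffe_pb2

blob = caffe_pb2.BlobProto()
with open('random_image_mean.binaryproto', 'rb') as f:
    blob.ParseFromString(f.read())

# blobproto_to_array returns (num, channels, height, width); take the first image
mean = caffe.io.blobproto_to_array(blob)[0]
for c in range(mean.shape[0]):
    print('mean value for channel %d: %.2f' % (c, mean[c].mean()))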

@sguada sguada closed this as completed Oct 13, 2014
@andreesteva
Author

@sguada
There must be a bug. When I run this on an older version of caffe (from the August time frame), it works just fine. This leads me to believe that it's not a memory problem, nor a prototxt definition problem. I am changing the mean_file accordingly, and I still encounter the same segmentation fault. Here are the steps that I am taking:

On old caffe version:
generate random image buffers using matlab: size 480x640 (Success)
create leveldb files, following the imagenet example (Success)
Compute the image mean, using build/tools/compute_image_mean (success)
train caffe (success)

On new caffe version, with cudnn
generate random image buffers using matlab: size 480x640 (Success)
create lmdb files, following the imagenet example (Success)
Compute the image mean, using build/tools/compute_image_mean (success)
train caffe (segfault)

From version to version of caffe, I've encountered times when the examples haven't been fully updated (e.g. the argument syntax for creating leveldb files changed). Could it possibly be due to that?

If helpful, I can upload a package to github which can allow you to recreate the error.

@sguada
Contributor

sguada commented Oct 17, 2014

Could you try the new caffe version without cudnn?
Also, please upload the log file or further info to help clarify your issue.
It could still be a problem of padding misalignment.
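
To illustrate what a padding/rounding misalignment can look like (an illustration only, not a confirmed diagnosis): Caffe's pooling layer rounds the output size up, so a backend that rounds down produces a smaller map for some input sizes, e.g. for the 3x3, stride-2 pools in this net:

import math

def pool_ceil(n, k=3, s=2):
    return int(math.ceil((n - k) / float(s))) + 1   # ceil rounding, as in Caffe's pooling layer

def pool_floor(n, k=3, s=2):
    return (n - k) // s + 1                         # floor-rounding alternative

for n in range(26, 40):
    c, f = pool_ceil(n), pool_floor(n)
    note = '  <-- the two roundings disagree' if c != f else ''
    print('pool input %d: ceil gives %d, floor gives %d%s' % (n, c, f, note))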

Sergio


@andreesteva
Author

I'll try a version without cudnn.

In the meantime, I've uploaded the code I used for the old caffe and the new caffe version to
https://github.com/andreesteva/caffe-memory-test

It includes all the training and validation images already, as well as the leveldb and lmdb files used.

Copy either folder into $CAFFE_ROOT/examples/ and then run

bash run_training.sh solver_alexoverfeat.prototxt

to recreate the error I'm seeing.

@sguada
Contributor

sguada commented Oct 17, 2014

I'm sorry, but I'm not planning to recreate your error; can you just post the log?

Sergio


@andreesteva
Author

Sure, is this a file that is generated somewhere, or just the standard output when you run caffe?

@sguada
Contributor

sguada commented Oct 17, 2014

It is the output generated when you run caffe (it goes to stderr), and you can redirect it with:

caffe ... 2> log.file

Sergio


@andreesteva
Author

OK, here it is:
https://github.com/andreesteva/caffe-memory-test/blob/master/log_segfault.txt

@sguada
Contributor

sguada commented Oct 17, 2014

The segfault message alone is not very informative. Can you make clean, then recompile with DEBUG := 1 in Makefile.config, and double-check that you have done

export GLOG_logtostderr=1

Also try using gdb to find where the segmentation fault is happening:

gdb --args caffe [ARGS]

Sergio


@andreesteva
Author

OK, I'll try that.
Where do you suggest putting the 'export GLOG_logtostderr=1' command?

@sguada
Contributor

sguada commented Oct 17, 2014

If you use bash, put it in your .bashrc so it is there all the time.

Sergio


@andreesteva
Author

OK, I put up the log file:
log_segfault_debug.txt

I'm trying gdb now.

@andreesteva
Author

I also just put up the gdb output, which isn't very illuminating either. I hope you can make more sense of it than I can.

@sguada
Contributor

sguada commented Oct 17, 2014

This warning makes me guess that your LMDB is empty. Could you check its size and create it again?

I1017 16:18:34.443876 16468 data_layer.cpp:195] Restarting data prefetching from start.
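
A quick way to check whether the LMDB really has entries (a sketch using the py-lmdb package; the path below is a placeholder for whatever data_param.source points to):

import lmdb

env = lmdb.open('random_train_lmdb', readonly=True)   # placeholder path
print('number of entries: %d' % env.stat()['entries'])

with env.begin() as txn:
    cursor = txn.cursor()
    if cursor.first():
        key, value = cursor.item()
        print('first key: %s, value size: %d bytes' % (key, len(value)))
env.close()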

Sergio


@andreesteva
Author

I checked the file size and even tried increasing the number of images generated. The lmdb_folder/data.mdb files are substantial, and when I rerun gdb --args caffe ... I end up with exactly the same output.

@sguada
Contributor

sguada commented Oct 18, 2014

Try with backend: LEVELDB in data_param instead.

Sergio


@andreesteva
Author

The same issue persists.

@sguada
Contributor

sguada commented Oct 18, 2014

I don't know; please post your question to the caffe-users mailing list.

@andreesteva
Author

That's alright. Thank you so much for your help regardless.
