Segmentation Fault with hybrid Overfeat-Alexnet architecture for large input images #1260

Closed
andreesteva opened this issue Oct 10, 2014 · 22 comments

Comments

@andreesteva

I am trying to implement a Krizhevsky (AlexNet) net with the last 3 fully connected layers converted to OverFeat-style 1x1 convolution layers. I can get it to run with input images from 256x256 up to 315x315. Ultimately I'd like the net to accept 400x640 images, but at the moment anything larger than 315x315 causes a segmentation fault.

Does anyone know what might be happening? Below are the solver.prototxt and train_val_segfault.prototxt that I'm using.

Solver.prototxt

net: "models/bvlc_reference_caffenet/train_val.prototxt"

net: "train_val_segfault.prototxt"
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 450000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000

snapshot_prefix: "models/bvlc_reference_caffenet/caffenet_train"

snapshot_prefix: "randcaffenet"
solver_mode: GPU

train_val_segfault.prototxt:
name: "CaffeNet"
layers {
name: "data"
type: DATA
top: "data"
top: "label"
data_param {
source: "random_train_leveldb"
batch_size: 50
}

transform_param {
crop_size: 227
mean_file: "random_image_mean.binaryproto"
mirror: true
}

include: { phase: TRAIN }
}
layers {
name: "data"
type: DATA
top: "data"
top: "label"
data_param {
source: "random_val_leveldb"
batch_size: 50
}

transform_param {
crop_size: 227
mean_file: "random_image_mean.binaryproto"
mirror: false
}

include: { phase: TEST }
}
layers {
name: "conv1"
type: CONVOLUTION
bottom: "data"
top: "conv1"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 96
kernel_size: 11
stride: 4
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
name: "relu1"
type: RELU
bottom: "conv1"
top: "conv1"
}
layers {
name: "pool1"
type: POOLING
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
name: "norm1"
type: LRN
bottom: "pool1"
top: "norm1"
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layers {
name: "conv2"
type: CONVOLUTION
bottom: "norm1"
top: "conv2"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 256
pad: 2
kernel_size: 5
group: 2
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu2"
type: RELU
bottom: "conv2"
top: "conv2"
}
layers {
name: "pool2"
type: POOLING
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
name: "norm2"
type: LRN
bottom: "pool2"
top: "norm2"
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layers {
name: "conv3"
type: CONVOLUTION
bottom: "norm2"
top: "conv3"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 384
pad: 1
kernel_size: 3
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
name: "relu3"
type: RELU
bottom: "conv3"
top: "conv3"
}
layers {
name: "conv4"
type: CONVOLUTION
bottom: "conv3"
top: "conv4"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 384
pad: 1
kernel_size: 3
group: 2
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu4"
type: RELU
bottom: "conv4"
top: "conv4"
}
layers {
name: "conv5"
type: CONVOLUTION
bottom: "conv4"
top: "conv5"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 256
pad: 1
kernel_size: 3
group: 2
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu5"
type: RELU
bottom: "conv5"
top: "conv5"
}
layers {
name: "pool5"
type: POOLING
bottom: "conv5"
top: "pool5"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
name: "fc6-conv"
type: CONVOLUTION
bottom: "pool5"
top: "fc6-conv"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 4096
kernel_size: 1
weight_filler {
type: "gaussian"
std: 0.005
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu6"
type: RELU
bottom: "fc6-conv"
top: "fc6-conv"
}
layers {
name: "drop6"
type: DROPOUT
bottom: "fc6-conv"
top: "fc6-conv"
dropout_param {
dropout_ratio: 0.5
}
}
layers {
name: "fc7-conv"
type: CONVOLUTION
bottom: "fc6-conv"
top: "fc7-conv"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 4096
kernel_size: 1
weight_filler {
type: "gaussian"
std: 0.005
}
bias_filler {
type: "constant"
value: 1
}
}
}
layers {
name: "relu7"
type: RELU
bottom: "fc7-conv"
top: "fc7-conv"
}
layers {
name: "drop7"
type: DROPOUT
bottom: "fc7-conv"
top: "fc7-conv"
dropout_param {
dropout_ratio: 0.5
}
}
layers {
name: "fc8-conv"
type: CONVOLUTION
bottom: "fc7-conv"
top: "fc8-conv"
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 1000
kernel_size: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
name: "accuracy"
type: ACCURACY
bottom: "fc8-conv"
bottom: "label"
top: "accuracy"
include: { phase: TEST }
}
layers {
name: "loss"
type: SOFTMAX_LOSS
bottom: "fc8-conv"
bottom: "label"
top: "loss"
}
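
For reference, a minimal Python sketch (assuming Caffe's usual output-size formulas: floor rounding for convolution, ceil rounding for pooling) that traces the spatial size of each blob through the conv/pool stack above. Note that the data layers above crop to 227x227, so crop_size would also have to change for larger inputs to actually reach conv1:

import math

def conv_out(size, kernel, stride=1, pad=0):
    # floor((size + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride=1, pad=0):
    # same formula with ceil rounding
    return int(math.ceil((size + 2 * pad - kernel) / float(stride))) + 1

def trace(input_size):
    s = conv_out(input_size, 11, stride=4)   # conv1
    s = pool_out(s, 3, stride=2)             # pool1
    s = conv_out(s, 5, pad=2)                # conv2
    s = pool_out(s, 3, stride=2)             # pool2
    s = conv_out(s, 3, pad=1)                # conv3
    s = conv_out(s, 3, pad=1)                # conv4
    s = conv_out(s, 3, pad=1)                # conv5
    s = pool_out(s, 3, stride=2)             # pool5
    return s                                 # fc6/7/8-conv use 1x1 kernels, so size is unchanged

for side in (227, 315, 400, 640):
    print('%d -> pool5 output is %dx%d' % (side, trace(side), trace(side)))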

@sguada
Contributor

sguada commented Oct 11, 2014

You are probably running out of memory; try reducing the batch_size.
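
As a rough way to check the out-of-memory hypothesis, a back-of-the-envelope sketch (assuming 4-byte floats and counting forward activations only; weights, diffs, LRN tops, and workspace are ignored, so the real footprint is noticeably larger):

def out_size(size, kernel, stride=1, pad=0):
    return (size + 2 * pad - kernel) // stride + 1   # close enough for an estimate

# (num_output, kernel, stride, pad) for each conv/pool stage of the net above
STAGES = [(96, 11, 4, 0),    # conv1
          (96, 3, 2, 0),     # pool1
          (256, 5, 1, 2),    # conv2
          (256, 3, 2, 0),    # pool2
          (384, 3, 1, 1),    # conv3
          (384, 3, 1, 1),    # conv4
          (256, 3, 1, 1),    # conv5
          (256, 3, 2, 0),    # pool5
          (4096, 1, 1, 0),   # fc6-conv
          (4096, 1, 1, 0),   # fc7-conv
          (1000, 1, 1, 0)]   # fc8-conv

def activation_gb(side, batch=50):
    floats, s = 0, side
    for channels, kernel, stride, pad in STAGES:
        s = out_size(s, kernel, stride, pad)
        floats += channels * s * s
    return floats * batch * 4 / 1e9

for side in (227, 315, 400):
    print('%dx%d crops, batch 50: ~%.2f GB of forward activations' % (side, side, activation_gb(side)))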

Sergio

@sunbaigui

I'm wondering whether this configuration will help to improve accuracy, speed, or memory usage?

@andreesteva
Author

@sguada @sunbaigui
The purpose of this configuration is to reduce memory overhead; I would like to scale it up to 400x640 input images. At the moment, the segmentation fault is not due to memory: with this config I can pass 315x315 (99,225-pixel) images, but not 256x300 (76,800-pixel) images.

@sguada
Contributor

sguada commented Oct 13, 2014

@andreesteva, please take follow-up questions to the caffe-users mailing list.

I don't think there is a bug here, since it works with 315x315 images; more likely there is a problem with your prototxt definition. For instance, when you change the size of the images, do you change the mean_file accordingly? Otherwise you should use mean_values instead, see #1070.
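
If switching away from mean_file, here is a sketch of how to get per-channel values out of the existing mean file (assuming pycaffe is built and on PYTHONPATH); the printed numbers could then be used as the size-independent per-channel mean_value entries discussed in #1070:

import caffe
from caffe.proto import caffe_pb2

blob = caffe_pb2.BlobProto()
with open('random_image_mean.binaryproto', 'rb') as f:
    blob.ParseFromString(f.read())

# blobproto_to_array returns (num, channels, height, width); take the first image
mean = caffe.io.blobproto_to_array(blob)[0]
for c in range(mean.shape[0]):
    print('mean value for channel %d: %.2f' % (c, mean[c].mean()))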

@sguada sguada closed this as completed Oct 13, 2014
@andreesteva
Author

@sguada
There must be a bug. When I run this on an older version of caffe (from the August time frame), it works just fine. This leads me to believe that it's not a memory problem, nor a prototxt definition problem. I am changing the mean_file accordingly, and I still encounter the same segmentation fault. Here are the steps that I am taking:

On old caffe version:
generate random image buffers using matlab: size 480x640 (Success)
create leveldb files, following the imagenet example (Success)
Compute the image mean, using build/tools/compute_image_mean (success)
train caffe (success)

On new caffe version, with cudnn
generate random image buffers using matlab: size 480x640 (Success)
create lmdb files, following the imagenet example (Success)
Compute the image mean, using build/tools/compute_image_mean (success)
train caffe (segfault)

From version to version of caffe, I've encountered times when the examples haven't been fully updated (e.g. the argument syntax for creating leveldb files changed). Could it possibly be due to that?

If helpful, I can upload a package to github which can allow you to recreate the error.

@sguada
Contributor

sguada commented Oct 17, 2014

Could you try the new caffe version without cudnn?
Also, please upload the log file or further info to help clarify your issue.
It could still be a problem of padding misalignment.
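
To illustrate what a padding/rounding misalignment can look like (an illustration only, not a confirmed diagnosis): Caffe's pooling layer rounds the output size up, so a backend that rounds down produces a smaller map for some input sizes, e.g. for the 3x3, stride-2 pools in this net:

import math

def pool_ceil(n, k=3, s=2):
    return int(math.ceil((n - k) / float(s))) + 1   # ceil rounding, as in Caffe's pooling layer

def pool_floor(n, k=3, s=2):
    return (n - k) // s + 1                         # floor-rounding alternative

for n in range(26, 40):
    c, f = pool_ceil(n), pool_floor(n)
    note = '  <-- the two roundings disagree' if c != f else ''
    print('pool input %d: ceil gives %d, floor gives %d%s' % (n, c, f, note))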

Sergio


@andreesteva
Author

I'll try a version without cudnn.

In the meantime, I've uploaded the code I used for the old caffe and the new caffe version to
https://github.com/andreesteva/caffe-memory-test

It includes all the training and validation images already, as well as the leveldb and lmdb files used.

Copy either folder into $CAFFE_ROOT/examples/ and then run

bash run_training.sh solver_alexoverfeat.prototxt

to recreate the error I'm seeing.

@sguada
Contributor

sguada commented Oct 17, 2014

I'm sorry, but I'm not planning to recreate your error; can you just post the log?

Sergio


@andreesteva
Author

Sure, is this a file that is generated somewhere, or just the standard output when you run caffe?

@sguada
Contributor

sguada commented Oct 17, 2014

It is the output generated when you run caffe (it goes to stderr), and you can redirect it with:

caffe ... 2> log.file

Sergio


@andreesteva
Author

OK, here it is:
https://github.com/andreesteva/caffe-memory-test/blob/master/log_segfault.txt

@sguada
Contributor

sguada commented Oct 17, 2014

The segfault message alone is not very informative. Can you make clean, then recompile with DEBUG := 1 in Makefile.config, and double-check that you have done

export GLOG_logtostderr=1

Also try using gdb to find where the segmentation fault is happening:

gdb --args caffe [ARGS]

Sergio


@andreesteva
Author

OK, I'll try that.
Where do you suggest putting the 'export GLOG_logtostderr=1' command?

@sguada
Contributor

sguada commented Oct 17, 2014

If you use bash, put it in your .bashrc so it is there all the time.

Sergio


@andreesteva
Author

OK, I put up the log file:
log_segfault_debug.txt

I'm trying gdb now.

@andreesteva
Author

I also just put up the gdb output, which isn't very illuminating either. I hope you can make more sense of it than I can.

@sguada
Contributor

sguada commented Oct 17, 2014

This warning makes me guess that your LMDB is empty. Could you check its size and create it again?

I1017 16:18:34.443876 16468 data_layer.cpp:195] Restarting data prefetching from start.
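
A quick way to check whether the LMDB really has entries (a sketch using the py-lmdb package; the path below is a placeholder for whatever data_param.source points to):

import lmdb

env = lmdb.open('random_train_lmdb', readonly=True)   # placeholder path
print('number of entries: %d' % env.stat()['entries'])

with env.begin() as txn:
    cursor = txn.cursor()
    if cursor.first():
        key, value = cursor.item()
        print('first key: %s, value size: %d bytes' % (key, len(value)))
env.close()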

Sergio


@andreesteva
Author

I checked the file size and even tried increasing the number of images generated. The lmdb_folder/data.mdb files are substantial, and when I rerun gdb --args caffe ... I end up with exactly the same output.

@sguada
Contributor

sguada commented Oct 18, 2014

Try with backend: LEVELDB in data_param instead.

Sergio


@andreesteva
Author

The same issue persists.

@sguada
Contributor

sguada commented Oct 18, 2014

I don't know; please post your question to the caffe-users mailing list.

@andreesteva
Author

That's alright. Thank you so much for your help regardless.
