Very large data dimension size when initialising training #62

Open
SpaceMeerkat opened this issue Sep 13, 2022 · 8 comments

SpaceMeerkat commented Sep 13, 2022

Hi,

I'm trying to train a SOM (just to get used to the way PINK works, hence the tiny file counts and parameters below), but when I start the training run, my output log file grows at a rate of ~1 GB every 30 seconds, filling up with information like that shown in the quote below.

Number of data entries = 1096810496
Data dimension = 1097859072 x 1100480512 x 1102053376 x 1101004800 x 1099956224 x 1099431936 x 1101004800 x 1097859072 x 1098907648 x 1101004800 x 1101004800 x 1103101952 x 1104674816 x 1103101952 x 1105199104 x 1102577664 x 1101529088 x 1101004800 x 1098907648 x 1100480512 x 1101529088 x 1102053376 x 1094713344 x 1093664768 x 1103626240 x 1099956224 x 1088421888 x 1101004800 x 1099431936 x 1099956224 x 1104150528 x 1095761920 x 1102577664 x 1101529088 x 1101004800 x 1099956224 x 1099431936 x 1101529088 x 1103626240 x 1101004800 x 1101004800 x 1102577664 x 1097859072 x 1098907648 x 1101529088 x 1101004800 x 1097859072 x 1102053376 x 1099956224 x 1096810496 x 1101004800 x 1098907648 x 1097859072 x 1100480512 x 1100480512 x 1098907648 x 1101529088 x 1096810496 x 1097859072 x 1095761920 x 1099956224 x 1091567616 x 1098907648 x 1101529088 x 1099431936 x 1092616192 x 1096810496 x 1101529088 x 1099431936 x 1096810496 x 1100480512 x 1106247680 x 1101529088 x 1103101952 x 1101529088 x 1102053376 x 1102053376 x 1097859072 x 1101004800 x 1095761920 x 1096810496 x 1104674816 x 1097859072 x 1099956224 x 1102053376 x 1097859072 x 1099956224 x 1104674816 x 1099956224 x 1109917696 x 1094713344 x 1097859072 x 1111490560 x 1100480512 x 1102577664 x 1113325568 x 1099431936 x 1103626240 x 1113325568 x 1101529088 x 1103626240 x 1113849856 x 1105199104 x 1100480512 x 1111752704 x 1098907648 x 1097859072 x 1108606976 x 1096810496 x 1096810496 x 1104674816 x 1102053376 x 1097859072 x 1106771968 x 1101529088 x 1099956224 x 1102577664 x 1101004800 x 1096810496 x 1100480512 x 1097859072 x 1096810496 x 1103626240 x 1096810496 x 1100480512 x 1092616192 x 1099431936 x 1102577664 x 1091567616 x 1097859072 x 1098907648 x 1097859072 x 1098907648 x 1098907648 x 1094713344 x 1096810496 x 1099956224 x 1099956224 x 1098907648 x 1099431936 x 1105723392 x 1101529088 x 1099431936 x 1108344832 x 1101529088 x 1100480512 x 1105723392 x 1098907648 x 1098907648 x 1104150528 x 1103101952 x 1103101952 x 
1107296256 x 1099431936 x 1098907648 x 1104150528 x 1098907648 x 1100480512 x 1104150528 x 1103626240 x 1099956224 x 1103626240 x 1097859072 x 1102053376 x 1099956224 x 1100480512 x

I set the training run going using the following:

$Pink --train _scripts/test.bin _pink_out/som.bin --som-width 10 --som-height 10 --num-iter 1 --numrot 4

So as you can see, I'm keeping this little training run simple, with a small grid size and only 4 rotations per image.

This is for a training run where I'm just using 3 images of size 128x128, in float32 format. So I'm confused as to why the number of data entries in the first line of the quote above is so high.

I'd really appreciate any help you can give on this!

Extra info

My images are stored in test.bin using the following numpy raw binary example:

filename = "test.bin"
fileobj = open(filename, mode='wb')
stacked_images.tofile(fileobj)
fileobj.close()

... where stacked_images has shape (128, 128, 3) or (3, 128, 128). (I've tried both to see if that was the problem, but the issue above occurs either way.)

@tjgalvin

So, PINK expects its own binary file format for the input images to train against. Among other things, you have to specify the dimensionality of the images, their actual dimensions, their data type, and the number of images.

Having a quick look at your issue, I am guessing the problem is that PINK is reading a set of bytes at the start of the file as a header, and is inferring that your data has many (many!) dimensions. What it is actually reading, though, are the first bytes of the raw numpy array that has been written to disk.
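As a quick check of this guess, the first value in the log above, 1096810496, is exactly what the float32 value 14.0 looks like when its four bytes are read back as a 32-bit integer. The sketch below shows that reinterpretation; the helper name is made up for illustration, and the pixel value 14.0 is inferred from the log rather than taken from the actual images.

```python
import struct


def float32_bits_as_int32(value):
    """Reinterpret the 4 bytes of a float32 as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", value))[0]


# A pixel value of 14.0, misread as an integer header field.
misread = float32_bits_as_int32(14.0)
print(misread)  # 1096810496, the "Number of data entries" value in the log
```

Ordinary small positive pixel values all land in the same range of roughly 1.09 billion, which would explain why every "dimension" in the log looks similar.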

There is an example of the expected header format in script form here: https://github.com/HITS-AIN/PINK/blob/master/scripts/create_test_image.py

A couple of years ago, when I was working with PINK, I wrote a very basic ImageWriter class to help me out: https://github.com/tjgalvin/pyink/blob/master/pyink/binwrap.py

Hopefully this helps :)

@SpaceMeerkat
Author

Ooooh, okay, that makes sense! Thanks for the tips and links. I'll have a read through these scripts and see if I can finally get these images into the right format. Your ImageWriter looks really helpful, thanks!

@tjgalvin

tjgalvin commented Sep 13, 2022 via email

@BerndDoser
Member

Please find a description of the binary file format at https://github.com/HITS-AIN/PINK/blob/master/FILE_FORMATS.md.
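For orientation, here is a rough sketch of writing a header in front of the image data instead of dumping the raw array. The field order and values (version 2, file type 0, data type 0 for float32, number of images, layout 0, dimensionality, then each dimension, all as 32-bit integers) are an assumption based on my reading of the format description and should be checked against FILE_FORMATS.md before use. The function name `write_pink_images` is hypothetical and is not part of PINK or pyink.

```python
import io
import struct

import numpy as np


def write_pink_images(fileobj, images):
    """Write a stack of 2D float32 images with an ASSUMED PINK-style header.

    images: array of shape (n_images, height, width).
    """
    images = np.ascontiguousarray(images, dtype=np.float32)
    n_images, height, width = images.shape
    header = struct.pack(
        "8i",
        2,         # format version (assumed)
        0,         # file type: image data (assumed)
        0,         # data type: float32 (assumed)
        n_images,  # number of images
        0,         # data layout: cartesian (assumed)
        2,         # dimensionality of each image
        height,    # dimension order is assumed
        width,
    )
    fileobj.write(header)
    fileobj.write(images.tobytes())


# Three 128x128 images, as in the original post, written to an in-memory buffer.
stacked_images = np.zeros((3, 128, 128), dtype=np.float32)
buffer = io.BytesIO()
write_pink_images(buffer, stacked_images)
print(len(buffer.getvalue()))  # 32 header bytes + 3 * 128 * 128 * 4 data bytes
```

To write to disk instead, pass a file opened with `open("test.bin", "wb")` in place of the buffer.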

@SpaceMeerkat
Author

Thanks, I really appreciate that!

@SpaceMeerkat
Author

Hi both,

Thanks for your replies in this thread. Happy to close it if it's annoying having me leave it open, but I thought I'd ask a quick question first!

I've been able to train the SOM, and map images to the trained map using the Pink --map <image-file> <result-file> <SOM-file> functionality of PINK.

But is there a way of providing the training images to the trained map, and replacing the images which are output by MAPVisualizer() with real images from the training set? (Basically, the reverse BMU)

Sorry if that's a really naive question. I'm struggling with the binary file paradigm, which means that even though I know how to do it in theory, directly reading (and then comparing) the mapped_data.bin file and the saved SOM.bin files is still a little confusing for me!

@tjgalvin, if there's a specific function in your ImageWriter class that is appropriate for this, then maybe that's a good place for me to start!

Many thanks for any help on this!

@tjgalvin

Sorry for my delayed reply @SpaceMeerkat. The use of binary files can be a little tricky to get your head around. I am a little unclear on what you are asking, but I think it is how to list all images that best match a particular BMU of interest. To do this you will need to load the map file. This should have a shape that is something like (number_of_images, SOM_depth, SOM_height, SOM_width), and each element in this array is the Euclidean distance between the i-th image and the neuron at that position.

You would need to pick a particular neuron of interest and find all images whose minimum Euclidean distance falls at that neuron's coordinate. Does that make sense? Once you have these indices, you can use them to pull out the images of interest from your image binary file.
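A minimal numpy sketch of that lookup, assuming the distances have already been loaded into an array with the shape described above. The function name and the toy data are made up for illustration; this is not the pyink implementation.

```python
import numpy as np


def images_matching_neuron(distances, neuron):
    """Return indices of images whose BMU is the given neuron.

    distances: array of shape (n_images, som_depth, som_height, som_width).
    neuron: (depth, height, width) coordinate of the neuron of interest.
    """
    n_images = distances.shape[0]
    # Flatten the SOM axes so argmin finds one best neuron per image.
    bmu_flat = distances.reshape(n_images, -1).argmin(axis=1)
    # Convert flat positions back to (depth, height, width) coordinates.
    bmu = np.stack(np.unravel_index(bmu_flat, distances.shape[1:]), axis=1)
    return np.flatnonzero((bmu == np.asarray(neuron)).all(axis=1))


# Toy example: 4 images on a 1 x 2 x 2 SOM, with the minima placed by hand.
distances = np.ones((4, 1, 2, 2))
distances[0, 0, 0, 1] = 0.0
distances[1, 0, 1, 0] = 0.0
distances[2, 0, 0, 1] = 0.0
distances[3, 0, 1, 1] = 0.0

matches = images_matching_neuron(distances, (0, 0, 1))
print(matches)  # images 0 and 2 have neuron (0, 0, 1) as their BMU
```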

If you are using my code from pyink, you want to use the Mapping class to load one of the Euclidean distance binary files produced by PINK. In this class there is a method called bmu that returns the neuron coordinate of the BMU for an index or indices. There is also another method called images_with_bmu that returns the indices of all images that have a particular neuron as their BMU. These should give you the indices you need to pull out the images you want using the ImageReader class. You do not want to use the ImageWriter class to access data from an existing binary file.

Sorry that there are no doc pages for this old code of mine. I never got around to doing it.

@SpaceMeerkat
Author

Hey @tjgalvin, don't be sorry, I'm just happy to have any help on this at all! I didn't do a great job of explaining my question, but you were pretty close with your guess. I was actually aiming to do the reverse: for a particular map neuron, find the image from the training set that best matches it. From the bmu function in Mapping, I can see that it'd be easy to add an extra bit that returns the min rather than just evaluating the argmin to get the neuron in question. And from your description of the map file, it looks like I could just loop over each neuron and find the image with the lowest Euclidean distance to it. So it shouldn't require much hacking on my part to get the images that best match the neurons in the final maps from your code.
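The per-neuron lookup described here does not need an explicit loop. A sketch in plain numpy, assuming the distances are already in an array of shape (number_of_images, SOM_depth, SOM_height, SOM_width) as described earlier in the thread. The function name is hypothetical and the data is synthetic.

```python
import numpy as np


def best_image_per_neuron(distances):
    """For every neuron, find the image with the lowest distance to it.

    distances: array of shape (n_images, som_depth, som_height, som_width).
    Returns (best_image, best_distance), each of shape
    (som_depth, som_height, som_width).
    """
    best_image = distances.argmin(axis=0)   # index of the closest image
    best_distance = distances.min(axis=0)   # its distance to the neuron
    return best_image, best_distance


# Synthetic example: 5 images on a 1 x 10 x 10 SOM.
rng = np.random.default_rng(0)
distances = rng.random((5, 1, 10, 10)) + 1.0  # all distances >= 1
distances[3, 0, 2, 7] = 0.0                   # image 3 matches neuron (0, 2, 7)

best_image, best_distance = best_image_per_neuron(distances)
print(best_image.shape)     # one image index per neuron
print(best_image[0, 2, 7])  # index of the image closest to neuron (0, 2, 7)
```

The resulting indices can then be used to pull the matching images out of the image binary file, so that each neuron in the visualised map is replaced by its closest real image.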

It's a great package of helper functions, I can imagine it'll be a pain to go back and write docs for it now but when used hand-in-hand with the PINK package this makes life so much easier! Thanks for all this help :D
