
reproducing results paper #15

Open
koenvanderveen opened this issue Dec 20, 2018 · 14 comments

@koenvanderveen

Hi!
I was playing with your code, great work! I am trying to reproduce the results from your paper on WikiSQL. However, when using run.sh I get results in the 70.3% ballpark (on the dev set) instead of the reported 72.2%. Are there any parameters I need to change to get the reported results?

Thanks in advance!

@crazydonkey200
Owner

Thanks for asking the question. The result in the paper was obtained using the default parameters in the repo on an AWS g3.xlarge machine.

There are three sources of difference between experiments (and the sensitivity of RL training tends to amplify them):
(1) Stochasticity from the random seed.
(2) Stochasticity from asynchronous training.
(3) Different machine configurations. In my experience, even instances of the same type can sometimes behave slightly differently in the cloud.

But the difference you saw is larger than the standard deviation in my experiments, so I would also like to investigate it.

I am working on an update to fix (1) and (2) to make the experiments more deterministic. For (3), may I know the machine configuration you are using?

In the README, I attached a picture of the learning curve of one run that reached 72.35% dev accuracy on WikiSQL. If it helps, I can also share the full tensorboard log and the saved best model from a more recent experiment.

@koenvanderveen
Author

Thanks for your quick response! I used an AWS g3.xlarge. I tried multiple times but consistently get results around 70.3%.

@crazydonkey200
Owner

Thanks for the input. I will try starting some new AWS instances to see if I can replicate the issue. In the meantime, here's a link to the data of a recent run that reached 72.2% dev accuracy. The tensorboard log is in the tb_log subfolder, and the best model is saved in the best_model subfolder.

@koenvanderveen
Author

koenvanderveen commented Dec 29, 2018

Thanks, I'd love to find out where the difference originates from. I downloaded the repo again to make sure I had not made any changes and ran it again, but reached the same result. The only thing I had to change to make it work was replacing (line 70 of table/utils.py):

try: 
  val = babel.numbers.parse_decimal(val) 
except (babel.numbers.NumberFormatError): 
  val = val.lower() 

with

try: 
  val = babel.numbers.parse_decimal(val) 
except (babel.numbers.NumberFormatError, UnicodeEncodeError): 
  val = val.lower() 

due to errors like this:
UnicodeEncodeError: 'decimal' codec can't encode character u'\u2013' in position 1: invalid decimal Unicode string

Do you think that might be the reason? And if so, do you have any idea how to prevent catching those errors?
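
For reference, one way to avoid hitting the exception in the first place is to normalize unicode punctuation (such as the en dash u'\u2013' from the error above) to ASCII before calling parse_decimal. The snippet below is only an illustrative sketch, not code from the repo, and the set of characters to normalize is an assumption:

import babel.numbers

# Illustrative sketch (not the repo's code): map unicode hyphen/dash/minus
# characters to ASCII '-' before parsing, keeping the broader except as a fallback.
DASH_CHARS = u'\u2010\u2011\u2012\u2013\u2014\u2212'

def parse_number_or_lower(val):
    normalized = val
    for ch in DASH_CHARS:
        normalized = normalized.replace(ch, u'-')
    try:
        return babel.numbers.parse_decimal(normalized)
    except (babel.numbers.NumberFormatError, UnicodeEncodeError):
        return val.lower()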

@crazydonkey200
Owner

Sorry for the late reply. I have added your change to the codebase and reran the experiments on two new AWS instances. The mean and std from 3 experiments (each averaging 5 runs) are 71.92±0.21%, 71.97±0.17%, and 71.93±0.38%. You can also download all the data for these 3 experiments here 1 2 3.

I am also curious about the reason for the difference. I have added a new branch named fix_randomization to make the results more reproducible by controlling the random seeds. Would you like to try running the experiments again using the new branch on an AWS instance and let me know if anything changes?

Thanks.

@koenvanderveen
Author

Hi! I ran the experiments again on the fix_randomization branch, but the results did not change (still around 70%). Did you re-download the data before running the experiments? I cannot think of any other source of randomness at this point, but the difference is quite consistent.

@koenvanderveen
Author

OK, I finally found the source of the difference. I had used a newer version of the Deep Learning AMI on AWS; I ran the experiments with v10 now and got matching results (around 71.7%). It would be interesting to know which operations changed.

@crazydonkey200
Owner

crazydonkey200 commented Jan 31, 2019

Thanks for reporting this and for running the experiments to confirm it!

That's interesting; I would also like to look into this. Which newer version of the Deep Learning AMI did you use? Is it Deep Learning AMI (Ubuntu) Version 21.0 - ami-0b294f219d14e6a82? And how do you launch instances with previous versions, for example v10? Thanks!
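
For reference, older Deep Learning AMI releases can be looked up by name and launched by their AMI id. The boto3 sketch below is only an illustration; the exact AMI name string and region are assumptions:

import boto3

# Sketch: find a specific Deep Learning AMI release by name (the name filter and
# region are assumptions; adjust to the desired version and region).
ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.describe_images(
    Owners=['amazon'],
    Filters=[{'Name': 'name', 'Values': ['Deep Learning AMI (Ubuntu) Version 10.0']}],
)
for image in sorted(response['Images'], key=lambda img: img['CreationDate'], reverse=True):
    print(image['ImageId'], image['Name'])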

@dungtn

dungtn commented Feb 2, 2019

Hi there :-)

I'm trying to replicate the results on WikiTableQuestions. I tried TensorFlow v1.12.0 (Deep Learning AMI 21.0) and v1.8.0 (Deep Learning AMI 10.0). The corresponding accuracies are 41.12% for v1.12.0 and 43.27% for v1.8.0. It looks like the difference is due to the TensorFlow version.

Also, are the current settings in run.sh the ones used to produce the learning curve in the image? The number of steps is set to 25,000, while in the picture the number of steps is around 30,000. Also, max_n_mem was set to 60, which caused Not enough memory slots for example... warnings. I changed it to 100, but I'm not sure if that is the right thing to do. Thanks!

@crazydonkey200
Owner

Hi, thanks for the information :) I will run some experiments to compare TF v1.12.0 vs v1.8.0.

The current settings in run.sh are the ones used to produce the result in the paper. The image was produced from an older setting that trains for 30k steps. Thanks for pointing it out; I will replace the image with a run under the current settings.

The max_n_mem was set to 60 for the sake of speed. When the table is large and requires more than 60 memory slots, some columns will be dropped (the reason for the Not enough memory slots for example... warning). Changing it to 100 would probably achieve a similar or better result because no columns will be dropped, but the training will be slower.
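
As a rough sketch of the trade-off described above (an illustration of the idea only, not the repo's actual preprocessing code):

# Illustration only: when a table needs more memory slots than max_n_mem allows,
# extra columns are dropped (trading accuracy for speed). Which columns are kept
# here is arbitrary and just for illustration.
def select_columns(columns, n_reserved_slots, max_n_mem=60):
    available = max_n_mem - n_reserved_slots
    if len(columns) > available:
        print('Not enough memory slots for example, dropping %d columns'
              % (len(columns) - available))
        return columns[:available]
    return columns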

@crazydonkey200
Owner

As an update, I have created a branch reproducible that can run training deterministically. Because it is hard to make TensorFlow deterministic when using a GPU (see here for more info) and when running with multiprocessing, this branch uses only 1 trainer and 1 actor for training, so training is very slow (about 44 hours for one training run, versus 2-3 hours on the master branch). This branch uses tensorflow-gpu==1.12.0.
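
For context, the usual way to pin down randomness in a TF 1.x setup is to seed Python, NumPy, and the TensorFlow graph before building the model. This is a generic sketch rather than the branch's actual code; even with these seeds, GPU kernels and multi-process actors can stay nondeterministic, which is why the branch falls back to a single trainer and actor:

import random
import numpy as np
import tensorflow as tf

# Generic TF 1.x seeding sketch (not this repo's code): fix all three seeds
# before any graph construction or data shuffling happens.
SEED = 0
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # graph-level seed; GPU ops can still be nondeterministic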

This setting gets slightly lower results on WikiTable (41.51±0.19% dev accuracy, 42.78±0.77% test accuracy). Below are the commands to reproduce the experiments (after pulling the latest version of the repo):

git checkout reproducible
cd ~/projects/neural-symbolic-machines/table/wtq/
./run_experiments.sh run_rpd.sh mapo mapo_rpd

@dungtn

dungtn commented Mar 29, 2019

Can you add more details about dataset preprocessing? For example, how to generate the all_train_saved_programs.json file?

@guotong1988

Where did you get stop_words.json?

@crazydonkey200
Owner

crazydonkey200 commented Mar 31, 2021

@dungtn Here's a detailed summary created by another researcher on how to replicate the preprocessing and experiments starting from the raw WikiTableQuestions dataset, and how to adapt the code to other similar datasets. I have also added a link to this summary to the README.

@guotong1988 Unfortunately I don't remember exactly where I got the list in stop_words.json, but it seems to be a subset of the NLTK stop words, for example as found here.
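
For anyone who needs to regenerate such a file, something with the same shape can be produced from the NLTK stop-word list. This is only a sketch, and the original stop_words.json may be a different subset:

import json
import nltk

# Sketch: dump NLTK's English stop words into a stop_words.json-style file.
# The original file in the repo may differ from this list.
nltk.download('stopwords')
from nltk.corpus import stopwords

with open('stop_words.json', 'w') as f:
    json.dump(sorted(stopwords.words('english')), f, indent=2)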
