
reproducing results paper #15

Open
koenvanderveen opened this issue Dec 20, 2018 · 14 comments

@koenvanderveen

Hi!
I was playing with your code, great work! I am trying to reproduce the results from your paper on WikiSQL. However, when using run.sh I get results in the 70.3% ballpark (on the dev set) instead of the reported 72.2%. Are there any parameters I need to change to get the reported results?

Thanks in advance!

@crazydonkey200
Owner

Thanks for asking the question. The result in the paper was obtained using the default parameters in the repo on an AWS g3.xlarge machine.

There are three sources of difference between experiments (and the sensitivity of RL training tends to amplify them):
(1) Stochasticity from the random seed.
(2) Stochasticity from asynchronous training.
(3) Different machine configurations. In my experience, even instances of the same type can sometimes behave slightly differently in the cloud.

But the difference you saw is larger than the standard deviation in my experiments, so I would also like to investigate it.

I am working on an update to fix (1) and (2) to make the experiments more deterministic. For (3), may I know the machine configuration you are using?

In the README, I attached a picture of the learning curve of one run that reached 72.35% dev accuracy on WikiSQL. If it helps, I can also share the full tensorboard log and the saved best model from a more recent experiment.

@koenvanderveen
Author

Thanks for your quick response! I used an AWS g3.xlarge. I tried multiple times but consistently get results around 70.3%.

@crazydonkey200
Owner

Thanks for the input. I will try starting some new AWS instances to see if I can replicate the issue. In the meantime, here's a link to the data of a recent run that reached 72.2% dev accuracy. The tensorboard log is in the tb_log subfolder, and the best model is saved in the best_model subfolder.

@koenvanderveen
Author

koenvanderveen commented Dec 29, 2018

Thanks, I'd love to find out where the difference originates from. I downloaded the repo again to make sure I had not made any changes and ran it again, but reached the same result. The only thing I had to change to make it work was replacing (line 70 of table/utils.py):

try: 
  val = babel.numbers.parse_decimal(val) 
except (babel.numbers.NumberFormatError): 
  val = val.lower() 

with

try: 
  val = babel.numbers.parse_decimal(val) 
except (babel.numbers.NumberFormatError, UnicodeEncodeError): 
  val = val.lower() 

due to errors like this:
UnicodeEncodeError: 'decimal' codec can't encode character u'\u2013' in position 1: invalid decimal Unicode string

Do you think that might be the reason? And if so, do you have any idea how to prevent catching those errors?
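
For reference, one way to avoid hitting the exception in the first place is to normalize unicode punctuation (such as the en dash u'\u2013' from the error above) to ASCII before calling parse_decimal. The snippet below is only an illustrative sketch, not code from the repo, and the set of characters to normalize is an assumption:

import babel.numbers

# Illustrative sketch (not the repo's code): map unicode hyphen/dash/minus
# characters to ASCII '-' before parsing, keeping the broader except as a fallback.
DASH_CHARS = u'\u2010\u2011\u2012\u2013\u2014\u2212'

def parse_number_or_lower(val):
    normalized = val
    for ch in DASH_CHARS:
        normalized = normalized.replace(ch, u'-')
    try:
        return babel.numbers.parse_decimal(normalized)
    except (babel.numbers.NumberFormatError, UnicodeEncodeError):
        return val.lower()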

@crazydonkey200
Owner

Sorry for the late reply. I have added your change to the codebase and reran the experiments on two new AWS instances. The mean and std from 3 experiments (each averaging 5 runs) are 71.92±0.21%, 71.97±0.17%, and 71.93±0.38%. You can also download all the data for these 3 experiments here 1 2 3.

I am also curious about the reason for the difference. I have added a new branch named fix_randomization to make the results more reproducible by controlling the random seeds. Would you like to try running the experiments again using the new branch on an AWS instance and let me know if anything changes?

Thanks.

@koenvanderveen
Author

Hi! I ran the experiments again on the fix_randomization branch, but the results did not change (still around 70%). Did you re-download the data before running the experiments? I cannot think of any other source of randomness at this point, but the difference is quite consistent.

@koenvanderveen
Author

OK, I finally found the source of the difference. I had used a newer version of the Deep Learning AMI on AWS; I ran the experiments with v10 now and got matching results (around 71.7%). It would be interesting to know which operations changed.

@crazydonkey200
Owner

crazydonkey200 commented Jan 31, 2019

Thanks for reporting this and for running the experiments to confirm it!

That's interesting; I would also like to look into this. Which newer version of the Deep Learning AMI did you use? Is it Deep Learning AMI (Ubuntu) Version 21.0 - ami-0b294f219d14e6a82? And how do you launch instances with previous versions, for example v10? Thanks!
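
For reference, older Deep Learning AMI releases can be looked up by name and launched by their AMI id. The boto3 sketch below is only an illustration; the exact AMI name string and region are assumptions:

import boto3

# Sketch: find a specific Deep Learning AMI release by name (the name filter and
# region are assumptions; adjust to the desired version and region).
ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.describe_images(
    Owners=['amazon'],
    Filters=[{'Name': 'name', 'Values': ['Deep Learning AMI (Ubuntu) Version 10.0']}],
)
for image in sorted(response['Images'], key=lambda img: img['CreationDate'], reverse=True):
    print(image['ImageId'], image['Name'])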

@dungtn

dungtn commented Feb 2, 2019

Hi there :-)

I'm trying to replicate the results on WikiTableQuestions. I tried TensorFlow v1.12.0 (Deep Learning AMI 21.0) and v1.8.0 (Deep Learning AMI 10.0). The corresponding accuracies are 41.12% for v1.12.0 and 43.27% for v1.8.0. It looks like the difference is due to the TensorFlow version.

Also, are the current settings in run.sh the ones used to produce the learning curve in the image? The number of steps is set to 25,000, while in the picture the number of steps is around 30,000. Also, max_n_mem was set to 60, which caused Not enough memory slots for example... warnings. I changed it to 100, but I'm not sure if that is the right thing to do. Thanks!

@crazydonkey200
Owner

Hi, thanks for the information :) I will run some experiments to compare TF v1.12.0 vs v1.8.0.

The current settings in run.sh are the ones used to produce the result in the paper. The image was produced from an older setting that trains for 30k steps. Thanks for pointing it out; I will replace the image with a run under the current settings.

The max_n_mem was set to 60 for the sake of speed. When the table is large and requires more than 60 memory slots, some columns will be dropped (the reason for the Not enough memory slots for example... warning). Changing it to 100 would probably achieve a similar or better result because no columns will be dropped, but the training will be slower.
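
As a rough sketch of the trade-off described above (an illustration of the idea only, not the repo's actual preprocessing code):

# Illustration only: when a table needs more memory slots than max_n_mem allows,
# extra columns are dropped (trading accuracy for speed). Which columns are kept
# here is arbitrary and just for illustration.
def select_columns(columns, n_reserved_slots, max_n_mem=60):
    available = max_n_mem - n_reserved_slots
    if len(columns) > available:
        print('Not enough memory slots for example, dropping %d columns'
              % (len(columns) - available))
        return columns[:available]
    return columns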

@crazydonkey200
Owner

As an update, I have created a branch reproducible that can run training deterministically. Because it is hard to make TensorFlow deterministic when using a GPU (see here for more info) and when running with multiprocessing, this branch uses only 1 trainer and 1 actor for training, so training is very slow (about 44 hours for one training run, versus 2-3 hours on the master branch). This branch uses tensorflow-gpu==1.12.0.
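
For context, the usual way to pin down randomness in a TF 1.x setup is to seed Python, NumPy, and the TensorFlow graph before building the model. This is a generic sketch rather than the branch's actual code; even with these seeds, GPU kernels and multi-process actors can stay nondeterministic, which is why the branch falls back to a single trainer and actor:

import random
import numpy as np
import tensorflow as tf

# Generic TF 1.x seeding sketch (not this repo's code): fix all three seeds
# before any graph construction or data shuffling happens.
SEED = 0
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # graph-level seed; GPU ops can still be nondeterministic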

This setting gets slightly lower results on WikiTable (41.51±0.19% dev accuracy, 42.78±0.77% test accuracy). Below are the commands to reproduce the experiments (after pulling the latest version of the repo):

git checkout reproducible
cd ~/projects/neural-symbolic-machines/table/wtq/
./run_experiments.sh run_rpd.sh mapo mapo_rpd

@dungtn

dungtn commented Mar 29, 2019

Can you add more details about dataset preprocessing? For example, how to generate the all_train_saved_programs.json file?

@guotong1988

Where did you get stop_words.json?

@crazydonkey200
Owner

crazydonkey200 commented Mar 31, 2021

@dungtn Here's a detailed summary created by another researcher on how to replicate the preprocessing and experiments starting from the raw WikiTableQuestions dataset, and how to adapt the code to other similar datasets. I have also added a link to this summary to the README.

@guotong1988 Unfortunately I don't remember exactly where I got the list in stop_words.json, but it seems to be a subset of the NLTK stop words, for example as found here.
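
For anyone who needs to regenerate such a file, something with the same shape can be produced from the NLTK stop-word list. This is only a sketch, and the original stop_words.json may be a different subset:

import json
import nltk

# Sketch: dump NLTK's English stop words into a stop_words.json-style file.
# The original file in the repo may differ from this list.
nltk.download('stopwords')
from nltk.corpus import stopwords

with open('stop_words.json', 'w') as f:
    json.dump(sorted(stopwords.words('english')), f, indent=2)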
