Questions repetition in test datasets (augmented) #5

Open
anette123 opened this issue Apr 6, 2021 · 0 comments

Hi team, I have noticed that the test datasets in the augmented data contain heavily repeated questions. In particular, I am looking at the synonym_generalization task, which reads data from /data_augmented/CLEVR/questions/synonym_generalization/i/, where data_augmented is the archive I downloaded from http://vcml.csail.mit.edu/data/dataset_augmentation.tgz as per the instructions. I can see the following:

  1. File /questions/synonym_generalization/0/test_questions.json contains 60000 questions, all of which are repetitions of the following 9 questions:
       ['Is small a synonym of sphere?', 'Is shiny a synonym of sphere?',
       'Is shiny a synonym of small?', 'Is sphere a synonym of small?',
       'Is small a synonym of shiny?', 'Is sphere a synonym of shiny?',
       'Is small a synonym of small?', 'Is shiny a synonym of shiny?',
       'Is sphere a synonym of sphere?']
  2. File /questions/synonym_generalization/1/test_questions.json contains 60000 questions, all of which are repetitions of the following 9 questions:
       ['Is cube a synonym of cube?', 'Is metal a synonym of cube?',
       'Is ball a synonym of cube?', 'Is metal a synonym of ball?',
       'Is cube a synonym of ball?', 'Is ball a synonym of ball?',
       'Is metal a synonym of metal?', 'Is ball a synonym of metal?',
       'Is cube a synonym of metal?']
  3. File /questions/synonym_generalization/2/test_questions.json contains 60000 questions, all of which are repetitions of the following 1 question:
       ['Is metallic a synonym of metallic?']
  4. File /questions/synonym_generalization/3/test_questions.json contains 60000 questions, all of which are repetitions of the following 16 questions:
       ['Is metal a synonym of shiny?', 'Is shiny a synonym of shiny?',
       'Is ball a synonym of large?', 'Is metal a synonym of large?',
       'Is shiny a synonym of large?', 'Is ball a synonym of ball?',
       'Is ball a synonym of shiny?', 'Is large a synonym of shiny?',
       'Is large a synonym of large?', 'Is shiny a synonym of ball?',
       'Is large a synonym of ball?', 'Is metal a synonym of ball?',
       'Is metal a synonym of metal?', 'Is shiny a synonym of metal?',
       'Is ball a synonym of metal?', 'Is large a synonym of metal?']
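
For anyone who wants to reproduce these counts, here is a minimal sketch of how they can be computed. It assumes each test_questions.json follows the usual CLEVR layout, i.e. a top-level "questions" list whose entries carry the text under a "question" key; I have not confirmed this against the dataset documentation. The helper name `unique_question_counts` is just illustrative.

```python
import json
from collections import Counter


def unique_question_counts(path, text_key="question"):
    """Return (total number of entries, Counter of question texts) for one file."""
    with open(path) as f:
        data = json.load(f)
    # Assumption: CLEVR-style layout {"questions": [...]}.
    # If the file is a bare list instead, use it directly.
    questions = data["questions"] if isinstance(data, dict) else data
    counts = Counter(q[text_key] for q in questions)
    return len(questions), counts
```

Calling it on one split's test_questions.json and comparing the returned total with `len(counts)` shows how many entries there are versus how many distinct question strings; `counts.most_common()` lists how often each one repeats.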

Is this behaviour expected? Apart from 'question_index', I could not find any difference between the repeated questions. I would really appreciate your help on this.
