ValueError: File does not exist #153

Open
BashirBurhan opened this issue Aug 13, 2018 · 7 comments

@BashirBurhan

I am currently running this code in a Jupyter notebook. I have a folder full of .lab and .txt files, but when I specify the corpus location in
magpie.init_word_vectors(path,vec_dim)
it simply returns this error:
ValueError: The file /Path/Untitled.i.txt doesn't exist

I don't understand where the Untitled.i.txt part is coming from.

Thank you for any help

@jstypka
Collaborator

jstypka commented Aug 13, 2018

@BashirBurhan can you print the path variable before the call?

@BashirBurhan
Author

@jstypka I don't quite understand what you mean.

@BashirBurhan
Author

@jstypka

Path = '/home/forestreetds/notebooks/Forestreet/Burhan/Files/'
magpie.train_word2vec(Path,vec_dim = 100)

@jstypka
Collaborator

jstypka commented Aug 13, 2018

@BashirBurhan so you run:
magpie.train_word2vec('/home/forestreetds/notebooks/Forestreet/Burhan/Files/', vec_dim = 100)
and you get:
ValueError: The file /Path/Untitled.i.txt doesn't exist?

That seems impossible. What's the full stacktrace?

@BashirBurhan
Author

BashirBurhan commented Aug 14, 2018

@jstypka

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-104-507aa1249c46> in <module>()
      1 Path = '/home/forestreetds/notebooks/Forestreet/Burhan/Files/'
----> 2 magpie.train_word2vec(Path,vec_dim = 100)

/anaconda/envs/py35/lib/python3.5/site-packages/magpie/main.py in train_word2vec(self, train_dir, vec_dim)
    250                   file=sys.stderr)
    251 
--> 252         self.word2vec_model = train_word2vec(train_dir, vec_dim=vec_dim)
    253 
    254         return self.word2vec_model

/anaconda/envs/py35/lib/python3.5/site-packages/magpie/base/word2vec.py in train_word2vec(doc_directory, vec_dim)
    118         size=vec_dim,
    119         min_count=MIN_WORD_COUNT,
--> 120         window=WORD2VEC_CONTEXT,
    121     )
    122 

/anaconda/envs/py35/lib/python3.5/site-packages/gensim/models/word2vec.py in __init__(self, sentences, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab, batch_words)
    467             if isinstance(sentences, GeneratorType):
    468                 raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
--> 469             self.build_vocab(sentences, trim_rule=trim_rule)
    470             self.train(sentences)
    471 

/anaconda/envs/py35/lib/python3.5/site-packages/gensim/models/word2vec.py in build_vocab(self, sentences, keep_raw_vocab, trim_rule, progress_per, update)
    531 
    532         """
--> 533         self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
    534         self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
    535         self.finalize_vocab(update=update)  # build tables & arrays

/anaconda/envs/py35/lib/python3.5/site-packages/gensim/models/word2vec.py in scan_vocab(self, sentences, progress_per, trim_rule)
    543         vocab = defaultdict(int)
    544         checked_string_types = 0
--> 545         for sentence_no, sentence in enumerate(sentences):
    546             if not checked_string_types:
    547                 if isinstance(sentence, string_types):

/anaconda/envs/py35/lib/python3.5/site-packages/magpie/base/word2vec.py in __iter__(self)
    108             files = {filename[:-4] for filename in os.listdir(self.dirname)}
    109             for doc_id, fname in enumerate(files):
--> 110                 d = Document(doc_id, os.path.join(self.dirname, fname + '.txt'))
    111                 for sentence in d.read_sentences():
    112                     yield sentence

/anaconda/envs/py35/lib/python3.5/site-packages/magpie/base/document.py in __init__(self, doc_id, filepath, text)
     21         else:  # is a path to a file
     22             if not os.path.exists(filepath):
---> 23                 raise ValueError("The file " + filepath + " doesn't exist")
     24 
     25             self.filepath = filepath

ValueError: The file /home/forestreetds/notebooks/Forestreet/Burhan/Files/.ipynb_checkpo.txt doesn't exist

@BashirBurhan
Author

@jstypka it's always either .ipynb_checkpo.txt or untitled.i.txt

@jstypka
Collaborator

jstypka commented Aug 14, 2018

Okay, now I see it. Take a look at these lines:

/anaconda/envs/py35/lib/python3.5/site-packages/magpie/base/word2vec.py in __iter__(self)
    108             files = {filename[:-4] for filename in os.listdir(self.dirname)}
    109             for doc_id, fname in enumerate(files):

We take the directory that you pass as the argument (Path in your case), list everything in it, and strip the extension in a very crude way: by cutting off the last 4 characters of each name. We then load the resulting names with the .txt and .lab extensions appended.
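To make the truncation concrete, here is a minimal sketch that applies the same `filename[:-4]` slicing from line 108 of word2vec.py to a few sample names. The names `Untitled.ipynb` and `.ipynb_checkpoints` are assumptions inferred from the truncated paths in the error messages, and `doc1` is a made-up training document.

```python
# Sample directory listing. "doc1" is hypothetical; the two Jupyter entries
# are assumed from the truncated names reported in the error messages.
names = ["doc1.txt", "doc1.lab", "Untitled.ipynb", ".ipynb_checkpoints"]

# Same crude extension stripping as in magpie/base/word2vec.py (line 108).
stems = {name[:-4] for name in names}

# Paths that would then be opened as training documents.
candidates = sorted(stem + ".txt" for stem in stems)
print(candidates)
```

Only `doc1.txt` corresponds to a real file here; the other two candidates are the nonexistent paths seen in the reported errors.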

You seem to have two additional entries in the directory: Untitled.ipynb and .ipynb_checkpoints (the checkpoint folder that Jupyter creates). Magpie treats them as training files, truncates their names to Untitled.i and .ipynb_checkpo, and then fails because the corresponding .txt files don't exist. To fix the error, move them somewhere else and make sure that the training directory contains only files that can be used for training.
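One way to spot such entries before training is to list everything in the directory that is not a regular .txt or .lab file. This is only a sketch: `find_stray_entries` and `TRAINING_EXTENSIONS` are hypothetical names and not part of Magpie.

```python
import os

# Extensions mentioned in this thread as valid training files (assumption).
TRAINING_EXTENSIONS = {".txt", ".lab"}


def find_stray_entries(dirname):
    """Return names in dirname that are not regular .txt/.lab files."""
    stray = []
    for name in sorted(os.listdir(dirname)):
        full_path = os.path.join(dirname, name)
        extension = os.path.splitext(name)[1]
        # Directories (such as a checkpoint folder) and files with other
        # extensions (such as notebooks) are both reported.
        if not os.path.isfile(full_path) or extension not in TRAINING_EXTENSIONS:
            stray.append(name)
    return stray
```

Calling `find_stray_entries(Path)` on the training directory should return an empty list once the notebook and the checkpoint folder have been moved out.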

On the Magpie side, we should add a more robust scanning and loading mechanism so that it can handle these situations, or at least give a readable error.
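As a rough illustration of what more robust scanning could look like, the sketch below collects document names with `os.path.splitext` and keeps only regular .txt files. It is an assumption about one possible approach, not the fix that Magpie adopted, and `list_document_stems` is a hypothetical helper.

```python
import os


def list_document_stems(dirname):
    """Return the sorted stems of regular .txt files in dirname."""
    stems = []
    for name in os.listdir(dirname):
        stem, extension = os.path.splitext(name)
        if extension != ".txt":
            continue  # skips notebooks, .lab files and extensionless names
        if not os.path.isfile(os.path.join(dirname, name)):
            continue  # skips directories that happen to end in .txt
        stems.append(stem)
    return sorted(stems)
```

Because the stem comes from `os.path.splitext` rather than a fixed 4-character slice, extensions of any length are handled, and entries such as a notebook or a checkpoint folder are ignored instead of being turned into nonexistent .txt paths.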

@jstypka jstypka added the bug label Aug 14, 2018