suggestion: Use a single file for labels and text #151

Open
shashi-netra opened this issue Aug 4, 2018 · 5 comments
Comments

@shashi-netra

In the current version you have .lab and .txt files - one of each per training row. Wouldn't it be easier to save these in a single file, or at least one file for all labels and another for all texts? Wouldn't this be more idiomatic (à la scikit-learn)?

Having one .lab and one .txt file per example is especially problematic for large datasets: with millions of files, the filesystem chokes.
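
For illustration, a single-file layout could look something like the sketch below (the tab-separated format and the `load_corpus` helper are hypothetical, not part of the current API):

```python
import csv

# Hypothetical single-file layout: one document per line, with the text in the
# first column and comma-separated labels in the second, e.g.
#
#   the dog is red<TAB>dog
#   the cat and dog are blue<TAB>cat,dog

def load_corpus(path):
    """Yield (text, labels) pairs from a single tab-separated file."""
    with open(path, newline='', encoding='utf-8') as f:
        for text, labels in csv.reader(f, delimiter='\t'):
            yield text, labels.split(',')
```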

@jstypka
Collaborator

jstypka commented Aug 4, 2018

@shashi-netra you're right, having another option for loading files would be a reasonable feature. I think you're actually not the first to suggest that. It shouldn't be difficult to implement, but I can't promise I'll have time to do it in the near future. You're welcome to take a stab at it and open a PR!

@dorg-ekrolewicz

@jstypka Can you please indicate what the input format looks like? Is it embedding arrays for the inputs and one-hot arrays for the labels?

@jstypka
Collaborator

jstypka commented Oct 4, 2018

@dorg-ekrolewicz the output is one-hot arrays and the input is a 2D array - each row being a word represented as a word2vec vector. A batch of several documents would make a 3D tensor. Does that help?
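
For illustration, here is a minimal sketch of those shapes in NumPy (the dimensions are toy values chosen for the example, and the exact axis order may differ from the library's internals):

```python
import numpy as np

embedding_dim = 100   # size of each word2vec vector (toy value)
max_num_words = 10    # documents are padded/truncated to this length (toy value)
num_classes = 2

# One document: a 2D array with one word2vec vector per row
doc = np.zeros((max_num_words, embedding_dim))

# A batch of m documents stacked together: a 3D tensor
m = 32
batch_x = np.zeros((m, max_num_words, embedding_dim))

# Labels: one one-hot (or multi-hot) vector per document
batch_y = np.zeros((m, num_classes))
```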

@dorg-ekrolewicz

Are you using padding?

Example for classifying cats and dogs: num_classes = 2, and say max_num_words = 10 (the maximum number of words per x in this example).

Inputs:

  1. x = "the dog is red", y = [0, 1], where num_words = 4
  2. x = "the cat and dog are blue", y = [1, 1], where num_words = 6

Since we have m = 2 examples, would the input dimensions be (m, embedding_dim, max_num_words)?

@jstypka
Collaborator

jstypka commented Oct 4, 2018

@dorg-ekrolewicz yes, that looks correct. We pad with 0s up to max_num_words and insert a 0 vector if we don't have a representation for a word (unfamiliar vocabulary).

Pretty much all the code is in this function.
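
A minimal sketch of that padding scheme (the `build_batch` helper and the dict-based `word2vec` lookup are illustrative only, not the actual implementation in the function linked above):

```python
import numpy as np

def build_batch(docs, word2vec, max_num_words, embedding_dim):
    """Pad/truncate tokenised documents and stack them into a 3D tensor.

    Words without a word2vec representation get a zero vector, and
    documents shorter than max_num_words are padded with zero rows.
    """
    batch = np.zeros((len(docs), max_num_words, embedding_dim))
    for i, words in enumerate(docs):
        for j, word in enumerate(words[:max_num_words]):
            vector = word2vec.get(word)  # None for unfamiliar vocabulary
            if vector is not None:
                batch[i, j] = vector
    return batch

# Usage with a toy vocabulary:
w2v = {'dog': np.ones(4), 'cat': np.full(4, 0.5)}
x = build_batch([['the', 'dog', 'is', 'red']], w2v, max_num_words=10, embedding_dim=4)
print(x.shape)  # (1, 10, 4)
```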
