
One question regarding padding #4

Open
Chandrak1907 opened this issue Dec 15, 2018 · 5 comments

Comments

@Chandrak1907

Hi,

I see that you pad the inputs to a fixed length of 52, but it seems the padding is applied only to the character inputs, not to the words.

 # 0-pads all character sequences to a common length
 def padding(Sentences):
     maxlen = 52
     # find the longest word so that nothing is truncated
     for sentence in Sentences:
         char = sentence[2]
         for x in char:
             maxlen = max(maxlen, len(x))
     # pad every word's character indices up to maxlen
     for i, sentence in enumerate(Sentences):
         Sentences[i][2] = pad_sequences(Sentences[i][2], maxlen, padding='post')
     return Sentences
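As a sanity check on what post-padding does here, the behavior of `pad_sequences(..., padding='post')` can be sketched in plain Python. The character indices below use a made-up mapping (a=1 … z=26), not the repository's actual `char2Idx`:

```python
# Minimal sketch of post-padding character index lists, mimicking
# pad_sequences(..., padding='post'). The index values are hypothetical.
def pad_post(seqs, maxlen, value=0):
    """Right-pad each sequence with `value` up to maxlen, truncating longer ones."""
    return [list(s[:maxlen]) + [value] * (maxlen - len(s[:maxlen])) for s in seqs]

# Character indices for the words "ner" and "tagging" (a=1 ... z=26)
chars = [[14, 5, 18], [20, 1, 7, 7, 9, 14, 7]]
padded = pad_post(chars, maxlen=8)
# Every word now occupies exactly 8 character slots, zeros on the right
assert all(len(p) == 8 for p in padded)
```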

Each entry in Sentences is built as follows:

         dataset.append([wordIndices, caseIndices, charIndices, labelIndices]) 
     return dataset

I also see that you have made batches of inputs in which all words have equal length. Is this the correct approach?
Can you please let me know?

@mxhofer
Owner

mxhofer commented Dec 15, 2018

Hello! That's correct. This is because the convolutional neural network (CNN) processes the padded character vectors, which must be of equal length. Alternatively, one could split the CNN input into equal-size batches (e.g. here). Each input batch to the bi-directional LSTM has the same length, which depends on how many words there are in a document.
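The batching idea described above can be sketched as follows: group sentences by word count so that every batch fed to the bi-directional LSTM has a uniform length. This is a hypothetical helper for illustration, not the repository's exact implementation:

```python
from collections import defaultdict

# Sketch: bucket sentences by their number of words so each batch has
# uniform length (hypothetical helper, toy data).
def batch_by_length(sentences):
    buckets = defaultdict(list)
    for s in sentences:
        buckets[len(s)].append(s)
    return list(buckets.values())

docs = [["EU", "rejects", "call"], ["Peter", "Blackburn"], ["BRUSSELS", "1996-08-22"]]
batches = batch_by_length(docs)
# Two batches: one holding the 3-word sentence, one holding both 2-word sentences
assert sorted(len(b) for b in batches) == [1, 2]
```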

@Chandrak1907
Author

Thank you for responding. One follow-up question:
what was the rationale behind padding characters to a maximum length of 52? There are 26 upper-case letters, 26 lower-case letters, and other punctuation characters. Can you please let me know?

@mxhofer
Owner

mxhofer commented Dec 15, 2018

The maximum length was chosen after analyzing word lengths in the documents, such that no words are cut off.
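The analysis described above can be sketched like this: scan every word in the corpus and take the longest, so that the padding length truncates nothing. The corpus below is toy data for illustration:

```python
# Sketch of choosing the character-padding length: find the longest word
# in the corpus so no word is cut off (toy documents, not the real data).
def max_word_length(documents):
    return max(len(word) for doc in documents for word in doc)

corpus = [["pneumonoultramicroscopic", "terms"], ["short", "words", "only"]]
print(max_word_length(corpus))  # length of the longest word in the corpus
```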

@Chandrak1907
Author

There is some confusion. In my understanding, padding is applied only to characters, not to words.

@mxhofer
Owner

mxhofer commented Dec 16, 2018

Padding is indeed applied to characters. For example, the padded character-level input below is for the word "RECORD". The output of the padding(Sentences) function is a list of documents, each of which is a list of words, cases, characters and labels (see the output of the createMatrices(sentences, word2Idx, label2Idx, case2Idx, char2Idx) function).
[Screenshot: padded character-level input for the word "RECORD"]
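For readers without the screenshot, the padded character input for "RECORD" might look like this under a hypothetical char2Idx mapping (A=1 … Z=26, with index 0 reserved for padding; the repository's actual mapping may differ):

```python
# Hypothetical illustration: padded character indices for the word "RECORD"
# with a made-up char2Idx (A=1 ... Z=26, 0 = padding token).
char2Idx = {c: i + 1 for i, c in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ")}
word = "RECORD"
maxlen = 52
padded = [char2Idx[c] for c in word] + [0] * (maxlen - len(word))
# First six slots hold the indices of R,E,C,O,R,D; the remaining 46 are zeros
assert len(padded) == 52 and padded[:6] == [18, 5, 3, 15, 18, 4]
```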
