Preprocess Text: add option to extract specific (keyword) N-grams #1011

Open
wvdvegte opened this issue Oct 13, 2023 · 1 comment
Labels
enhancement; meal (This will take a day or two); text expert (Requires knowledge of Text add-on.); wishlist

Comments

@wvdvegte

Is your feature request related to a problem? Please describe.
Certain types of documents, such as scientific publications, are often accompanied by a list of keywords that typically contains N-grams, such as "generative neural networks", "fertility rates" or "consumer preferences". If the metadata of the publications can be downloaded, these keywords often appear in a separate column, separated by commas, semicolons or some other separator. Using Preprocess Text, these N-grams can easily be extracted by tokenization using the regexp [^;]+ (for ";" as separator).
When analyzing the full texts or abstracts, it would be very useful if these same N-grams were also recognized as belonging together and not as separate words - including keyword N-grams from other documents that appear in a document's main text (but not in its keywords). Of course, N-grams can be extracted by defining an N-gram range in Preprocess Text, but this also produces many meaningless or less meaningful N-grams, especially if in-between stopwords have already been removed.
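To illustrate the regexp tokenization described above outside of Orange, here is a minimal sketch in plain Python. The function name `split_keywords` and the whitespace stripping and lowercasing are assumptions added for illustration; they are not part of Preprocess Text.

```python
import re


def split_keywords(field, sep=";"):
    """Split a keyword field on `sep` using the [^;]+ style of pattern.

    Hypothetical helper for illustration only; Preprocess Text applies the
    regexp directly and does not strip or lowercase the matches for you.
    """
    # Build "[^;]+" for the chosen separator, escaping it in case it is
    # a character with special meaning inside a character class.
    pattern = "[^" + re.escape(sep) + "]+"
    # Each match is one keyword, possibly with surrounding whitespace.
    return [m.strip().lower() for m in re.findall(pattern, field) if m.strip()]


keywords = split_keywords(
    "generative neural networks; fertility rates;consumer preferences"
)
print(keywords)
# ['generative neural networks', 'fertility rates', 'consumer preferences']
```

Note that the raw regexp keeps any whitespace that follows the separator, so a trimming step like the one above (or a more specific pattern) is probably needed before the keywords are compared with tokens from the main text.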

Describe the solution you'd like
Ideally, I would like to be able to connect two corpora as inputs to Preprocess Text: one with the main texts or abstracts from all documents, and one with all the keywords (1-grams and N-grams) from all documents, presumably already tokenized with Preprocess Text. The second input would only be used for a "keyword N-gram construction" step after tokenization of the first input (not necessarily at the end, like regular N-gram construction).
Another option would be to allow specifying an "N-gram keyword lexicon" file in the Filtering step, but that would require a two-step approach in which the list of keywords has to be re-created and reloaded each time documents are added.
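The requested "keyword N-gram construction" step could be sketched roughly as follows. This is plain Python, not the Orange Text API: the function name `merge_keyword_ngrams`, the greedy longest-match strategy and the space used to join merged tokens are all assumptions made for illustration.

```python
def merge_keyword_ngrams(tokens, keyword_ngrams, joiner=" "):
    """Merge runs of tokens that match a known keyword N-gram.

    tokens: list of str, one tokenized document (first input).
    keyword_ngrams: iterable of token sequences (second input), e.g.
        [["fertility", "rates"], ["generative", "neural", "networks"]].
    Hypothetical sketch: at each position the longest matching N-gram wins.
    """
    # Only multi-word keywords need merging; 1-grams are already tokens.
    lexicon = {tuple(k) for k in keyword_ngrams if len(k) > 1}
    max_len = max((len(k) for k in lexicon), default=1)

    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible N-gram first, down to bigrams.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in lexicon:
                merged.append(joiner.join(candidate))
                i += n
                break
        else:
            # No keyword N-gram starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


doc = ["generative", "neural", "networks", "affect", "fertility", "rates"]
lexicon = [
    ["generative", "neural", "networks"],
    ["neural", "networks"],
    ["fertility", "rates"],
]
print(merge_keyword_ngrams(doc, lexicon))
# ['generative neural networks', 'affect', 'fertility rates']
```

Because only N-grams present in the keyword input are merged, this avoids the noise of a plain N-gram range. The keyword N-grams would have to be normalized the same way as the main text (case, stopword removal, lemmatization) for the matches to be found.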

Describe alternatives you've considered
As mentioned above, using the regular N-gram construction option, which produces a lot of noise.

@ajdapretnar
Collaborator

I had a similar idea for adding n-grams from Collocations. It would be nice to have a separate input for "appending" to tokens. The output would be added to Preprocess Text and to Collocations, the input to Preprocess Text.
@PrimozGodec Is this a viable option?

@ajdapretnar added the enhancement, wishlist, meal (This will take a day or two) and text expert (Requires knowledge of Text add-on.) labels on Oct 23, 2023