nlppre

Preprocessing for NLP applicaitons. It could also definitely be used for non-nlp applications. Totally your call.Supports Pandas DataFrames, arrays and lists, as well as strings.

Class: Corpus

nlppre.Corpus(text, columns=None, url=False): an object that holds the original text, as well as any cleaned or tokenized versions

text: the text you wish to process
columns: if inputting a Pandas DataFrame, columns is a list or string of the columns to import from that DataFrame.
url: if True, the text field will be instead treated as a URL leading to a .txt file for import.

Class Methods

nlppre.Corpus.clean(self, gutenburg=False, uncased=True, brackets=True, hyperlinks=True, punctuation=True, line_breaks=True, tabs=True, nums=True) cleans the text, removing html, brackets, etc

nlppre.Corpus.tokenize(self, preview=True): splits a body of text into a list of words

Class: NGram

nlppre.NGram(corpus, n=2, length=10): takes in a corpus of text and creates a list of N-grams. N-grams are a count of how many times a word or sequence of words appears in a text. For instance, a unigram is how many times a single word appears, bigrams are a count of how many times a sequence of two words appears, and so on.

corpus: a nlppre.Corpus() object
n: the n in N-grams
length: the length of the N-gram list to plot

Class Methods

nlppre.NGram.create(n=1, length=10): generates a list of ngrams

n: ngram number (1 for unigrams, 2 for bigrams, and so on)
length: length of the list of ngrams returned

nlppre.NGram.plot(n=1, length=10, color=Blue): plots the ngram count using Seaborn

n: ngram number (1 for unigrams, 2 for bigrams, and so on)
length: length of the list of ngrams returned
color: color range for the count plot

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

nlppre

Class: Corpus

Class Methods

Class: NGram

Class Methods

Files

README.md

Latest commit

History

README.md

File metadata and controls

nlppre

Class: Corpus

Class Methods

Class: NGram

Class Methods