classify_cattle_disease

Code to preprocess and classify the dataset

Combine health events and comments

To combine health events and comments, use the 'combine' class. It takes the 'AllEventsMapped' dataframes and a datedifference (the date difference used for matching), which defaults to 10. Note that you have to go through the results manually to remove noise. Matching one-to-one is not practical, since it yields almost no results.
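The matching idea can be illustrated with a minimal sketch. It is not the code in combine.py: the record fields (`animal_id`, `date`, `event`, `text`), the function name, matching on the same animal, and reading the date difference as days are all assumptions made for illustration.

```python
from datetime import date

def match_events_to_comments(events, comments, max_days=10):
    """Pair each event with every comment for the same animal within max_days."""
    pairs = []
    for event in events:
        for comment in comments:
            same_animal = event["animal_id"] == comment["animal_id"]
            gap = abs((event["date"] - comment["date"]).days)
            if same_animal and gap <= max_days:
                pairs.append((event["event"], comment["text"], gap))
    return pairs

# Hypothetical records; the real dataframes have their own columns.
events = [{"animal_id": 1, "date": date(2020, 3, 1), "event": "mastitis"}]
comments = [
    {"animal_id": 1, "date": date(2020, 3, 6), "text": "swollen udder"},
    {"animal_id": 1, "date": date(2020, 4, 20), "text": "hoof trimmed"},
    {"animal_id": 2, "date": date(2020, 3, 2), "text": "off feed"},
]
matched = match_events_to_comments(events, comments)
```

Because one event can pair with several comments inside the window, the output is many-to-many, which is why a manual pass to remove noise is needed afterwards.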

Translate

To translate the comments of a dataframe to English, use the 'translate' class. It detects the language, corrects the spelling, and translates the non-English comments. By default, it uses the SGD algorithm (the stochastic gradient descent version of SVM) with bigrams and 600 features. The possible algorithms are:

Naive Bayes
Bernoulli Bayes
Multinomial Bayes
Logistic Regression
SGD (Stochastic Gradient Descent version of SVM)

All algorithms support the following combinations of n-gram size and number of features (n-gram size: feature counts):

1: 300
2: 300, 600, 1200, 2000
3: 300, 600, 1200, 3000
4: 300, 600, 1200, 2000
5: 300, 600, 1200
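To make the n-gram size and feature count concrete, here is a minimal sketch of how a fixed-size n-gram vocabulary could be built and used to turn a comment into a feature vector. It assumes character n-grams and a "keep the most frequent n-grams" rule. The source does not say how features are selected, and the function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_vocabulary(texts, n=2, max_features=600):
    """Keep the max_features most frequent n-grams across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    return [gram for gram, _ in counts.most_common(max_features)]

def vectorize(text, vocabulary, n=2):
    """Count how often each vocabulary n-gram occurs in the text."""
    counts = Counter(char_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]

vocabulary = build_vocabulary(["koe", "koe ziek"], n=2, max_features=3)
vector = vectorize("koe", vocabulary, n=2)
```

With `n=2` and `max_features=600` this corresponds to the default bigram setting described above. The resulting vectors would then be fed to one of the listed classifiers.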

The class uses Google Translate to translate the comments to English, although this part is currently not working.

Preprocessing

The 'preprocess' class takes care of stop word removal, enrichment, stemming, and lemmatization. The comments are enriched based on a lookup dictionary containing information about medicines and diseases. If a comment contains neither of those, it is enriched using the TAGME procedure based on Wikipedia articles. Named entities are recognized using the Stanford NER tagger. The resulting dataframe contains enriched, stemmed comments.
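The stop word removal and dictionary enrichment steps can be sketched as follows. This is a simplified illustration, not the code in preprocess.py: the stop word set, the lookup entries, and the function name are made up, and stemming, lemmatization, TAGME, and NER are left out.

```python
# Hypothetical stop words and lookup entries, for illustration only.
STOP_WORDS = {"the", "a", "with", "was", "is"}
LOOKUP = {"penicillin": "antibiotic", "mastitis": "udder infection"}

def preprocess(comment):
    """Drop stop words, then append lookup terms after each known token."""
    tokens = [t for t in comment.lower().split() if t not in STOP_WORDS]
    enriched = []
    for token in tokens:
        enriched.append(token)
        if token in LOOKUP:
            enriched.extend(LOOKUP[token].split())
    return enriched

result = preprocess("Treated with penicillin")
```

Here the medicine name pulls in an extra descriptive term, which gives a short comment more words for the classifier to work with.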

Classification

The 'classify' class handles the classification of comments by health condition and the prediction of new comments. It currently supports LDA, L-LDA, and the more classical approaches.

Latent Dirichlet Allocation

The LDA classifier takes in a combination of topics, passes, decay rate, iterations, and minimum probability. It uses a lookup dictionary to interpret the results. The models are stored in 'models/unsupervised'. The possible combinations are:

6 topics, 100 passes, decay rate 6, 10 iterations
6 topics, 250 passes, decay rate 6, 10 iterations
6 topics, 500 passes, decay rate 6, 10 iterations
7 topics, 250 passes, decay rate 6, 10 iterations
8 topics, 100 passes, decay rate 6, 10 iterations
8 topics, 250 passes, decay rate 6, 10 iterations
8 topics, 500 passes, decay rate 6, 10 iterations
9 topics, 250 passes, decay rate 6
11 topics, 100 passes, decay rate 6, 10 iterations
11 topics, 250 passes, decay rate 6, 10 iterations
11 topics, 500 passes, decay rate 6, 10 iterations
13 topics, 100 passes, decay rate 6, 10 iterations
13 topics, 250 passes, decay rate 6, 10 iterations
13 topics, 500 passes, decay rate 6, 10 iterations

The minimum probability can be set to a value between 0 and 1; it is the threshold above which the algorithm returns a result. If the algorithm returns a probability score higher than the threshold, the comment is classified accordingly. The lookups can be found in lookups.txt.
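The thresholding rule can be sketched in a few lines. The input is assumed to be a list of (topic id, probability) pairs, and the lookup is assumed to map topic ids to health conditions. The function name and the example lookup are hypothetical, and the actual format of lookups.txt may differ.

```python
def classify(topic_probabilities, lookup, minimum_probability=0.5):
    """Return the label of the most probable topic, or None below the threshold."""
    best_topic, best_score = max(topic_probabilities, key=lambda pair: pair[1])
    if best_score > minimum_probability:
        return lookup[best_topic]
    return None

# Hypothetical lookup from topic id to health condition.
lookup = {0: "lameness", 1: "mastitis"}
confident = classify([(0, 0.1), (1, 0.7)], lookup, minimum_probability=0.5)
uncertain = classify([(0, 0.1), (1, 0.7)], lookup, minimum_probability=0.8)
```

A higher threshold leaves more comments unclassified but makes the assigned labels more reliable.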

Labeled Latent Dirichlet Allocation

The L-LDA classifier takes a combination of topics, passes, and a minimum probability.

The possible combinations are:

6 topics, 50 passes
6 topics, 150 passes
6 topics, 250 passes
7 topics, 50 passes
7 topics, 150 passes
7 topics, 250 passes
8 topics, 50 passes
8 topics, 150 passes
8 topics, 250 passes
10 topics, 50 passes
10 topics, 150 passes
10 topics, 250 passes
13 topics, 50 passes
13 topics, 150 passes
13 topics, 250 passes

The minimum probability can be set to a value between 0 and 1; it is the threshold above which the algorithm returns a result. If the algorithm returns a probability score higher than the threshold, the comment is classified accordingly. The lookups can be found in lookups.txt.

Classical approaches

The algorithms used in the classical approach are: 1) Naive Bayes, 2) Logistic Regression, 3) Multinomial Bayes, 4) SGD (version of SVM). The classifier takes a number of topics, an n-gram size, a number of features, and a lookup. The possible numbers of topics are 7, 8, 11, and 13.

The possible combinations of n-grams and features are:

1 ngram: 1000 features, 2500 features, 5000 features.
2 ngram: 3000 features, 10000 features, 20000 features.
3 ngram: 3000 features, 10000 features, 20000 features.
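Since only the listed combinations are supported, it can help to check a configuration before loading a model. The sketch below encodes the values listed above. The function and constant names are hypothetical and not part of classify.py.

```python
# Supported values as listed above: n-gram size -> allowed feature counts.
ALLOWED_FEATURES = {
    1: (1000, 2500, 5000),
    2: (3000, 10000, 20000),
    3: (3000, 10000, 20000),
}
ALLOWED_TOPICS = (7, 8, 11, 13)

def is_supported(topics, ngram, features):
    """Return True if the classical approach lists this combination."""
    return topics in ALLOWED_TOPICS and features in ALLOWED_FEATURES.get(ngram, ())

ok = is_supported(topics=8, ngram=2, features=10000)
bad = is_supported(topics=8, ngram=1, features=3000)
```

Failing early on an unsupported combination gives a clearer message than a missing model file.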

Bovi-analytics/classify_cattle_disease

Folders and files

Latest commit

History

Repository files navigation

classify_cattle_disease

Combine health events and comments

Translate

Preprocessing

Classification

Latent Dirichlet Allocation

Labeled Latent Dirichlet Allocation

Classical approaches

About

Topics

Resources

Stars

Watchers

Forks

Languages