Text clusterization overview

This is more of a cheat sheet than a serious project with lofty goals. The data is one year of Reddit posts tagged #wot, stored in posts_Reddit_wot_en.csv. Text vectorization was done with four main methods: BoW, TF-IDF, PV-DM, and PV-DBOW.
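To make the difference between the first two vectorizers concrete, here is a minimal sketch that uses only the Python standard library. It is not the code from this repository: the tokenizer, the function names, the toy posts, and the unsmoothed idf formula ln(N / df) are all assumptions chosen for illustration, and library implementations usually apply smoothing and normalization on top.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bow_vectors(docs):
    """Return (vocabulary, list of raw term-count vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Reweight BoW counts by idf = ln(N / df), so rarer terms weigh more."""
    vocab, counts = bow_vectors(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]


# Hypothetical toy posts, not taken from posts_Reddit_wot_en.csv.
docs = ["new tank is great", "new map is bad", "great tank great crew"]
vocab, bow = bow_vectors(docs)
_, tfidf = tfidf_vectors(docs)
```

BoW keeps raw counts, so frequent words dominate the vector. TF-IDF scales each count down by how many documents contain the term, which is one plausible reason it separates short posts better.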

The clustering method is always K-means++, because I believe that changing it has little impact compared to changing the vectorization technique. Visualization is performed via MDS and PCA.
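For readers unfamiliar with K-means++, the sketch below shows the idea in plain Python: the first center is picked uniformly at random, and each further center is sampled with probability proportional to its squared distance from the nearest center chosen so far. Ordinary Lloyd iterations follow. The function names, the fixed iteration count, and the toy 2-D points are assumptions for illustration. The project itself presumably relies on a library implementation, and the visualization step (MDS, PCA) is not covered here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """K-means++ seeding. Assumes at least k distinct points, otherwise
    all sampling weights can become zero and rng.choices raises an error."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm with K-means++ seeding; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical 2-D vectors forming two well-separated groups.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
          [10.0, 10.0], [10.1, 9.9], [9.9, 10.2]]
labels, centers = kmeans(points, k=2)
```

The distance-weighted seeding makes it unlikely that two initial centers land in the same group, which is the main advantage over picking all centers uniformly at random.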

Install

git clone git@github.com:bluella/Text-clusterization-overview.git
cd Text-clusterization-overview
virtualenv -p /usr/bin/python3.7 tco_env
source ./tco_env/bin/activate
pip install -r requirements.txt

You are good to go!

Results

TF-IDF showed the best results among the vectorization methods. BoW is a bit less accurate. PV-DM and PV-DBOW deliver really weird results, perhaps because the dataset is too small to train the models properly. The PCA visualization seems to match the actual outcome better than MDS.

Further development

Proper clustering evaluation
Use a pretrained model for PV-DM, with the help of fastText or similar
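One possible starting point for the evaluation item above is the silhouette coefficient, which needs no ground-truth labels. The sketch below is a plain-Python illustration under stated assumptions: the function names and toy points are hypothetical, Euclidean distance is assumed, and nothing here is part of the current code base.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes distance 0, hence len(own) - 1).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points: two tight pairs far apart.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the pairs
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the pairs
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 suggest overlapping or mislabeled clusters. Comparing this number across the four vectorizers would give a first quantitative check of the impressions in the Results section.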

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

Large portions of the code were taken from the following resources:
