This project tests different text vectorization techniques as input for subsequent clustering.

bluella/Text-clusterization-overview

Text clusterization overview

This is more of a cheat sheet than a serious project with ambitious goals. The data is one year of Reddit posts with the #wot hashtag, stored in posts_Reddit_wot_en.csv. Text vectorization was done with four main methods: BoW (bag of words), TF-IDF, PV-DM, and PV-DBOW.
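To make the difference between the two count-based methods concrete, here is a minimal standard-library sketch of BoW and TF-IDF. It is an illustration only, not the code used in this project: the function names, the whitespace tokenizer, and the three toy posts are assumptions made up for the example, and the weighting is the plain textbook tf * log(N / df), whereas libraries typically apply smoothed variants.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents.

    Tokenization is a plain lowercase whitespace split (an assumption
    made for this sketch; real pipelines usually do more cleaning).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(count_vectors):
    """Reweight raw counts by tf * log(N / df), the unsmoothed textbook form."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    return [
        [vec[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for vec in count_vectors
    ]


# Hypothetical toy posts, not taken from posts_Reddit_wot_en.csv.
toy_posts = [
    "new tank update",
    "tank balance update",
    "map rotation bug",
]
vocab, bow = bag_of_words(toy_posts)
weighted = tf_idf(bow)
```

In the first toy post, "new" occurs in one document and "tank" in two, so TF-IDF gives "new" the larger weight even though both have the same raw count in the BoW vector. PV-DM and PV-DBOW are learned models rather than counting schemes, so they are not sketched here. In gensim's Doc2Vec, for instance, the dm parameter switches between the two (dm=1 for PV-DM, dm=0 for PV-DBOW), though whether this project uses gensim is an assumption.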

The clustering method is always K-means++, because I believe changing it has little impact compared to changing the vectorization technique. Visualization is performed with MDS and PCA.
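The clustering and visualization step can be sketched roughly as follows. This assumes scikit-learn is available (whether it is listed in requirements.txt is not confirmed here), and the helper name cluster_and_project, the cluster count, and the synthetic vectors are all hypothetical stand-ins for the real document vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import MDS


def cluster_and_project(vectors, n_clusters=2, seed=0):
    """Cluster vectors with k-means++ seeding and project them to 2-D.

    Returns the cluster labels plus PCA and MDS coordinates, so the
    two projections can be plotted side by side.
    """
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                random_state=seed)
    labels = km.fit_predict(vectors)
    pca_xy = PCA(n_components=2).fit_transform(vectors)
    mds_xy = MDS(n_components=2, random_state=seed).fit_transform(vectors)
    return labels, pca_xy, mds_xy


# Synthetic stand-in for document vectors: two well-separated groups
# of 10 points each in 5 dimensions (not real project data).
rng = np.random.default_rng(0)
toy_vectors = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 5)),
    rng.normal(5.0, 0.1, size=(10, 5)),
])
labels, pca_xy, mds_xy = cluster_and_project(toy_vectors)
```

Coloring a scatter plot of pca_xy or mds_xy by labels gives the kind of picture compared in the results. Keeping the clustering step fixed, as here, means any difference between plots comes from the vectors passed in.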

Install

git clone git@github.com:bluella/Text-clusterization-overview.git
cd Text-clusterization-overview
virtualenv -p /usr/bin/python3.7 tco_env
source ./tco_env/bin/activate
pip install -r requirements.txt

You are good to go!

Results

TF-IDF showed the best results among the vectorization methods. BoW is a bit less accurate. PV-DM and PV-DBOW deliver really weird results, perhaps because the dataset is too small to train the models properly. The PCA visualization seems to match the actual outcome better than MDS.

Further development

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Acknowledgments

Heavy loads of code were taken from the following resources:
