SkyAndCloud/HMM-pos-tagger

HMM pos tagger

A toy part-of-speech (POS) tagger based on a Hidden Markov Model (HMM).

Requirements

  • python3
  • numpy
  • scikit-learn (imported as sklearn)

Usage

  1. Data format: one sentence per line, written as whitespace-separated token/tag pairs, for example
token1/tag1 token2/tag2 token3/tag3 ...

See data/raw_data.txt for more details.
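As a rough illustration of the format above, a line could be parsed into (token, tag) pairs as follows. This is a minimal sketch, not code from the repository: the helper name `parse_line` and the sample tags are hypothetical, and it assumes that tags never contain a `/` (tokens may).

```python
def parse_line(line):
    """Split one 'token/tag token/tag ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST '/' so tokens that contain '/' stay intact.
        # Assumption: the tag itself never contains '/'.
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


# Hypothetical sample line; the real tag set is whatever data/raw_data.txt uses.
print(parse_line("I/PRP like/VBP 1/2/CD"))
# [('I', 'PRP'), ('like', 'VBP'), ('1/2', 'CD')]
```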

  2. Split data/raw_data.txt into data/train.txt and data/test.txt with a 4:1 ratio:
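import random

# Read the corpus, one sentence per line.
with open("data/raw_data.txt", "r", encoding="utf-8") as f:
  data = f.readlines()

# Shuffle, then hold out the first 20% as the test set (4:1 train/test split).
random.shuffle(data)
pivot = int(0.2 * len(data))
testset = data[:pivot]
trainset = data[pivot:]

# writelines() is required here: trainset and testset are lists of lines,
# and write() only accepts a single string.
with open("data/train.txt", "w", encoding="utf-8") as f:
  f.writelines(trainset)
with open("data/test.txt", "w", encoding="utf-8") as f:
  f.writelines(testset)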
  3. Estimate the HMM's initial-state probabilities, transition matrix and emission matrix on the training set. train.py computes them and caches them as initial_np.pkl, transit_np.pkl and emit_np.pkl:
python train.py

It also caches the lookup tables as token2idx.json and tag2idx.json.
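For readers unfamiliar with how these three parameter sets are typically obtained, the sketch below estimates them by counting and normalizing. It is an illustration only and not the contents of train.py: the function name `estimate_hmm`, the add-`alpha` smoothing, and the sample tags are assumptions, and how train.py handles smoothing or unknown tokens is not stated in this README.

```python
import numpy as np


def estimate_hmm(sentences, token2idx, tag2idx, alpha=1.0):
    """Estimate HMM parameters by smoothed relative-frequency counting.

    sentences: list of sentences, each a list of (token, tag) pairs.
    alpha: add-alpha smoothing constant (an assumption, not from train.py).
    """
    n_tags, n_tokens = len(tag2idx), len(token2idx)
    # Start every count at alpha so unseen events keep non-zero probability.
    initial = np.full(n_tags, alpha, dtype=float)
    transit = np.full((n_tags, n_tags), alpha, dtype=float)
    emit = np.full((n_tags, n_tokens), alpha, dtype=float)

    for sent in sentences:
        prev = None
        for token, tag in sent:
            t = tag2idx[tag]
            emit[t, token2idx[token]] += 1   # tag t emits this token
            if prev is None:
                initial[t] += 1              # tag t starts a sentence
            else:
                transit[prev, t] += 1        # tag prev is followed by tag t
            prev = t

    # Turn counts into probability distributions (each row sums to 1).
    initial /= initial.sum()
    transit /= transit.sum(axis=1, keepdims=True)
    emit /= emit.sum(axis=1, keepdims=True)
    return initial, transit, emit


# Tiny made-up corpus for illustration.
sentences = [[("the", "DET"), ("dog", "NOUN")], [("dog", "NOUN")]]
token2idx = {"the": 0, "dog": 1}
tag2idx = {"DET": 0, "NOUN": 1}
initial, transit, emit = estimate_hmm(sentences, token2idx, tag2idx)
print(initial)  # [0.5 0.5]
```

Here `transit[i, j]` approximates P(tag j | previous tag i) and `emit[i, k]` approximates P(token k | tag i).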

  4. Evaluate micro-F1, precision, recall and accuracy on the test set, using the HMM learned in step 3:
$ python test.py
micro-f1 score: 0.7452485032055516
precision score: 0.7452485032055516
recall score: 0.7452485032055516
accuracy score: 0.7452485032055516
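
The four scores above are identical, which is expected rather than suspicious: when every token receives exactly one tag, each wrong prediction counts as one false positive (for the predicted tag) and one false negative (for the gold tag), so micro-averaged precision, recall and F1 all reduce to accuracy. The sketch below checks this by hand; the function name `micro_scores` and the sample tags are hypothetical, and test.py presumably relies on scikit-learn's metrics instead.

```python
from collections import Counter


def micro_scores(gold, pred):
    """Micro-averaged precision, recall, F1 and accuracy for single-label tags."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # tag p was predicted, but it is wrong
            fn[g] += 1  # gold tag g was missed
    n_tp, n_fp, n_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = n_tp / len(gold)
    return precision, recall, f1, accuracy


# Made-up gold and predicted tags: 3 of 4 tokens are tagged correctly.
gold = ["N", "V", "N", "D"]
pred = ["N", "N", "N", "D"]
print(micro_scores(gold, pred))
# (0.75, 0.75, 0.75, 0.75)
```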

It also writes the predicted tags to pred.txt.
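
Predicting a tag sequence with an HMM is usually done with Viterbi decoding. Whether test.py does exactly this is an assumption, since this README does not say; the sketch below only shows the standard algorithm in log space. The function name `viterbi` and the toy probabilities are made up for illustration, and the three arrays play the role of the cached initial, transition and emission parameters.

```python
import numpy as np


def viterbi(obs, initial, transit, emit):
    """Return the most likely tag-index sequence for token indices `obs`.

    Assumes `obs` is non-empty and every index is a valid column of `emit`.
    """
    n_tags, n_steps = len(initial), len(obs)
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_i, log_t, log_e = np.log(initial), np.log(transit), np.log(emit)

    score = np.empty((n_steps, n_tags))           # best log-prob ending in each tag
    back = np.zeros((n_steps, n_tags), dtype=int)  # best previous tag
    score[0] = log_i + log_e[:, obs[0]]
    for t in range(1, n_steps):
        # cand[i, j]: best path ending in tag i at t-1, then moving to tag j.
        cand = score[t - 1][:, None] + log_t
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_e[:, obs[t]]

    # Follow the back-pointers from the best final tag.
    path = [int(score[-1].argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Toy model with 2 tags and 3 token types (made-up numbers).
initial = np.array([0.6, 0.4])
transit = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], initial, transit, emit))
# [0, 0, 1]
```

Working in log space avoids numerical underflow on long sentences, where products of many small probabilities would otherwise round to zero.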
