Can i use another corpus? #5

ludens11 opened this issue Nov 28, 2016 · 3 comments

Comments

ludens11 commented Nov 28, 2016

Thanks for the awesome work! It really helps me. I'm just a beginner at NLP, but I need your explanation of this part:

    if corpus.lower() == "brown":
        from nltk.corpus import brown
        tagged_sents = brown.tagged_sents()[:num_sents]
    elif corpus.lower() == "treebank":
        from nltk.corpus import treebank
        tagged_sents = treebank.tagged_sents()[:num_sents]
    else:
        print "Please load either the 'brown' or the 'treebank' corpus."

Is it possible to change the corpus parameter to point to another document? I'm planning to use an Indonesian document made up of tweets. So far, I have training data of Indonesian words ( https://github.com/drr3d/BimaNLP/tree/master/dataset ). Can this maxent-pos-tagger work on it the same way as on the English corpora? Thank you very much!

Owner

arne-cl commented Dec 8, 2016

Hi callmefregy,

You can train the tagger on any corpus of POS-tagged sentences (your dataset seems to contain only tagged words).

    maxent_tagger = MaxentPosTagger()
    maxent_tagger.train(train_sents)
    maxent_tagger.tag(["This", "is", "a", "new", "sentence", "!"])

train_sents has to be a list of sentences, where each sentence is represented by a list of (token, POS tag) tuples, e.g.

[(u'Pierre', u'NNP'),
 (u'Vinken', u'NNP'),
 (u',', u','),
 (u'61', u'CD'),
 (u'years', u'NNS'),
 (u'old', u'JJ'),
 (u',', u','),
 (u'will', u'MD'),
 (u'join', u'VB'),
 (u'the', u'DT'),
 (u'board', u'NN'),
 (u'as', u'IN'),
 (u'a', u'DT'),
 (u'nonexecutive', u'JJ'),
 (u'director', u'NN'),
 (u'Nov.', u'NNP'),
 (u'29', u'CD'),
 (u'.', u'.')]

I guess you could use this corpus: http://www.panl10n.net/english/OutputsIndonesia2.htm
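
For readers whose data is a flat list of tagged words rather than tagged sentences, here is a minimal sketch of how such a file might be regrouped into the `train_sents` shape described above. It assumes a hypothetical layout of one `token<TAB>tag` pair per line with a blank line between sentences; the helper name `read_tagged_sents`, the sample words, and the tags are illustrative only and are not taken from the linked datasets or from any real Indonesian tagset.

```python
import io


def read_tagged_sents(lines, sep="\t"):
    """Group 'token<sep>tag' lines into sentences.

    Assumed (hypothetical) input layout: one token/tag pair per line,
    with a blank line marking the end of a sentence.
    """
    sents, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sents.append(current)
                current = []
            continue
        # Split on the last separator so tokens containing it survive.
        token, tag = line.rsplit(sep, 1)
        current.append((token, tag))
    if current:
        # The last sentence may not be followed by a blank line.
        sents.append(current)
    return sents


# Made-up sample data standing in for a real tagged file.
sample = io.StringIO(
    u"saya\tPRP\nmakan\tVB\nnasi\tNN\n.\t.\n"
    u"\n"
    u"dia\tPRP\ntidur\tVB\n.\t.\n"
)
train_sents = read_tagged_sents(sample)
```

The resulting `train_sents` is a list of sentences, each a list of (token, POS tag) tuples, which is the structure the usage example above passes to `train()`. If the source data has no sentence boundaries at all, they would have to be recovered some other way (for example, by treating each tweet as one sentence).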

Author

ludens11 commented Mar 2, 2017

Thanks for your reply. Actually, I'm a bit confused about how to use this script. I already found a guide to installing MEGAM: http://stackoverflow.com/questions/12606543/nltk-megam-max-ent-algorithms-on-windows . Could you give me some additional suggestions to get this script working? I'm running it on Windows. Your help would be greatly appreciated.

Owner

arne-cl commented Mar 2, 2017

In my last message I gave you a usage example. Which step does not work for you?
