GitHub - nicolaCirillo/conll_iterator: A simple iterator that reads conll and conllu files (https://universaldependencies.org/format.html) without keeping them in memory. It can iterate over words, sentences, or documents.

CONTENTS OF THIS FILE

Introduction
Setup
Getting started
Examples

INTRODUCTION

A simple iterator that reads conll and conllu files (https://universaldependencies.org/format.html) without keeping them in memory. It can iterate over words, sentences, or documents.

SETUP

via pip:

pip install conll-iterator

via github:

pip install git+https://github.com/nicolaCirillo/conll_iterator.git

GETTING STARTED

from conll_iterator import ConllIterator
from tqdm import tqdm


sentences = ConllIterator('sample/sample_corpus.conllu', 
                         fields=['form', 'upos'], 
                         mode='sentences', 
                         join_char='/')

for s in tqdm(sentences):
    # do something

EXAMPLES

Training word2vec

from gensim.models import Word2Vec
from tqdm import tqdm
from conll_iterator import ConllIterator

sentences = ConllIterator('sample/sample_corpus.conllu', fields=['lemma', 'upos'], mode='sentences', join_char='/')
w2v_parameters = {'vector_size': 25, 'window': 5, 'min_count': 1, 'sg': 1, 'epochs': 15}
model = Word2Vec(tqdm(sentences), workers=5, **w2v_parameters)
model.save('sample_w2v')
word_vectors = model.wv
similar = list(zip(*word_vectors.most_similar('Pecorino/PROPN')[:10]))[0]
print("Most similar words to Pecorino/PROPN:")
print(similar)

Keyword extraction via tf-idf

from itertools import chain
from collections import Counter

docs = ConllIterator('sample/sample_corpus.conllu', fields=['lemma', 'upos'], lower=['lemma'], mode='documents')
doc_tf = list()
df = Counter()
allowed_pos = ['NOUN', 'PROPN','VERB', 'ADJ']
for d in docs:
    tokens = list(chain(*d))
    tokens = [t[0] for t in tokens if t[1] in allowed_pos]
    tf = Counter(tokens)
    df.update(set(tokens))
    doc_tf.append(tf)

doc_keywords = list()
for d in doc_tf:
    doc_tfidf = [(w, d[w]/df[w]) for w in d]
    doc_tfidf = sorted(doc_tfidf, key=lambda x:x[1], reverse=True)
    doc_keywords.append(list(zip(*doc_tfidf[:10]))[0])

for i, k in enumerate(doc_keywords[:20]):
    print('keywords of doc {}:'.format(i+1), '; '.join(k))

Sequence extraction based on dependency tree

from conll_iterator import ConllIterator, DependencyExtractor

xpaths = """//word[@upos="VERB" and word[@deprel="nsubj" and word[@upos="DET"]] and word[@deprel="obj"]]
//word[@upos="VERB" and word[@deprel="nsubj"] and word[@deprel="obj"]]
"""

extractor = DependencyExtractor(xpaths, "form")
sentences = ConllIterator("sample/sample_corpus.conllu")
counter = extractor.count(sentences)
for s, f in counter.most_common(10):
    print(f"sequence: {s}; frequency: {f}")

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
conll_iterator		conll_iterator
sample		sample
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
example.ipynb		example.ipynb
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CONTENTS OF THIS FILE

INTRODUCTION

SETUP

GETTING STARTED

EXAMPLES

Training word2vec

Keyword extraction via tf-idf

Sequence extraction based on dependency tree

About

Releases

Packages

Languages

License

nicolaCirillo/conll_iterator

Folders and files

Latest commit

History

Repository files navigation

CONTENTS OF THIS FILE

INTRODUCTION

SETUP

GETTING STARTED

EXAMPLES

Training word2vec

Keyword extraction via tf-idf

Sequence extraction based on dependency tree

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages