Neural Text Process lib

1 Introduction

A Python library for neural text processing that generates data for sequence tagging.

It supports feature templates, which extract context-based features from text, and hybrid feature templates, which are often used in neural network sequence labeling.

2 Fields

This lib uses 'fields' to specify the format of the input data. In a template file, the field definition appears on the first and second lines.

In particular, there are several reserved fields named 'w', 'y', 'x' and 'F', which represent the token in the text, the label of the token, the representation of the current token and the corresponding features, respectively. You should never use them to name your own additional features.

For example, here is some data:

我 C S
爱 C S
北 C B
京 C E
天 C B
安 C M
门 C E
。 P S

Each line consists of multiple columns. The first column is the token itself (field 'w') and the last column is its label (field 'y'). The second column is the character type of the token: for example, a hanzi character is 'C', a letter is 'E', a number is 'N' and a punctuation mark is 'P'.

So you can define a template for this data like:

# Fields
w T y
# Templates
w:-1
w: 0
w: 1
T: 0

This template uses the field name 'T' to specify the second column. You can use any string as a field name except the reserved names 'w', 'y', 'x' and 'F'.
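The template layout above can be read with a few lines of Python. The following is only a minimal sketch: parse_template is a hypothetical helper written for this README, not the library's own parser, and it assumes that every template line has the form name:offset.

```python
def parse_template(text):
    """Split a template definition into (fields, templates).

    Hypothetical helper: it only mirrors the file layout shown above and
    is not the parser used inside the library.
    """
    fields, templates, section = [], [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Section headers are "# Fields" and "# Templates".
            section = line.lstrip("#").strip().lower()
        elif section == "fields":
            fields.extend(line.split())
        elif section == "templates":
            # Assumed line format: "<field name>:<relative offset>".
            name, offset = line.split(":")
            templates.append((name.strip(), int(offset)))
    return fields, templates


TEMPLATE = """\
# Fields
w T y
# Templates
w:-1
w: 0
w: 1
T: 0
"""

fields, templates = parse_template(TEMPLATE)
print(fields)     # ['w', 'T', 'y']
print(templates)  # [('w', -1), ('w', 0), ('w', 1), ('T', 0)]
```

Each template entry pairs a field name with an offset relative to the current token, so ('w', -1) means "the token one position to the left".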

3 Basic Feature Template

A basic feature template (src.feature.Template) is used to extract context-based features from text.

3.1 Prefix

Feature template prefixes can be enabled or disabled. For example, here are a few context-based feature templates:

# Fields
w y
# Templates
w:-2
w:-1
w: 0
w: 1
w: 2

Given the sentence "我爱北京天安门。", it extracts the following features for "北" and "京":

Prefix enabled

'北': ['w[-2]:我', 'w[-1]:爱', 'w[0]:北', 'w[1]:京', 'w[2]:天']
'京': ['w[-2]:爱', 'w[-1]:北', 'w[0]:京', 'w[1]:天', 'w[2]:安']

Prefix disabled

'北': ['我', '爱', '北', '京', '天']
'京': ['爱', '北', '京', '天', '安']

With prefixes disabled, the prefix 'w[n]:' disappears. Disabling prefixes is useful for extracting the raw words in a window.
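The difference between the two modes can be illustrated with a small sketch. The function extract below is hypothetical and handles only the field 'w'; padding positions outside the sentence with an empty string is an assumption made for this example, not confirmed library behaviour.

```python
def extract(tokens, offsets, prefix=True, pad=""):
    """Return one feature list per token for the single field 'w'.

    Hypothetical sketch: `pad` fills positions that fall outside the
    sentence; the empty string is an assumption.
    """
    features = []
    for i in range(len(tokens)):
        row = []
        for off in offsets:
            j = i + off
            value = tokens[j] if 0 <= j < len(tokens) else pad
            # With prefixes enabled, each value is tagged with its offset.
            row.append(f"w[{off}]:{value}" if prefix else value)
        features.append(row)
    return features


sentence = list("我爱北京天安门。")
offsets = [-2, -1, 0, 1, 2]

with_prefix = extract(sentence, offsets, prefix=True)
without_prefix = extract(sentence, offsets, prefix=False)
print(with_prefix[2])     # features for '北'
print(without_prefix[3])  # raw window for '京'
```

The prefix keeps features from different positions distinct (for example 'w[-1]:爱' versus 'w[1]:爱'), which matters when features are mapped to indices; the unprefixed form is just the raw tokens of the window.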

3.2 Usage

The Template class is easy to use: create an instance with temp = Template(template_file, prefix), then pass temp as a parameter.

4 HybridTemplate

A HybridTemplate (src.features.HybridTemplate) is a combination of a prefix-enabled Template and a prefix-disabled Template. It generates both the window representation and the context features.

4.1 Explanation

For example, suppose the window size is 3, which means each token is represented by itself together with its left and right neighboring tokens, and the template is:

# Fields
w y
# Templates
w:-2
w:-1
w: 0
w: 1
w: 2

Given the sentence "我爱北京天安门。", it uses the following windows to represent "北" and "京":

'北': ['爱', '北', '京']
'京': ['北', '京', '天']

It also extracts the following features for "北" and "京" (prefixes are enabled by default):

'北': ['w[-2]:我', 'w[-1]:爱', 'w[0]:北', 'w[1]:京', 'w[2]:天']
'京': ['w[-2]:爱', 'w[-1]:北', 'w[0]:京', 'w[1]:天', 'w[2]:安']
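The two outputs of a hybrid template can be sketched as follows. The helpers window_repr, context_features and hybrid are hypothetical names used only for illustration; the sketch assumes an odd window size centred on the current token and an empty-string pad for positions outside the sentence.

```python
def token_at(tokens, j, pad=""):
    """Return tokens[j], or `pad` when j is outside the sentence."""
    return tokens[j] if 0 <= j < len(tokens) else pad


def window_repr(tokens, i, window=3):
    """Raw tokens in a window centred on position i (assumes odd `window`)."""
    half = window // 2
    return [token_at(tokens, j) for j in range(i - half, i + half + 1)]


def context_features(tokens, i, offsets):
    """Prefixed context features for position i, one per template offset."""
    return [f"w[{off}]:{token_at(tokens, i + off)}" for off in offsets]


def hybrid(tokens, i, window, offsets):
    """Return both the window representation and the context features."""
    return window_repr(tokens, i, window), context_features(tokens, i, offsets)


sentence = list("我爱北京天安门。")
x, feats = hybrid(sentence, 2, window=3, offsets=[-2, -1, 0, 1, 2])
print(x)      # window representation of '北'
print(feats)  # context features of '北'
```

The window size and the template offsets are independent: here the representation covers three tokens, while the features still span five positions.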

4.2 Usage

The HybridTemplate class is easy to use: create an instance with temp = HybridTemplate(template_file, window), then pass temp as a parameter.

5 Evaluation for BIO/BISO tagged sequences

This project provides an evaluation method for BIO/BISO tagged sequences. The labels must conform to one of the following formats, BIO (first block) or BISO (second block):

O
B-name or I-name

O
B-name or I-name
S-name # Single token entity

For example, NER tasks usually have multiple kinds of entities to be recognized, typically 'PER', 'LOC' and 'ORG'.

So in this case, the label set could be:

# others
O
# Begin token of a certain type entity
B-PER
B-ORG
B-LOC
# Inside token of a certain type entity
I-PER
I-LOC
I-ORG

If the label set is not subdivided like this, just attach a suffix such as '-ANYTHING' to every non-O label to prevent the program from going wrong.
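To make the label format concrete, here is a minimal sketch of entity-level evaluation over BIO/BISO labels. It is not the evaluation code shipped in this project: extract_entities and prf are hypothetical helpers, and treating a stray I- tag as the start of a new entity is an assumption of this sketch.

```python
def extract_entities(labels):
    """Collect (start, end, type) spans from a BIO/BISO label sequence."""
    entities, start, kind = set(), None, None
    # A trailing "O" sentinel closes an entity that ends the sequence.
    for i, label in enumerate(list(labels) + ["O"]):
        tag, _, name = label.partition("-")
        # An I- tag of the same type continues the open span.
        if tag == "I" and start is not None and name == kind:
            continue
        if start is not None:
            entities.add((start, i - 1, kind))
            start, kind = None, None
        if tag == "S":
            entities.add((i, i, name))
        elif tag in ("B", "I"):
            # Assumption: a stray I- tag opens a new span.
            start, kind = i, name
    return entities


def prf(gold, pred):
    """Entity-level precision, recall and F1 (exact span and type match)."""
    g, p = extract_entities(gold), extract_entities(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O", "S-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O", "S-ORG"]
precision, recall, f1 = prf(gold, pred)
print(sorted(extract_entities(gold)))
print(precision, recall, f1)
```

Because the type is read from the part after the hyphen, a label such as 'B' without a '-name' suffix would yield an empty type, which is one reason a suffix on every non-O label keeps the parsing uniform.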

History

2018-07-28 ver 0.2.2
- Evaluation for BIO/BISO tagged sequence update.
2018-01-09 ver 0.2.1
- Prefix, not suffix(Ah my poor English:sweat_smile:).
2017-10-30 ver 0.2.0
- New HybridTemplate support
  - Window-representation for current token, i.e. Xt = [Wt-l,...,Wt+r], you can represent Wt by concatenating vector.
  - Traditional context-based features.
- Compatible modification.
  - Reserved fields are now {'w', 'y', 'x', 'F'}. And use field 'x' to represent Wt now.
  - The map word2idx is generated from statistics on field 'x'.
  - The shape of returned tensor of method 'src.pretreatment.conv_corpus' has changed.
2017-09-25 ver 0.1.3
- Add new method to return the size of feature templates
- Replace both 'START'&'END' tag with ''
2017-09-12 ver 0.1.2
- label2idx's index starts from 1
- Index of unknown words or labels will be 0
2017-09-04 ver 0.1.1
- Index of feature ‘OOV’ set to default 0
- label2idx's index starts from 0
2017-08-26 ver 0.1.0
- First version
