A neural text process python lib for context-based feature extraction on Seq-Tagging data.

heshenghuan/ContextFeatureExtractor

Neural Text Process lib

1 Introduction

A Python library for generating sequence-tagging data for neural text processing.

It supports feature templates, which extract context-based features from text, and hybrid feature templates, which are often used in neural network sequence labeling.

2 Fields

This library uses 'fields' to specify the format of the input data. In a template file, the field definition appears in the first two lines.

In particular, four field names are reserved: 'w', 'y', 'x' and 'F'. They represent the token, the label, the representation of the current token, and the corresponding features, respectively. Never use them to name your own additional features.

For example, given the following data:

我 C S
爱 C S
北 C B
京 C E
天 C B
安 C M
门 C E
。 P S

Each line consists of multiple columns; the first column is the token itself (field 'w') and the last column is its label (field 'y'). The second column of each line is the character type of the token: for example, a hanzi character is 'C', a letter is 'E', a number is 'N' and a punctuation mark is 'P'.

So you can define a template for this data like:

# Fields
w T y
# Templates
w:-1
w: 0
w: 1
T: 0

Here the field name 'T' specifies the second column. Any string other than 'w', 'y', 'x' and 'F' can be used as a field name.
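To make the template layout concrete, here is a minimal sketch of how such a file could be read. It is not the library's own parser: the function names `parse_template` and `parse_rows` are hypothetical, and it assumes the field line follows a "# Fields" comment and each template line has the form `field:offset`.

```python
def parse_template(text):
    """Split template text into (fields, templates).

    Hypothetical helper, not part of the library. Assumes comment lines
    starting with '#' name the section that follows.
    """
    fields, templates = [], []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            section = line.lstrip("#").strip().lower()
            continue
        if section == "fields":
            fields = line.split()
        elif section == "templates":
            name, offset = line.split(":")
            templates.append((name.strip(), int(offset)))
    return fields, templates


def parse_rows(text, fields):
    """Map each whitespace-separated column of a data line to its field name."""
    return [dict(zip(fields, line.split()))
            for line in text.splitlines() if line.strip()]


TEMPLATE = """\
# Fields
w T y
# Templates
w:-1
w: 0
w: 1
T: 0
"""

DATA = """\
我 C S
爱 C S
北 C B
京 C E
"""

fields, templates = parse_template(TEMPLATE)
rows = parse_rows(DATA, fields)
print(fields)      # field names in column order
print(templates)   # (field, offset) pairs
print(rows[2])     # the row for '北'
```

Keeping the templates as `(field, offset)` pairs makes it straightforward to look up any column at any relative position later.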

3 Basic Feature Template

A basic feature template (src.feature.Template) is used to extract context-based features from text.

3.1 Prefix

Feature-template prefixes can be enabled or disabled. For example, consider the following context-based feature templates:

# Fields
w y
# Templates
w:-2
w:-1
w: 0
w: 1
w: 2

Given the sentence "我爱北京天安门。", it will extract the following features for "北" and "京":

  1. Prefix enabled
'北': ['w[-2]:我', 'w[-1]:爱', 'w[0]:北', 'w[1]:京', 'w[2]:天']
'京': ['w[-2]:爱', 'w[-1]:北', 'w[0]:京', 'w[1]:天', 'w[2]:安']
  2. Prefix disabled
'北': ['我', '爱', '北', '京', '天']
'京': ['爱', '北', '京', '天', '安']

With prefixes disabled, the 'w[n]:' prefix is dropped. This mode can be used to extract the raw words in a window.
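The two modes can be illustrated with a short sketch. This is not the library's implementation: the function `extract` and the padding token `PAD` are hypothetical, and how the library fills positions outside the sentence is not documented here.

```python
PAD = "<PAD>"  # hypothetical padding token; an assumption, not the library's value


def extract(tokens, offsets, prefix=True, field="w"):
    """Return one feature list per token, built from the given offsets."""
    feats = []
    for i in range(len(tokens)):
        row = []
        for off in offsets:
            j = i + off
            # Positions outside the sentence fall back to the padding token.
            value = tokens[j] if 0 <= j < len(tokens) else PAD
            row.append(f"{field}[{off}]:{value}" if prefix else value)
        feats.append(row)
    return feats


sentence = list("我爱北京天安门。")
offsets = [-2, -1, 0, 1, 2]

with_prefix = extract(sentence, offsets, prefix=True)
without_prefix = extract(sentence, offsets, prefix=False)

print(with_prefix[2])     # features for '北'
print(without_prefix[3])  # raw window for '京'
```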

3.2 Usage

The Template class is easy to use: create an instance with temp = Template(template_file, prefix), and then pass temp as a parameter.

4 HybridTemplate

A HybridTemplate (src.features.HybridTemplate) combines a prefix-enabled Template and a prefix-disabled Template. It generates both the window representation and the context features.

4.1 Explanation

For example, suppose the window size is 3, which means each token is represented by itself together with its left and right neighboring tokens, and the template is:

# Fields
w y
# Templates
w:-2
w:-1
w: 0
w: 1
w: 2

Given the sentence "我爱北京天安门。", it will use the following windows to represent "北" and "京":

'北': ['爱', '北', '京']
'京': ['北', '京', '天']

It will also extract the following features for "北" and "京" (prefix enabled by default):

'北': ['w[-2]:我', 'w[-1]:爱', 'w[0]:北', 'w[1]:京', 'w[2]:天']
'京': ['w[-2]:爱', 'w[-1]:北', 'w[0]:京', 'w[1]:天', 'w[2]:安']
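The pairing of a window representation with prefixed context features can be sketched as follows. This is only an illustration under stated assumptions: the function `hybrid` and the padding token `PAD` are hypothetical, and the window is assumed to be odd-sized and symmetric around the current token.

```python
PAD = "<PAD>"  # hypothetical padding token; an assumption, not the library's value


def at(tokens, j):
    """Return the token at position j, or the padding token outside the sentence."""
    return tokens[j] if 0 <= j < len(tokens) else PAD


def hybrid(tokens, offsets, window=3, field="w"):
    """Return (x, F) per token: raw window tokens and prefixed context features."""
    half = window // 2  # assumes an odd, symmetric window
    result = []
    for i in range(len(tokens)):
        x = [at(tokens, i + d) for d in range(-half, half + 1)]
        F = [f"{field}[{off}]:{at(tokens, i + off)}" for off in offsets]
        result.append((x, F))
    return result


sentence = list("我爱北京天安门。")
pairs = hybrid(sentence, offsets=[-2, -1, 0, 1, 2], window=3)

x_bei, F_bei = pairs[2]
x_jing, F_jing = pairs[3]
print(x_bei, F_bei)
print(x_jing, F_jing)
```

The window size and the template offsets are independent here, which is why a window of 3 can coexist with features drawn from five positions.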

4.2 Usage

The HybridTemplate class is easy to use: create an instance with temp = HybridTemplate(template_file, window), and then pass temp as a parameter.

5 Evaluation for BIO/BISO tagged sequences

This project provides an evaluation method for BIO/BISO tagged sequences. Labels must conform to one of the following formats:

# BIO
O
B-name or I-name

# BISO
O
B-name or I-name
S-name # Single token entity

For example, NER tasks usually have several kinds of entities to be recognized, typically 'PER', 'LOC' and 'ORG'.

So in this case, the label set could be:

# others
O
# Begin token of a certain type entity
B-PER
B-ORG
B-LOC
# Inside token of a certain type entity
I-PER
I-LOC
I-ORG

If the label set is not subdivided like this, just append a suffix such as '-ANYTHING' to every non-O label to prevent the program from going wrong.
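To show why the 'TAG-name' format matters, here is a minimal sketch of span-level scoring over such labels. It is not the project's evaluation code: the functions `get_spans` and `prf` are hypothetical, exact-match span scoring is assumed, and an 'I-' label with no open span of the same name is simply ignored.

```python
def get_spans(labels):
    """Collect (start, end, name) entity spans from BIO/BISO labels (end exclusive)."""
    spans = set()
    start, kind = None, None
    # A trailing 'O' sentinel closes any span still open at the end.
    for i, label in enumerate(list(labels) + ["O"]):
        tag, _, name = label.partition("-")
        # Close the open span unless this label continues it.
        if start is not None and not (tag == "I" and name == kind):
            spans.add((start, i, kind))
            start, kind = None, None
        if tag == "B":
            start, kind = i, name
        elif tag == "S":
            spans.add((i, i + 1, name))
    return spans


def prf(gold, pred):
    """Exact-match precision, recall and F1 over entity spans."""
    g, p = get_spans(gold), get_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "S-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "O", "S-ORG"]

gold_spans = get_spans(gold)
precision, recall, f1 = prf(gold, pred)
print(sorted(gold_spans))
print(precision, recall, f1)
```

Because each label is split on the first '-', a bare 'B' or 'I' would lose its entity name, which is one reason a placeholder suffix is needed when the label set has no entity types.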

History

  • 2018-07-28 ver 0.2.2
    • Updated evaluation for BIO/BISO tagged sequences.
  • 2018-01-09 ver 0.2.1
    • Prefix, not suffix(Ah my poor English:sweat_smile:).
  • 2017-10-30 ver 0.2.0
    • New HybridTemplate support
      • Window representation for the current token, i.e. Xt = [Wt-l, ..., Wt+r]; Wt can be represented by concatenating the vectors.
      • Traditional context-based features.
    • Compatibility changes.
      • Reserved fields are now {'w', 'y', 'x', 'F'}. Field 'x' is now used to represent Wt.
      • The map word2idx is generated from statistics on field 'x'.
      • The shape of returned tensor of method 'src.pretreatment.conv_corpus' has changed.
  • 2017-09-25 ver 0.1.3
    • Add new method to return the size of feature templates
    • Replace both 'START'&'END' tag with ''
  • 2017-09-12 ver 0.1.2
    • label2idx's index starts from 1
    • Index of unknown words or labels will be 0
  • 2017-09-04 ver 0.1.1
    • Index of feature ‘OOV’ set to default 0
    • label2idx's index starts from 0
  • 2017-08-26 ver 0.1.0
    • First version
