Merge pull request #164 from Nickname1230/master
update code
Bond-H committed Jan 4, 2021
2 parents 7677216 + a042e2c commit e79336a
Showing 17 changed files with 818 additions and 340 deletions.
4 changes: 4 additions & 0 deletions Changelog
@@ -0,0 +1,4 @@
2020-10-25: version 2.1.0
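1. The part-of-speech tagging model now works at mixed character/word granularity, improving performance by up to 20% with no change in accuracy.
2. Added word importance classification, which labels how important each word is while keeping LAC's part-of-speech tagging results.
3. Fixed slow model training.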
42 changes: 40 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
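## Introduction
LAC, short for Lexical Analysis of Chinese, is a joint lexical analysis tool developed by Baidu's Natural Language Processing Department. It performs Chinese word segmentation, part-of-speech tagging, and named entity recognition. Its main features and advantages are:
- **Effective**: a deep learning model jointly learns word segmentation, part-of-speech tagging, and named entity recognition. Overall F1 exceeds 0.91, part-of-speech tagging F1 exceeds 0.94, and named entity recognition F1 exceeds 0.85, which is industry-leading.
- **Effective**: a deep learning model jointly learns word segmentation, part-of-speech tagging, named entity recognition, and word importance. Overall F1 exceeds 0.91, part-of-speech tagging F1 exceeds 0.94, and named entity recognition F1 exceeds 0.85, which is industry-leading.
- **Efficient**: with streamlined model parameters and the performance optimizations of the Paddle inference library, single-threaded CPU performance reaches 800 QPS, which is industry-leading.
- **Customizable**: a simple, controllable intervention mechanism lets a user dictionary override the model through exact matching. The dictionary supports long segments, which makes interventions more precise.
- **Easy to use**: **one-step installation is supported**, and Python, Java, and C++ interfaces with usage examples are provided for quick invocation and integration.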
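@@ -16,7 +16,7 @@ LAC, short for Lexical Analysis of Chinese, is developed by Baidu's Natural Language Processing
The code is compatible with Python 2 and 3.
- Fully automatic installation: `pip install lac`
- Semi-automatic download: download the package from [http://pypi.python.org/pypi/lac/](http://pypi.python.org/pypi/lac/), extract it, and run `python setup.py install`
- After installation, run `lac` or `lac --segonly` on the command line to start the service and try it out
- After installation, run `lac`, `lac --segonly`, or `lac --rank` on the command line to start the service and try it out

> On networks in mainland China, installing from the Baidu mirror is faster: `pip install lac -i https://mirror.baidu.com/pypi/simple`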
@@ -82,6 +82,44 @@ lac_result = lac.run(texts)
| c | conjunction | u | particle | xc | other function word | w | punctuation |
| PER | person name | LOC | place name | ORG | organization name | TIME | time |

#### Word importance
- Code example:
```python
from LAC import LAC

# Load the word importance model
lac = LAC(mode='rank')

# Single input: a Unicode string
text = u"LAC是个优秀的分词工具"
rank_result = lac.run(text)

# Batch input: a list of sentences; average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
rank_result = lac.run(texts)
```
- Output:

```text
[Single sample]: rank_result = [['LAC', '是', '个', '优秀', '的', '分词', '工具'],
[nz, v, q, a, u, n, n],[3, 0, 0, 2, 0, 3, 1]]
[Batch]: rank_result = [
(['LAC', '是', '个', '优秀', '的', '分词', '工具'],
[nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]),
(['百度', '是', '一家', '高科技', '公司'],
[ORG, v, m, n, n], [3, 0, 2, 3, 1])
]
```
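
As a rough illustration of how a rank-mode result might be consumed, the sketch below pairs each word with its tag and importance level and keeps only the words at or above a threshold. It works on the single-sample output copied from the README above rather than calling LAC, and `top_words` and `min_rank` are hypothetical names introduced here, not part of the LAC API.

```python
# Single-sentence result copied from the README output above.
# Tags are written as strings here; the README shows them unquoted.
sample = [
    ['LAC', '是', '个', '优秀', '的', '分词', '工具'],
    ['nz', 'v', 'q', 'a', 'u', 'n', 'n'],
    [3, 0, 0, 2, 0, 3, 1],
]


def top_words(rank_result, min_rank=2):
    """Return (word, tag, rank) triples whose rank is at least min_rank.

    Hypothetical helper, not part of LAC. It expects one sentence's
    (words, tags, ranks) triple.
    """
    words, tags, ranks = rank_result
    return [(w, t, r) for w, t, r in zip(words, tags, ranks) if r >= min_rank]


print(top_words(sample))
```

For batch input, the same helper could be applied to each element of the returned list in turn.
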
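The word importance labels are listed in the table below; words are classified on a 4-level scale:

| Label | Meaning | Common parts of speech |
| ---- | -------- | ---- |
| 0 | Redundant words in the query | p, w, xc ... |
| 1 | Weakly restrictive words in the query | r, c, u ... |
| 2 | Strongly restrictive words in the query | n, s, v ... |
| 3 | Core words in the query | nz, nw, LOC ... |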


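#### Customization

On top of the model output, LAC also lets users configure custom segmentation results and named entity types. When the model's prediction matches an item in the dictionary, the custom result replaces the original one. For more precise matching, an item may be a long segment made up of multiple words.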
4 changes: 2 additions & 2 deletions python/LAC/__init__.py
@@ -17,7 +17,7 @@
 ################################################################################

 name = 'lac'
-version = "2.0.5"
-version_info = (2, 0, 5, 0)
+version = "2.1.0"
+version_info = (2, 1, 0, 0)

 from .lac import LAC
15 changes: 12 additions & 3 deletions python/LAC/cmdline.py
@@ -26,8 +26,10 @@

 import argparse
 parser = argparse.ArgumentParser(description='LAC Init Argments')
-parser.add_argument('--segonly', action='store_true',
+parser.add_argument('--segonly', action='store_true',
                     help='run segment only if setting')
+parser.add_argument('--rank', action='store_true',
+                    help='run rank model if setting')
 args = parser.parse_args()

 __all__ = [
@@ -43,9 +45,12 @@ def main(args=args):

     if args.segonly:
         lac = LAC(mode='seg')
+    elif args.rank:
+        lac = LAC(mode='rank')
     else:
         lac = LAC()

+
     while True:
         line = sys.stdin.readline()
         if not line:
@@ -54,9 +59,13 @@ def main(args=args):
         line = strdecode(line.strip())
         if args.segonly:
             print(u" ".join(lac.run(line)))
-        else:
+        elif args.rank:
+            words, tags, words_rank = lac.run(line)
+            print(u" ".join(u"%s/%s" % (word, rank)
+                            for word, tag, rank in zip(words, tags, words_rank)))
+        else :
             words, tags = lac.run(line)
             print(u" ".join(u"%s/%s" % (word, tag)
                             for word, tag in zip(words, tags)))

-    return 0
+    return 0
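
The three output branches in the loop above can be summarized as one pure function, which makes the per-mode formatting easy to check without loading a model. This is only a sketch of the behavior visible in the diff: `format_line` is a hypothetical helper, not code from this commit, and it assumes the result shapes that `cmdline.py` itself unpacks (a word list for `seg`, a `(words, tags, ranks)` triple for `rank`, and a `(words, tags)` pair otherwise).

```python
def format_line(mode, result):
    """Render one sentence's result the way the CLI loop prints it.

    Hypothetical helper for illustration only; not part of this commit.
    """
    if mode == 'seg':
        # Segmentation-only mode: the result is a plain list of words.
        return u" ".join(result)
    if mode == 'rank':
        # Rank mode: the result is (words, tags, ranks). The CLI prints
        # word/rank and does not print the tag.
        words, _tags, ranks = result
        return u" ".join(u"%s/%s" % (w, r) for w, r in zip(words, ranks))
    # Any other mode: the result is (words, tags), printed as word/tag.
    words, tags = result
    return u" ".join(u"%s/%s" % (w, t) for w, t in zip(words, tags))


print(format_line('rank', (['LAC', '是'], ['nz', 'v'], [3, 0])))
```

Note that in rank mode the tags are still returned by `lac.run` but are dropped from the printed line.
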
4 changes: 2 additions & 2 deletions python/LAC/custom.py
@@ -24,11 +24,11 @@
 import logging

 try:
-    from .triedtree import TriedTree
     from ._compat import strdecode
+    from .prefix_tree import TriedTree
 except:
-    from triedtree import TriedTree
     from _compat import strdecode
+    from prefix_tree import TriedTree


 class Customization(object):