Chinese character vectors #18

michael92ht · 2018-05-31T03:19:41Z

When will vector representations for Chinese characters be available?

shenshen-hungry · 2018-05-31T04:56:30Z

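We have been working on word vectors recently; character vectors will be provided later.
In my personal view, downstream tasks generally perform better with word segmentation than without it. Character vectors may simply compress too many kinds of information into a single vector.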

rubby33 · 2018-06-03T00:04:06Z

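@shenshen-hungry Thank you. Still, in some specific scenarios where sentences or utterances are very short, character vectors are necessary. Thanks for your hard work.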

zhezhaoa · 2018-06-06T02:25:25Z

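@michael92ht @rubby33 Thanks for the suggestion; we will add them as soon as possible. If you need them now, you can download the context vectors from the Various Co-occurrence Information table and keep only the entries of length one; those are the character vectors. These vectors also have nice properties: for example, “吃” and “食”, or “厅” and “堂”, end up very close to each other.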

aluminumbox · 2018-06-07T08:03:05Z

@zhezhaoa Do you mean the Context Word Vectors whose Co-occurrence Type is Word → Character (1)? Or will any of the Context Word Vectors do?

shenshen-hungry · 2018-06-07T10:00:21Z

@aluminumbox The character context includes character n-grams, among them unigrams, i.e. single Chinese characters.

zhezhaoa · 2018-06-08T04:17:31Z

@aluminumbox Yes. Take the context vectors from Word → Character (1) and keep only the single Chinese characters; those are the Chinese character vectors.
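
The extraction described above can be sketched as follows. This is only a minimal illustration, assuming the downloaded file is in the word2vec text format (an optional `<vocab_size> <dim>` header line, then one `token v1 v2 ...` entry per line); the function name `extract_char_vectors` and the sample data are hypothetical.

```python
import io


def extract_char_vectors(lines):
    """Keep only the entries whose token is exactly one character long."""
    char_vectors = {}
    for index, line in enumerate(lines):
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        if index == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # "<vocab_size> <dim>" header (assumed format)
        token, values = parts[0], parts[1:]
        if len(token) == 1:
            char_vectors[token] = [float(v) for v in values]
    return char_vectors


# Tiny in-memory stand-in for a downloaded context vector file.
sample = io.StringIO(
    "4 3\n"
    "吃 0.1 0.2 0.3\n"
    "中国 0.4 0.5 0.6\n"
    "食 0.1 0.2 0.4\n"
    "一个 0.7 0.8 0.9\n"
)
chars = extract_char_vectors(sample)
```

For a real file, an open text handle (for example `open(path, encoding="utf-8")`) can be passed instead of `sample`. Note that the length test also keeps single punctuation marks, digits, and rare symbols, so further filtering may be needed.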

michael92ht · 2018-06-12T11:42:05Z

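@zhezhaoa @shenshen-hungry I tried it. The set of single characters in there is quite rich; even some very unusual characters have vectors. The result is still fairly large, though, and I would prefer embeddings for common Chinese characters only. I still hope for character vectors trained specifically for Chinese characters.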

WangHexie · 2018-06-12T12:57:37Z

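@shenshen-hungry When you release the character vectors, could you sort them by frequency?
A vocabulary of nearly 700,000 entries is rather large. If it is sorted by frequency, we can just call readline() as many times as we need, without having to pick out the common characters ourselves.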

shenshen-hungry · 2018-06-12T15:35:58Z

@WangHexie The tokens in all of the word vector files are sorted by frequency, from highest to lowest.
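
Because the files are frequency-sorted, the most common characters can be taken by reading from the top and stopping early. A minimal sketch, assuming the word2vec text format with a header line; `top_chars` and the sample data are hypothetical names used only for illustration.

```python
import io
from itertools import islice


def top_chars(lines, n):
    """Return the first n single-character tokens of a frequency-sorted file."""
    rows = iter(lines)
    next(rows, None)  # skip the "<vocab_size> <dim>" header (assumed present)
    tokens = (line.split(" ", 1)[0] for line in rows)
    singles = (token for token in tokens if len(token) == 1)
    return list(islice(singles, n))  # stops reading once n are found


sample = io.StringIO(
    "5 2\n"
    "的 0.1 0.2\n"
    "中国 0.3 0.4\n"
    "是 0.5 0.6\n"
    "一 0.7 0.8\n"
    "一个 0.9 1.0\n"
)
common = top_chars(sample, 2)
```

Since the generators are lazy, only as many lines as needed are read from the file.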

EricLingRui · 2018-10-16T06:37:07Z

Could you just publish the word frequencies directly?

shenshen-hungry · 2018-10-16T09:03:51Z

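@EricLingRui There are no plans to release word frequencies in the near future. In practice, word frequency is not particularly important for most tasks that need word vectors. Besides, the word vector files are ordered by word frequency, which should cover the vast majority of needs.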

AliceHuJie · 2018-10-23T02:47:47Z

Where can I get the trained character vectors?

shenshen-hungry · 2018-10-23T03:41:13Z

@AliceHuJie On the project homepage, under Various Co-occurrence Information, in the Context Word Vectors for the Character feature.

sunnychou0330 · 2018-11-09T02:35:37Z

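Plain word vectors and plain character vectors are both fairly easy to train. How are word+character and word+character+n-gram trained? What is the input to the neural network model, and could you explain the principle behind the implementation? Is there any code? A published paper would also be fine. Thank you.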

shenshen-hungry · 2018-11-09T03:23:13Z

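@sunnychou0330 We trained them with the toolkit from the ngram2vec project; see the ngram2vec paper for details.
The principle behind word+char is roughly as shown in the figure: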

sunnychou0330 · 2018-11-09T06:52:15Z

@shenshen-hungry Thank you. I will look into it and come back if I have further questions.

Ramlinbird · 2018-11-30T03:54:16Z

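I now want to build sentence vectors from the word vectors, following the SIF approach, which uses a word frequency file. Could you share the word frequencies as well? (Baidu Baike corpus)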

shenshen-hungry · 2018-11-30T04:18:52Z

@Ramlinbird Actually, you can just estimate them with Zipf's Law :)
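
One way this estimate could look: Zipf's law approximates the frequency of the token at rank r as proportional to 1/r^s, and the rank is simply the token's position in the frequency-sorted vector file. The sketch below rests on assumptions not stated in this thread: an exponent s = 1, normalization over the loaded vocabulary, and the SIF weight a / (a + p(w)) with an illustrative a = 1e-3. The function names are hypothetical.

```python
def zipf_probabilities(vocab_size, exponent=1.0):
    """Estimate unigram probabilities from rank alone: p(r) ∝ 1 / r**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(raw)
    return [value / total for value in raw]


def sif_weights(tokens, a=1e-3, exponent=1.0):
    """Map each token (in frequency order) to the SIF weight a / (a + p(w))."""
    probs = zipf_probabilities(len(tokens), exponent)
    return {token: a / (a + p) for token, p in zip(tokens, probs)}


# Tokens must be listed in the same order as in the vector file.
weights = sif_weights(["的", "是", "向量"])
```

Frequent tokens receive smaller weights than rare ones, which is the behavior SIF relies on. The estimate is rough, so the exponent and `a` are worth tuning on the downstream task.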

Lindahe0707 · 2018-12-01T01:18:00Z

@shenshen-hungry In the co-occurrence section, what does the (1) in Word → Character (1) mean? I see (1-2) and (1-4) below it. Thanks.

shenshen-hungry · 2018-12-01T12:11:02Z

@Lindahe0707 The numbers refer to character-level n-grams; 1-2 covers character unigrams and bigrams.
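
To make the ranges concrete, here is a small sketch of character n-gram enumeration for a single word. It only illustrates what "1", "1-2", and "1-4" cover; the actual context extraction in the ngram2vec toolkit may differ in details, and `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(word, min_n=1, max_n=2):
    """List the character n-grams of a word for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


unigrams = char_ngrams("中国人", 1, 1)   # the "(1)" setting
uni_and_bi = char_ngrams("中国人", 1, 2)  # the "(1-2)" setting
```

Only the unigrams are single characters, which is why filtering the context vectors by length one yields character vectors.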

Lindahe0707 · 2018-12-06T03:07:18Z

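@shenshen-hungry Thanks! We have run into a problem: using the method described above (“take the context vectors and keep the entries of length one; those are the character vectors”), we get only about 7,000 characters, and many common characters are missing, such as “乐”, “玻”, and “璃”. Have you noticed this?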

shenshen-hungry · 2018-12-10T08:46:48Z

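@Lindahe0707 I just checked: most of the files contain more than ten thousand characters, and all of the characters you listed are present. Could you try again?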

wzs951015 · 2019-04-05T13:18:34Z

Hello, are character vectors available now? Or can they still only be extracted from word->character?

shenshen-hungry · 2019-04-08T05:28:47Z

@wzs951015 That is how it is done for now. The word vectors here are sufficient for most tasks.

BrikerMan · 2019-05-14T08:56:53Z

Looking forward to character vectors.

zengzenghe · 2019-05-20T02:45:08Z

How should tokens that are missing from the word vectors be handled?

shenshen-hungry · 2019-05-20T02:56:48Z

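@zengzenghe If you use the word vectors on their own, you can represent unseen tokens with the mean vector. If you use them to initialize a model's embedding layer, you can use either the mean vector or random initialization.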
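
Both fallbacks can be sketched in a few lines. This is only an illustration under assumptions: vectors are held in a plain dict of token to list of floats, and the helper names, the uniform range, and the seed are all hypothetical choices.

```python
import random


def mean_vector(vectors):
    """Average all known vectors component-wise."""
    dim = len(next(iter(vectors.values())))
    sums = [0.0] * dim
    for vec in vectors.values():
        for i, value in enumerate(vec):
            sums[i] += value
    return [total / len(vectors) for total in sums]


def random_vector(dim, scale=0.1, seed=0):
    """Small uniform random vector, e.g. to initialize an embedding row."""
    rng = random.Random(seed)
    return [rng.uniform(-scale, scale) for _ in range(dim)]


def lookup(token, vectors, fallback):
    """Return the token's vector, or the fallback for unseen tokens."""
    return vectors.get(token, fallback)


vectors = {"吃": [1.0, 2.0], "食": [3.0, 4.0]}
unk = mean_vector(vectors)
oov = lookup("龘", vectors, unk)  # token absent from the vocabulary
```

When the vectors only initialize a trainable embedding layer, the random variant lets the model learn the unseen token's row during training.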

matthew77 · 2019-05-23T13:54:19Z

Is there any difference among the character vectors in Word → Character (1), Word → Character (1-2), and Word → Character (1-4)? If so, which performs best?

shenshen-hungry · 2019-05-23T15:26:25Z

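@matthew77 The difference lies in the character contexts used during training; for example, 1-2 means character unigrams and character bigrams. In practice, you have to try them on your downstream task to find out which works best.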

matthew77 · 2019-05-24T07:41:26Z

I am working on an entity tagging project. The training set is annotated at the character level, for example:

西 B-SHS
子 I-SHS
电 I-SHS
梯 I-SHS
于 O
2 B-CHD
0 I-CHD
1 I-CHD
5 I-CHD

Since the training set is not very large, I would like to use pre-trained character vectors. In this case, which character vectors would work better?

shenshen-hungry · 2019-05-25T01:42:46Z

@matthew77 I suggest downloading them all and trying them; different character vectors may behave differently on different tasks.

matthew77 · 2019-05-26T09:42:54Z

If you release dedicated character vectors in the future, how would they differ from the character vectors currently in “Various Co-occurrence Information”?

shenshen-hungry · 2019-05-27T02:44:28Z

@matthew77 Essentially not much, because they use almost the same information.

matthew77 · 2019-06-18T08:30:33Z

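Thanks!
One more question: could I take the single Chinese characters from Facebook's fastText vectors and use them as my character vectors? Is that a reasonable approach?
Likewise, could I test the various Context Word Vectors in “Various Co-occurrence Information”, for example by separating the character vectors out of the Ngram ones? Is that reasonable?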

shenshen-hungry · 2019-06-18T08:33:55Z

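@matthew77 Yes, you can. However, Facebook's fastText vectors were trained on Wikipedia, and for downstream tasks the Baidu Baike corpus works better than the Wikipedia corpus; see the two papers on this project's homepage for details.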

matthew77 · 2019-06-18T08:57:34Z

Thanks! Then I will test each of them. So far I have tried fastText and Word → Character (1), and fastText performs considerably better.

shenshen-hungry · 2019-06-18T08:59:19Z

@matthew77 You could try the Word → Character (1-4) context vectors.

guotong1988 · 2019-09-20T03:43:41Z

Are they available now?
《Is Word Segmentation Necessary for Deep Learning of Chinese Representations?》

shenshen-hungry · 2019-09-20T06:59:10Z

@guotong1988 The Word → Character (1-4) context vectors contain character vectors. They were trained jointly with word vectors, so they carry richer information.

deardeerluluu · 2019-09-30T08:52:11Z

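Dear author, after unpacking the character vector file you mentioned, I see that some entries are not single characters (for example “中国” and “一个”).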

shenshen-hungry · 2019-09-30T08:54:18Z

@deardeerluluu The file contains character vectors; it does not contain only character vectors.

MrRace · 2020-11-13T07:41:41Z

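Will you train dedicated character vectors? I found that the character vectors in Context Word Vectors perform worse than character vectors trained purely on characters.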

Embedding · 2020-11-14T12:52:41Z

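Most BERT models are character-based.
You can download a BERT model and extract its embeddings to obtain character vectors.
One option is to download a pre-trained BERT model from here: https://github.com/dbiir/UER-py/wiki/Modelzoo
Then extract the embeddings with this script: https://github.com/dbiir/UER-py/blob/master/scripts/extract_embeddings.py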

shenshen-hungry · 2020-11-14T13:19:57Z

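@MrRace Regarding “character vectors from Context perform worse than separately trained ones”: have you run a concrete comparison? Under what conditions do they differ, and by how much?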

MrRace · 2020-11-16T01:41:44Z

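I tested them separately in fastText, using each of the two sets of pre-trained vectors. The result may also depend on the dataset or the amount of data. I have forgotten exactly how large the gap was. I will look for the results later and may run a further comparison.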

shenshen-hungry · 2020-11-17T06:05:02Z

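@MrRace I suspect it depends on the dataset: vectors trained on a dataset that is larger and more similar to the downstream task tend to perform better. If convenient, please share the specific experimental setup, such as the dataset and parameters, so that readers can continue the discussion.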

zhangmianhongni · 2021-08-05T07:45:13Z

How can character vectors be used to compare two words of different lengths, such as “你好” and “我爱你”?

shenshen-hungry mentioned this issue Jul 16, 2019

Is there a link to Chinese character vectors? #79

Closed

shenshen-hungry mentioned this issue Sep 20, 2019

Character vectors seem quite important; it would be great if standalone character vectors were open-sourced #88

Closed
