CCF BDCI2021: Malicious SMS Variant Character Restoration for Black and Gray Market Governance


The following is a brief introduction to the solution for the CCF-BDCI2021 competition "Malicious SMS Variant Character Restoration for Black and Gray Market Governance". A Seq2seq model is fine-tuned to restore the corrupted text and generate the normal text. The pre-trained models used below can be found in the Modelzoo section of this wiki.
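The scripts below read tab-separated files. Here is a minimal sketch of building a toy train.tsv, assuming the standard UER-py text2text layout (a text_a/text_b header line, where text_a is the corrupted message and text_b is its restored form); the two example pairs are invented for illustration:

# Write a toy train.tsv for finetune/run_text2text.py.
# Assumption: UER-py text2text layout -- tab-separated, with a
# text_a/text_b header; text_a is the corrupted message and
# text_b is its restored form. Both pairs are invented examples.
pairs = [
    ("加薇信 lingqu 红苞", "加微信 lingqu 红包"),
    ("低息带款不看征信", "低息贷款不看征信"),
]

with open("datasets/corrupted_short_message_reconstruction/train.tsv",
          "w", encoding="utf-8") as f:
    f.write("text_a\ttext_b\n")
    for corrupted, restored in pairs:
        f.write(f"{corrupted}\t{restored}\n")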

BART-base

An example of fine-tuning and prediction with the Chinese pre-trained model BART-base on the malicious SMS variant-character restoration dataset:

CUDA_VISIBLE_DEVICES=0,1 python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_bart_base_seq512_model.bin-1000000 \
                                                           --vocab_path models/google_zh_vocab.txt \
                                                           --config_path models/bart/base_config.json \
                                                           --train_path datasets/corrupted_short_message_reconstruction/train.tsv \
                                                           --dev_path datasets/corrupted_short_message_reconstruction/dev.tsv \
                                                           --seq_length 192 --tgt_seq_length 192 --learning_rate 5e-5 --epochs_num 3 --batch_size 16

CUDA_VISIBLE_DEVICES=0,1 python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                                                  --vocab_path models/google_zh_vocab.txt \
                                                                  --config_path models/bart/base_config.json \
                                                                  --test_path datasets/corrupted_short_message_reconstruction/test.tsv \
                                                                  --prediction_path datasets/corrupted_short_message_reconstruction/prediction.tsv \
                                                                  --seq_length 192 --tgt_seq_length 192 --batch_size 256
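
To sanity-check a fine-tuned model before submitting, one can point --test_path at dev.tsv and compare the restored sentences against the references. A minimal sketch, under the same layout assumption as above and additionally assuming that prediction.tsv holds one restored sentence per line:

# Sentence-level exact-match rate of predictions against dev references.
# Assumes dev.tsv has a text_a/text_b header line and that prediction.tsv
# contains one restored sentence per line (both are assumptions).
with open("datasets/corrupted_short_message_reconstruction/dev.tsv",
          encoding="utf-8") as f:
    next(f)  # skip the header line
    refs = [line.rstrip("\n").split("\t")[1] for line in f]

with open("datasets/corrupted_short_message_reconstruction/prediction.tsv",
          encoding="utf-8") as f:
    preds = [line.rstrip("\n") for line in f]

matched = sum(p == r for p, r in zip(preds, refs))
print(f"exact match: {matched}/{len(refs)} = {matched / len(refs):.4f}")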

BART-large

An example of fine-tuning and prediction with the Chinese pre-trained model BART-large on the malicious SMS variant-character restoration dataset:

CUDA_VISIBLE_DEVICES=0,1 python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_bart_large_seq512_model.bin-1000000 \
                                                           --vocab_path models/google_zh_vocab.txt \
                                                           --config_path models/bart/large_config.json \
                                                           --train_path datasets/corrupted_short_message_reconstruction/train.tsv \
                                                           --dev_path datasets/corrupted_short_message_reconstruction/dev.tsv \
                                                           --seq_length 192 --tgt_seq_length 192 --learning_rate 5e-5 --epochs_num 3 --batch_size 16

CUDA_VISIBLE_DEVICES=0,1 python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                                                  --vocab_path models/google_zh_vocab.txt \
                                                                  --config_path models/bart/large_config.json \
                                                                  --test_path datasets/corrupted_short_message_reconstruction/test.tsv \
                                                                  --prediction_path datasets/corrupted_short_message_reconstruction/prediction.tsv \
                                                                  --seq_length 192 --tgt_seq_length 192 --batch_size 256

PEGASUS-base

An example of fine-tuning and prediction with the Chinese pre-trained model PEGASUS-base on the malicious SMS variant-character restoration dataset:

CUDA_VISIBLE_DEVICES=0,1 python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_pegasus_base_seq512_model.bin-1000000 \
                                                           --vocab_path models/google_zh_vocab.txt \
                                                           --config_path models/pegasus/base_config.json \
                                                           --train_path datasets/corrupted_short_message_reconstruction/train.tsv \
                                                           --dev_path datasets/corrupted_short_message_reconstruction/dev.tsv \
                                                           --seq_length 192 --tgt_seq_length 192 --learning_rate 5e-5 --epochs_num 3 --batch_size 16

CUDA_VISIBLE_DEVICES=0,1 python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                                                  --vocab_path models/google_zh_vocab.txt \
                                                                  --config_path models/pegasus/base_config.json \
                                                                  --test_path datasets/corrupted_short_message_reconstruction/test.tsv \
                                                                  --prediction_path datasets/corrupted_short_message_reconstruction/prediction.tsv \
                                                                  --seq_length 192 --tgt_seq_length 192 --batch_size 256

PEGASUS-large

An example of fine-tuning and prediction with the Chinese pre-trained model PEGASUS-large on the malicious SMS variant-character restoration dataset:

CUDA_VISIBLE_DEVICES=0,1 python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_pegasus_large_seq512_model.bin-1000000 \
                                                           --vocab_path models/google_zh_vocab.txt \
                                                           --config_path models/pegasus/large_config.json \
                                                           --train_path datasets/corrupted_short_message_reconstruction/train.tsv \
                                                           --dev_path datasets/corrupted_short_message_reconstruction/dev.tsv \
                                                           --seq_length 192 --tgt_seq_length 192 --learning_rate 5e-5 --epochs_num 3 --batch_size 16

CUDA_VISIBLE_DEVICES=0,1 python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                                                  --vocab_path models/google_zh_vocab.txt \
                                                                  --config_path models/pegasus/large_config.json \
                                                                  --test_path datasets/corrupted_short_message_reconstruction/test.tsv \
                                                                  --prediction_path datasets/corrupted_short_message_reconstruction/prediction.tsv \
                                                                  --seq_length 192 --tgt_seq_length 192 --batch_size 256
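
Note that all four inference commands write to the same prediction.tsv path, so each run overwrites the previous output. To compare the models, rename the file after each run and inspect the restorations side by side; a minimal sketch with hypothetical per-model filenames:

# Print the first few restorations from each model side by side.
# The per-model filenames are hypothetical: each run's prediction.tsv
# must have been renamed accordingly before the next run overwrote it.
paths = {
    "BART-base":     "datasets/corrupted_short_message_reconstruction/prediction_bart_base.tsv",
    "BART-large":    "datasets/corrupted_short_message_reconstruction/prediction_bart_large.tsv",
    "PEGASUS-base":  "datasets/corrupted_short_message_reconstruction/prediction_pegasus_base.tsv",
    "PEGASUS-large": "datasets/corrupted_short_message_reconstruction/prediction_pegasus_large.tsv",
}

preds = {}
for name, path in paths.items():
    with open(path, encoding="utf-8") as f:
        preds[name] = f.read().splitlines()

for i in range(3):  # first three test messages
    for name, lines in preds.items():
        print(f"{name:14s} {lines[i]}")
    print()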