The purpose of automatic summarization is to give users a concise description of a text by compressing and distilling the original content. It is an information-compression process that condenses one or more input documents into a brief summary; some information loss is inevitable, but as much of the important information as possible should be retained. Automatic summarization is also an important task in natural language generation. In this article, we take the text summarization task as an example to show the process of fine-tuning the Mengzi pre-trained model on a downstream task. The overall process can be divided into 4 parts: data preprocessing, model training, summary generation, and evaluation.
The following is a demonstration on the Chinese Scientific Literature (CSL) text summarization dataset.
Dataset download: https://github.com/CLUEbenchmark/CLGE
The complete sample code for this article: https://github.com/Langboat/Mengzi/blob/main/examples/Mengzi_summary.ipynb
The CSL data is stored as JSON lines. We define a read_json function to read the data file and load it into memory.
```python
import json
from tqdm import tqdm

def read_json(input_file: str) -> list:
    with open(input_file, 'r') as f:
        lines = f.readlines()
    return list(map(json.loads, tqdm(lines, desc='Reading...')))

train = read_json("csl/v1/train.json")
dev = read_json("csl/v1/dev.json")
```
The dataset statistics are as follows:
| Split | Examples |
|---|---|
| Training set | 2500 |
| Test set | 500 |
The original format of each training sample is as follows:
```python
{'id': 2364,
 'title': '基于语义规则的Web服务发现方法',
 'abst': '语义Web服务发现问题研究的核心内容是服务描述与对应的服务发现方法。服务描述分为服务请求描述与服务发布描述,但目前的服务发现方法,并未将请求描述与发布描述分开,以比对服务请求描述与服务发布描述中对应部分作为匹配依据,导致服务请求描述构建困难以及发现结果不够理想。提出以语义规则刻画服务请求描述,以本体构建服务发布描述,进行有效的以语义规则驱动的Web服务发现。对语义规则添加影响因子使得服务匹配精度可以通过匹配度来度量,并按照给定的调节系数来决定最终匹配是否成功。最后以OWL-STCV2测试服务集合进行了对比实验,证实该方法有效地提高了查全率与查准率高,特别是Top-k查准率。'}
```
The purpose of data preprocessing is to turn the raw data into an input form the model accepts, which amounts to building a pipeline between the raw data and the model input. The model accepts two fields: input_ids, the tokenized representation of the input text, which can be produced directly with the Tokenizer provided by transformers; and labels, the tokenized representation of the text the model is expected to output. Batch padding is handled by the DataCollatorForSeq2Seq class from transformers, which is passed to the Trainer; a minimal sketch follows, and see the full sample code for details.
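Below is a minimal sketch of this step. The preprocess helper and its field names ('abst' as the input text, 'title' as the target summary) are illustrative, taken from the CSL sample shown above, and the sketch assumes the Mengzi_tokenizer and Mengzi_model loaded in the training code below:

```python
from transformers import DataCollatorForSeq2Seq

def preprocess(examples, tokenizer, max_input_length=512, max_target_length=64):
    # Tokenize the abstract as the model input and the title as the target summary.
    features = []
    for ex in examples:
        input_ids = tokenizer(ex["abst"], truncation=True,
                              max_length=max_input_length).input_ids
        labels = tokenizer(ex["title"], truncation=True,
                           max_length=max_target_length).input_ids
        features.append({"input_ids": input_ids, "labels": labels})
    return features

trainset = preprocess(train, Mengzi_tokenizer)
devset = preprocess(dev, Mengzi_tokenizer)

# Pads input_ids and labels to the longest sequence in each batch;
# label padding uses -100 so that padded positions are ignored by the loss.
collator = DataCollatorForSeq2Seq(Mengzi_tokenizer, model=Mengzi_model)
```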
Before training the model, you need to specify the training hyperparameters, including the number of training epochs, the learning rate, the learning-rate schedule, and so on. This is done by instantiating the TrainingArguments class and passing it to the Trainer along with the other components; training is then launched with the trainer.train() method provided by Hugging Face, and after training the best model is saved with trainer.save_model().
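A minimal sketch of such a configuration is shown below; the specific values are illustrative defaults, not the settings used in the original notebook:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test",              # checkpoints are written here
    num_train_epochs=3,             # number of training epochs
    learning_rate=3e-5,             # peak learning rate
    lr_scheduler_type="linear",     # learning-rate management strategy
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",    # evaluate on the dev set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,    # keep the best checkpoint for save_model
)
```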
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer

model_path = "Langboat/mengzi-t5-base"  # Mengzi-T5 checkpoint on the Hugging Face Hub

Mengzi_tokenizer = T5Tokenizer.from_pretrained(model_path)
Mengzi_model = T5ForConditionalGeneration.from_pretrained(model_path)

trainer = Trainer(
    tokenizer=Mengzi_tokenizer,
    model=Mengzi_model,
    args=training_args,
    data_collator=collator,
    train_dataset=trainset,
    eval_dataset=devset,
)

trainer.train()
trainer.save_model("test/best")
```
The best model is saved under test/best. We can load it and use it to generate summaries. The following shows one way to run inference: the texts to be summarized are tokenized and passed to the model, and the summaries are obtained by decoding the generated token ids with the tokenizer. Readers can of course substitute any generation approach they are familiar with.
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the best checkpoint saved above and move the model to the GPU.
tokenizer = T5Tokenizer.from_pretrained("test/best")
model = T5ForConditionalGeneration.from_pretrained("test/best").cuda()

def predict(sources, batch_size=8):
    model.eval()  # switch to evaluation mode so that dropout is disabled
    kwargs = {"num_beams": 4}
    outputs = []
    for start in tqdm(range(0, len(sources), batch_size)):
        batch = sources[start:start + batch_size]
        input_tensor = tokenizer(batch, return_tensors="pt", truncation=True,
                                 padding=True, max_length=512).input_ids.cuda()
        outputs.extend(model.generate(input_ids=input_tensor, **kwargs))
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
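For example, the dev-set abstracts loaded earlier can be summarized and compared against their titles; the generations and titles variables produced here feed the evaluation below (the field names follow the CSL sample above):

```python
sources = [example["abst"] for example in dev]  # texts to summarize
titles = [example["title"] for example in dev]  # reference summaries

generations = predict(sources)
print(generations[0])
```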
The quality of the generated text is evaluated with ROUGE-1, ROUGE-2, and ROUGE-L, the automatic metrics commonly used for summarization tasks.
```python
from rouge import Rouge

rouge = Rouge()

def rouge_score(candidate, reference):
    # Compute character-level ROUGE for Chinese by space-separating characters.
    text1 = " ".join(list(candidate))
    text2 = " ".join(list(reference))
    score = rouge.get_scores(text1, text2)
    return score

def compute_rouge(preds, refs):
    r1, r2, r_l = [], [], []
    for pred, ref in zip(preds, refs):
        score = rouge_score(pred, ref)
        r1.append(score[0]["rouge-1"]["f"])
        r2.append(score[0]["rouge-2"]["f"])
        r_l.append(score[0]["rouge-l"]["f"])
    return sum(r1) / len(r1), sum(r2) / len(r2), sum(r_l) / len(r_l)

R_1, R_2, R_L = compute_rouge(generations, titles)
```
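The corpus-level averages can then be reported, for example:

```python
print(f"ROUGE-1: {R_1:.4f}  ROUGE-2: {R_2:.4f}  ROUGE-L: {R_L:.4f}")
```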