The purpose of automatic summarization is to give users a concise description of a text by compressing and distilling the original content. It is an information-compression process that condenses one or more input documents into a brief summary; some information loss is inevitable, but as much of the important information as possible should be retained. Automatic summarization is also an important task in natural language generation. In this article, we take the text summarization task as an example to show the process of fine-tuning the Mengzi pre-trained model on a downstream task. The overall process can be divided into 4 parts: data preprocessing, model training, summary generation, and evaluation.
The following is a demonstration on the Chinese Scientific Literature (CSL) text summarization dataset.
Dataset download: https://github.com/CLUEbenchmark/CLGE
The complete sample code for this article: https://github.com/Langboat/Mengzi/blob/main/examples/Mengzi_summary.ipynb
The CSL data is stored as JSON lines. We define a read_json function to read the data file and load it into memory.
```python
import json
from tqdm import tqdm

def read_json(input_file: str) -> list:
    with open(input_file, 'r') as f:
        lines = f.readlines()
    return list(map(json.loads, tqdm(lines, desc='Reading...')))

train = read_json("csl/v1/train.json")
dev = read_json("csl/v1/dev.json")
```
The dataset statistics are as follows:
| Split | Examples |
|---|---|
| Training set | 2500 |
| Test set | 500 |
The original format of each training sample is as follows:
```python
{'id': 2364,
 'title': '基于语义规则的Web服务发现方法',
 'abst': '语义Web服务发现问题研究的核心内容是服务描述与对应的服务发现方法。服务描述分为服务请求描述与服务发布描述,但目前的服务发现方法,并未将请求描述与发布描述分开,以比对服务请求描述与服务发布描述中对应部分作为匹配依据,导致服务请求描述构建困难以及发现结果不够理想。提出以语义规则刻画服务请求描述,以本体构建服务发布描述,进行有效的以语义规则驱动的Web服务发现。对语义规则添加影响因子使得服务匹配精度可以通过匹配度来度量,并按照给定的调节系数来决定最终匹配是否成功。最后以OWL-STCV2测试服务集合进行了对比实验,证实该方法有效地提高了查全率与查准率高,特别是Top-k查准率。'}
```
The purpose of data preprocessing is to turn the raw data into an input form the model accepts, which amounts to building a pipeline between the raw data and the model input. The model accepts two fields: input_ids, the tokenized representation of the input text, which can be produced directly with the Tokenizer provided by transformers; and labels, the tokenized representation of the text the model is expected to output. Batch padding is handled by the DataCollatorForSeq2Seq class from transformers, which is passed to the Trainer; a minimal sketch follows, and see the full sample code for details.
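Below is a minimal sketch of this step. The preprocess helper and its field names ('abst' as the input text, 'title' as the target summary) are illustrative, taken from the CSL sample shown above, and the sketch assumes the Mengzi_tokenizer and Mengzi_model loaded in the training code below:

```python
from transformers import DataCollatorForSeq2Seq

def preprocess(examples, tokenizer, max_input_length=512, max_target_length=64):
    # Tokenize the abstract as the model input and the title as the target summary.
    features = []
    for ex in examples:
        input_ids = tokenizer(ex["abst"], truncation=True,
                              max_length=max_input_length).input_ids
        labels = tokenizer(ex["title"], truncation=True,
                           max_length=max_target_length).input_ids
        features.append({"input_ids": input_ids, "labels": labels})
    return features

trainset = preprocess(train, Mengzi_tokenizer)
devset = preprocess(dev, Mengzi_tokenizer)

# Pads input_ids and labels to the longest sequence in each batch;
# label padding uses -100 so that padded positions are ignored by the loss.
collator = DataCollatorForSeq2Seq(Mengzi_tokenizer, model=Mengzi_model)
```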
Before training the model, you need to specify the training hyperparameters, including the number of training epochs, the learning rate, the learning-rate schedule, and so on. This is done by instantiating the TrainingArguments class and passing it to the Trainer along with the other components; training is then launched with the trainer.train() method provided by Hugging Face, and after training the best model is saved with trainer.save_model().
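A minimal sketch of such a configuration is shown below; the specific values are illustrative defaults, not the settings used in the original notebook:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test",              # checkpoints are written here
    num_train_epochs=3,             # number of training epochs
    learning_rate=3e-5,             # peak learning rate
    lr_scheduler_type="linear",     # learning-rate management strategy
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",    # evaluate on the dev set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,    # keep the best checkpoint for save_model
)
```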
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer

model_path = "Langboat/mengzi-t5-base"  # Mengzi-T5 checkpoint on the Hugging Face Hub

Mengzi_tokenizer = T5Tokenizer.from_pretrained(model_path)
Mengzi_model = T5ForConditionalGeneration.from_pretrained(model_path)

trainer = Trainer(
    tokenizer=Mengzi_tokenizer,
    model=Mengzi_model,
    args=training_args,
    data_collator=collator,
    train_dataset=trainset,
    eval_dataset=devset,
)

trainer.train()
trainer.save_model("test/best")
```
The best model is saved under test/best. We can load it and use it to generate summaries. The following shows one way to run inference: the texts to be summarized are tokenized and passed to the model, and the summaries are obtained by decoding the generated token ids with the tokenizer. Readers can of course substitute any generation approach they are familiar with.
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the best checkpoint saved above and move the model to the GPU.
tokenizer = T5Tokenizer.from_pretrained("test/best")
model = T5ForConditionalGeneration.from_pretrained("test/best").cuda()

def predict(sources, batch_size=8):
    model.eval()  # switch to evaluation mode so that dropout is disabled
    kwargs = {"num_beams": 4}
    outputs = []
    for start in tqdm(range(0, len(sources), batch_size)):
        batch = sources[start:start + batch_size]
        input_tensor = tokenizer(batch, return_tensors="pt", truncation=True,
                                 padding=True, max_length=512).input_ids.cuda()
        outputs.extend(model.generate(input_ids=input_tensor, **kwargs))
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
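For example, the dev-set abstracts loaded earlier can be summarized and compared against their titles; the generations and titles variables produced here feed the evaluation below (the field names follow the CSL sample above):

```python
sources = [example["abst"] for example in dev]  # texts to summarize
titles = [example["title"] for example in dev]  # reference summaries

generations = predict(sources)
print(generations[0])
```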
The quality of the generated text is evaluated with ROUGE-1, ROUGE-2, and ROUGE-L, the automatic metrics commonly used for summarization tasks.
```python
from rouge import Rouge

rouge = Rouge()

def rouge_score(candidate, reference):
    # Compute character-level ROUGE for Chinese by space-separating characters.
    text1 = " ".join(list(candidate))
    text2 = " ".join(list(reference))
    score = rouge.get_scores(text1, text2)
    return score

def compute_rouge(preds, refs):
    r1, r2, r_l = [], [], []
    for pred, ref in zip(preds, refs):
        score = rouge_score(pred, ref)
        r1.append(score[0]["rouge-1"]["f"])
        r2.append(score[0]["rouge-2"]["f"])
        r_l.append(score[0]["rouge-l"]["f"])
    return sum(r1) / len(r1), sum(r2) / len(r2), sum(r_l) / len(r_l)

R_1, R_2, R_L = compute_rouge(generations, titles)
```
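The corpus-level averages can then be reported, for example:

```python
print(f"ROUGE-1: {R_1:.4f}  ROUGE-2: {R_2:.4f}  ROUGE-L: {R_L:.4f}")
```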