A retrieval-augmented generation method based on domain knowledge
Fund project: Natural Science Foundation of Hebei Province (F2022208006, F2023207003); Science and Technology Research Project of Higher Education Institutions of Hebei Province (QN2024196)




    Abstract:

    To improve the accuracy of current large language models (LLMs) when generating answers from retrieved documents, a retrieval-augmented generation (RAG) method based on domain knowledge is proposed. First, during retrieval, a first layer of sparse retrieval is performed using both the question and the domain knowledge, providing a domain-specific dataset for the subsequent dense retrieval. Second, during generation, a zero-shot learning approach is adopted: the domain knowledge is concatenated before or after the question and combined with the retrieved documents as input to the large language model. Finally, multiple experiments are conducted on medical-domain and legal-domain datasets using ChatGLM2-6B and Baichuan2-7B-chat, and the performance is evaluated. The results indicate that the proposed domain-knowledge-based RAG method effectively improves the domain relevance of the answers generated by large language models, and that the zero-shot learning method outperforms fine-tuning. With zero-shot learning, sparse retrieval incorporating domain knowledge combined with placing the domain knowledge before the question achieves the largest improvement on ChatGLM2-6B, raising ROUGE-1, ROUGE-2 and ROUGE-L scores by 3.82, 1.68 and 4.32 percentage points, respectively, over the baseline method. The proposed method improves the accuracy of answers generated by large language models and provides a useful reference for research on and applications of open-domain question answering.
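    The abstract describes the pipeline only at a high level. The following Python sketch illustrates one plausible reading of it under stated assumptions: an IDF-weighted term-overlap scorer stands in for the unspecified sparse retriever, cosine similarity over a generic `embed` function stands in for the dense retriever, and the prompt template wording is invented for illustration. The function names and the demo data are placeholders, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of the pipeline described in
# the abstract: stage-1 sparse retrieval driven by the question plus domain
# knowledge, stage-2 dense retrieval over the resulting candidates, and a
# zero-shot prompt that places the domain knowledge before or after the question
# together with the retrieved documents.

import math
import re
from collections import Counter


def tokenize(text):
    """Crude word tokenizer; the paper's actual tokenization is not specified."""
    return re.findall(r"\w+", text.lower())


def sparse_scores(query_terms, corpus):
    """Score documents by IDF-weighted term overlap, a stand-in for the
    sparse (e.g. BM25-style) retrieval named in the abstract."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(tokenize(doc)))
    scores = []
    for doc in corpus:
        tf = Counter(tokenize(doc))
        scores.append(sum(tf[t] * math.log(1 + n / (1 + df[t])) for t in query_terms))
    return scores


def first_stage_retrieve(question, domain_knowledge, corpus, k=100):
    """Stage 1: sparse retrieval over the full corpus using the question
    concatenated with domain knowledge, yielding a domain-specific candidate set."""
    query_terms = tokenize(question + " " + domain_knowledge)
    scores = sparse_scores(query_terms, corpus)
    order = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in order[:k]]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-9)


def second_stage_retrieve(question, candidates, embed, k=3):
    """Stage 2: dense retrieval over the candidates; `embed` is assumed to map
    text to a vector (e.g. a sentence encoder), which the abstract does not name."""
    q = embed(question)
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(q, embed(candidates[i])), reverse=True)
    return [candidates[i] for i in order[:k]]


def build_prompt(question, domain_knowledge, retrieved_docs, knowledge_first=True):
    """Zero-shot prompt: domain knowledge is concatenated before or after the
    question and combined with the retrieved documents (template wording assumed)."""
    query = f"{domain_knowledge} {question}" if knowledge_first else f"{question} {domain_knowledge}"
    context = "\n".join(retrieved_docs)
    return f"Reference documents:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    corpus = [
        "Aspirin is used to relieve mild pain and reduce fever.",
        "Ibuprofen is a nonsteroidal anti-inflammatory drug for pain and fever.",
        "A valid contract requires offer, acceptance and consideration.",
    ]
    vocab = sorted({t for doc in corpus for t in tokenize(doc)})

    def embed(text):  # toy bag-of-words embedding, purely for illustration
        tf = Counter(tokenize(text))
        return [tf[t] for t in vocab]

    question = "Which drug helps reduce a fever?"
    knowledge = "Medical domain: analgesics and antipyretics such as aspirin and ibuprofen."
    candidates = first_stage_retrieve(question, knowledge, corpus, k=3)
    top_docs = second_stage_retrieve(question, candidates, embed, k=2)
    # Prompt with domain knowledge placed before the question, the variant the
    # abstract reports as performing best on ChatGLM2-6B; in the paper this
    # prompt would be passed to ChatGLM2-6B or Baichuan2-7B-chat.
    print(build_prompt(question, knowledge, top_docs, knowledge_first=True))
```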

Cite this article:

张高飞, 李欢, 池云仙, 赵巧红, 勾智楠, 高凯. A retrieval-augmented generation method based on domain knowledge[J]. 河北工业科技, 2025, 42(2): 103-110.

History
  • Received: 2024-11-08
  • Revised: 2025-01-19
  • Published online: 2025-04-03