Pages 750-763     DOI: 10.3969/j.issn.2096-1693.2024.05.057
Intelligent named entities recognition for drilling engineering by integrating rotational position embedding and masked conditional random fields
CAO Qianwen, LI Wei, LIN Botao*, JIN Yan, HAN Xueyin, ZHANG Jiahao
1 College of Artificial Intelligence, China University of Petroleum-Beijing, Beijing 102249, China
2 College of Petroleum Engineering, China University of Petroleum-Beijing, Beijing 102249, China
3 China National Offshore Oil Corporation Energy Development Co., Ltd. Engineering Technology Branch, Tianjin 300452, China

Abstract

Drilling engineering reports record geological information about oil and gas reservoirs together with drilling engineering parameters. Automatically extracting the unstructured information in these reports can significantly speed up data ingestion into data lakes and thereby enable more efficient data management. However, the reports are domain-specific, and the diversity of their structure and language makes accurate named entity recognition (NER) difficult. The deep neural network models commonly used for NER are typically trained or fine-tuned on small annotated datasets, which leads to two problems. First, the lack of a large-scale annotated corpus limits the diversity of training samples, so models perform poorly on new or unseen data and generalize badly across data types. Second, existing models lack the ability to model long-distance context: because related entities may be scattered across long passages of a drilling report, these methods struggle to capture and recognize the relationships between named entities in complex documents. To address these problems, this paper proposes an intelligent NER method for drilling engineering that integrates rotational position embedding (more widely known as rotary position embedding, RoPE) and a masked conditional random field. The method builds on a Transformer encoder, bidirectional long short-term memory (BiLSTM), and conditional random field (CRF) architecture: the Transformer encoder uses a pre-trained language model to provide rich contextual semantic representations, the BiLSTM captures sequential dependencies, and the CRF performs sequence labeling.
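To make the position-encoding component concrete, the following is a minimal sketch of rotary position embedding in plain Python. It is not the authors' implementation: the function names `rope` and `dot`, the frequency base of 10000, and the toy vectors are assumptions chosen for illustration. It shows the property that matters for long-distance context: after rotation, the score between a query and a key depends only on their relative offset, not on where they sit in the document.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by the angle pos * theta.

    theta = base ** (-i / d) for the pair starting at even index i, which is
    the usual RoPE frequency schedule. `base` is an assumed default.
    """
    d = len(vec)
    if d % 2:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The key is 5 tokens after the query in both cases; only the absolute
# positions differ. The two scores agree up to floating-point error.
near = dot(rope(q, 2), rope(k, 7))
far = dot(rope(q, 102), rope(k, 107))
print(abs(near - far) < 1e-9)
```

Because each pair is only rotated, vector norms are unchanged; position enters the attention score solely through the difference between the two rotation angles.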
In addition, a masking mechanism improves on the traditional CRF by restricting the generation of inverted label sequences, which makes the order of the predicted labels more consistent. Integrating rotational position embedding further strengthens the model's awareness of relative position in the text, helping it capture dependencies between distant words and thus recognize named entities that span a larger context. Beyond the model improvements, the paper addresses the shortage of training data by constructing a domain-specific named entity corpus. The corpus annotates 12 entity categories, with 20,727 entity labels across 4,000 text segments, providing more diverse training samples and helping the model generalize. Experiments show that the proposed model achieves an F1 score of 86.49 on the test set, 2.65 points higher than the previous best model, and that its performance on long-tail entity categories, which are underrepresented in typical training data, also improves significantly. The method extends the application of NER in drilling engineering and gives engineers an efficient information-extraction tool that accelerates the analysis of drilling data, improves the efficiency of drilling operations management and data lake ingestion, and thereby benefits decision-making in drilling projects.
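The idea of masking a CRF so that it cannot emit out-of-order labels can be sketched as follows. This is an illustration under stated assumptions, not the paper's mechanism: it assumes a BIO tagging scheme in which an `I-X` label may only follow `B-X` or `I-X`, and the entity type `WELL`, the emission scores, and the transition score are all invented for the example. Disallowed transitions receive a very large negative score, so Viterbi decoding never selects them.

```python
NEG_INF = -1e9  # score assigned to masked (disallowed) transitions

def allowed(prev, curr):
    """Assumed BIO rule: I-X may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True

def viterbi(emissions, tags, transitions, mask=True):
    """Return the best tag path.

    emissions:   one dict per token mapping tag -> score
    transitions: dict mapping (prev_tag, curr_tag) -> score, default 0.0
    mask:        if True, disallowed transitions score NEG_INF
    """
    def trans(p, c):
        if mask and not allowed(p, c):
            return NEG_INF
        return transitions.get((p, c), 0.0)

    def start(t):
        # A sequence may not begin inside an entity when masking is on.
        return NEG_INF if mask and t.startswith("I-") else 0.0

    score = {t: start(t) + emissions[0][t] for t in tags}
    back = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for c in tags:
            best = max(tags, key=lambda p: score[p] + trans(p, c))
            new_score[c] = score[best] + trans(best, c) + em[c]
            ptr[c] = best
        score = new_score
        back.append(ptr)

    path = [max(tags, key=lambda t: score[t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

TAGS = ["O", "B-WELL", "I-WELL"]  # hypothetical label set
emissions = [
    {"O": 0.0, "B-WELL": 1.5, "I-WELL": 2.0},
    {"O": 0.0, "B-WELL": 2.0, "I-WELL": 1.8},
    {"O": 1.0, "B-WELL": 0.0, "I-WELL": 0.0},
]
transitions = {("B-WELL", "I-WELL"): 0.6}

print(viterbi(emissions, TAGS, transitions, mask=False))
# ['I-WELL', 'B-WELL', 'O']  (inverted: I- precedes B-)
print(viterbi(emissions, TAGS, transitions, mask=True))
# ['B-WELL', 'I-WELL', 'O']
```

With the same scores, the unmasked decoder produces an entity whose inside label comes before its begin label, while the masked decoder returns a well-ordered sequence. In a trained model the same mask would typically be applied to the transition matrix during both training and decoding.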


Keywords: named entity recognition; drilling engineering; Transformer encoder; natural language processing; deep learning
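Received: 2024-10-31
Funding: jointly supported by the National Natural Science Foundation of China (No. 62402526) and the Research Start-up Fund of China University of Petroleum-Beijing (No. 2462024BJRC013)
Corresponding author: linbotao@cup.edu.cn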
Cite this article:
CAO Qianwen, LI Wei, LIN Botao, JIN Yan, HAN Xueyin, ZHANG Jiahao. Intelligent named entities recognition for drilling engineering by integrating rotational position embedding and masked conditional random fields. Petroleum Science Bulletin, 2024, 9(5): 750-763.
Copyright 2016 Petroleum Science Bulletin