Research on Machine Learning-Based Information Extraction from Codes for Quality Acceptance of Building Engineering
周启瑞
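Abstract
The construction and real estate industries are information-intensive and knowledge-intensive, and using this information and knowledge efficiently has become an important research topic. Codes for Quality Acceptance of building engineering are an important basis for guiding construction and inspection. At present, these codes are applied through manual reading and lookup, and automatic quality inspection systems based on them mostly rely on manually established rules. Automatically extracting the constraints contained in the codes is therefore a key technology for reducing manual work and raising the level of automation.
This paper takes the Codes for Quality Acceptance of building engineering as its research object. Combining the characteristics of code documents with information extraction technology, it proposes a method of named entity recognition and information extraction based on hybrid machine learning. The specific contributions are as follows:
1. The constraints in the Codes for Quality Acceptance of building engineering are divided into relationship constraints and attribute constraints. A relationship constraint specifies the order of, and the time interval between, two working procedures; an attribute constraint specifies the attributes of an object such as a material or a finished product;
2. Named entity recognition is performed with a Bi-LSTM-CRF-based method in place of the traditional domain-dictionary approach, which makes the model more general across domains;
3. An extraction model based on LSTM-MLP is proposed. It removes the dependence of rule-based information extraction on rules written by experts, giving the model a higher level of automation;
4. After the information in the codes has been extracted and represented in structured form, a more intuitive display form is proposed to make learning and querying easier. This facilitates code-based acceptance work in building projects and can support building and updating the rule bases of related quality acceptance systems.
Model testing shows that the proposed information extraction model based on hybrid machine learning performs well: the F1 score reaches 88.5% for named entity recognition and 83.8% for information extraction. As a first attempt at information extraction from engineering code documents, this study reduces manual dependence and generalizes well, has high reference value, and merits further research.
Keywords: Codes for Quality Acceptance; Information Extraction; Hybrid Machine Learning; Bi-LSTM-CRF; Named Entity Recognition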
Abstract
The construction and real estate industries are information-intensive and knowledge-intensive, and using this information and knowledge efficiently has become an important research topic. Codes for Quality Acceptance are an important basis for guiding construction and inspection. At present, these codes are applied through manual reading and searching, and automatic quality inspection systems based on them mostly rely on manually established rules. Automatic information extraction from Codes for Quality Acceptance is therefore a key technology for reducing manual work and raising the level of automation.
This paper takes the Codes for Quality Acceptance of building engineering as its research object. Combining the characteristics of code documents with information extraction technology, it proposes a method of named entity recognition and information extraction based on hybrid machine learning. The specific contributions are as follows:
1. The constraints in the Codes for Quality Acceptance of building engineering are classified into relationship constraints and attribute constraints. A relationship constraint specifies the order of, and the time interval between, two working procedures; an attribute constraint specifies the attributes of an object such as a material or a finished product.
2. Named entity recognition based on Bi-LSTM-CRF is used instead of a traditional domain dictionary, which makes the model more general across domains.
3. An information extraction model based on LSTM-MLP is proposed. It removes the dependence of rule-based information extraction on rules written by experts, giving the model a higher level of automation.
4. After the information in the Codes for Quality Acceptance has been extracted and represented in structured form, a more intuitive display form is proposed for learning and querying. This facilitates code-based acceptance work in construction projects and can support building and updating the rule bases of related quality acceptance systems.
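The two constraint types in item 1 can be pictured as simple records. The following is a minimal Python sketch, not the representation used in this work: the class names, field names, comparator set, and sample values are all hypothetical and chosen only for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Optional, Union

# Comparison symbols an attribute clause might use (an assumed set).
_COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


@dataclass(frozen=True)
class RelationConstraint:
    """Order of, and optional interval between, two working procedures."""
    predecessor: str                      # procedure that must come first
    successor: str                        # procedure that must come after
    min_interval: Optional[float] = None  # required gap, if the clause states one
    interval_unit: Optional[str] = None   # e.g. "h" or "d"


@dataclass(frozen=True)
class AttributeConstraint:
    """Requirement on an attribute of a material, finished product, etc."""
    target: str                 # object being constrained
    attribute: str              # attribute name
    comparator: str             # key of _COMPARATORS
    value: float                # threshold stated in the clause
    unit: Optional[str] = None

    def is_satisfied_by(self, measured: float) -> bool:
        # Compare a measured value against the stated threshold.
        return _COMPARATORS[self.comparator](measured, self.value)


Constraint = Union[RelationConstraint, AttributeConstraint]


def kind(constraint: Constraint) -> str:
    """Return which of the two constraint types a record belongs to."""
    return "relation" if isinstance(constraint, RelationConstraint) else "attribute"


# Invented placeholder values, not taken from any real code clause.
rel = RelationConstraint("procedure A", "procedure B", min_interval=24.0, interval_unit="h")
attr = AttributeConstraint("material X", "thickness", ">=", 10.0, unit="mm")
```

Keeping the two types as separate records means a relationship constraint always names two procedures and an attribute constraint always names one object, which matches the distinction drawn in item 1.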
Model testing shows that the proposed information extraction model based on hybrid machine learning is effective: the F1 score reaches 88.5% for named entity recognition and 83.8% for information extraction. As a first attempt to extract information from engineering codes, this study reduces manual dependence and generalizes well; it has high reference value and merits further study.
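F1 is the harmonic mean of precision and recall. The abstract does not state how the scores were computed (matching criterion, micro or macro averaging), so the sketch below is only one common convention: exact-match, entity-level scoring over (start, end, label) spans. The function name and the sample spans are hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start index, end index, entity label)


def entity_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """Exact-match entity-level F1 between gold and predicted spans."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    # 2PR / (P + R) simplifies to 2*TP / (|gold| + |pred|).
    return 2 * true_positives / (len(gold_set) + len(pred_set))


# Invented spans for illustration: two of three predictions match exactly.
gold_spans = [(0, 2, "PROCEDURE"), (5, 7, "MATERIAL"), (9, 10, "ATTRIBUTE")]
pred_spans = [(0, 2, "PROCEDURE"), (5, 7, "MATERIAL"), (9, 11, "ATTRIBUTE")]
score = entity_f1(gold_spans, pred_spans)
```

Under this convention a prediction with a wrong boundary or a wrong label counts as both a false positive and a false negative, so it is stricter than token-level scoring.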
Key words: Codes for Quality Acceptance; Information Extraction; Hybrid Machine Learning; Bi-LSTM-CRF; Named Entity Recognition