TY - JOUR
T1 - TechWord
T2 - Development of a technology lexical database for structuring textual technology information based on natural language processing
AU - Jang, Hyejin
AU - Jeong, Yujin
AU - Yoon, Byungun
N1 - Publisher Copyright: © 2020 Elsevier Ltd
PY - 2021/2
Y1 - 2021/2
AB - Text mining of technological documents such as patents plays an important role in technology intelligence research for technology R&D planning. WordNet, an English lexical database, is widely used for pre-processing text data, for example in word lemmatization and synonym search. However, technological vocabulary is complex and domain-specific, and WordNet is limited in its ability to reflect technological features when analyzing technological information. Thus, to improve text mining performance on technological information, this study proposes a methodology for designing a TechWord-based lexical database built on the lexical characteristics that differentiate technological words from general words. To do this, we define TechWord, a unit of technology lexical information, and construct TechSynset, a set of synonyms among TechWords. First, TechWord, a unit word that describes a technology, is structured by identifying nouns and verbs through dependency parsing between words. The importance of word connectivity is then investigated through network centrality index analysis based on the dependency relations between words. Subsequently, to search for synonyms suited to the target technology domain, a TechSynset is constructed from synset information, with an additional analysis that calculates cosine similarity between word embedding vectors. To apply the proposed methodology to real technology-related information, we collect patent data from the automotive field and present the resulting TechWords and TechSynsets. This study improves text mining of technological information by structuring the word-to-word link information in technological documents through an automated process.
KW - Lexical analysis
KW - Natural language processing
KW - Patent mining
KW - Text mining
KW - WordNet
UR - http://www.scopus.com/inward/record.url?scp=85091777411&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2020.114042
DO - 10.1016/j.eswa.2020.114042
M3 - Article
AN - SCOPUS:85091777411
SN - 0957-4174
VL - 164
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 114042
ER -