TechWord: Development of a technology lexical database for structuring textual technology information based on natural language processing

Hyejin Jang, Yujin Jeong, Byungun Yoon

Research output: Contribution to journal > Article > peer-review

33 Scopus citations

Abstract

Text mining of technological documents such as patents plays an important role in technology intelligence research for technology R&D planning. WordNet, an English lexical database, is widely used in text pre-processing tasks such as word lemmatization and synonym search. However, technological vocabulary is complex and domain-specific, and WordNet is limited in how well it reflects technological features. To improve text mining of technological information, this study proposes a methodology for designing a lexical database, built around TechWord, that reflects the lexical characteristics that distinguish technological words from general words. To this end, we define TechWord, a unit of technology lexical information, and construct TechSynset, a set of synonyms among TechWords. First, dependency parsing between words is used to structure TechWords, the unit words that describe a technology, and to identify nouns and verbs. The importance of each word's connectivity is then assessed through network centrality analysis of the dependency relations between words. Next, to find synonyms suited to the target technology domain, a TechSynset is constructed from synset information, supplemented by an analysis that calculates cosine similarity between word embedding vectors. To apply the proposed methodology to real technology-related information, we collect patent data from the automotive field and present the resulting TechWords and TechSynsets. This study improves text mining of technological information by structuring word-to-word link information in technological documents through an automated process.
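The two quantitative steps named in the abstract, centrality over dependency relations and cosine similarity over word embeddings, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which centrality index or similarity threshold was used, so degree centrality and a cutoff of 0.8 are assumptions, and the dependency edges, candidate words, and 3-dimensional vectors are hypothetical toy data standing in for parser output, synset lookups, and trained embeddings.

```python
import math
from collections import defaultdict

# Hypothetical (head, dependent) pairs, standing in for dependency-parser
# output on patent sentences. The words and edges are illustrative only.
DEPENDENCY_EDGES = [
    ("controls", "controller"),
    ("controls", "motor"),
    ("motor", "electric"),
    ("drives", "motor"),
    ("drives", "wheel"),
]

# Hypothetical toy embeddings; real vectors would come from a trained model.
EMBEDDINGS = {
    "motor": [0.9, 0.1, 0.0],
    "engine": [0.8, 0.2, 0.1],
    "drive": [0.1, 0.9, 0.2],
}


def degree_centrality(edges):
    """Normalized degree centrality, treating dependency edges as undirected.

    Degree centrality is an assumed choice; the source only says
    "network centrality index".
    """
    neighbors = defaultdict(set)
    for head, dependent in edges:
        neighbors[head].add(dependent)
        neighbors[dependent].add(head)
    n = len(neighbors)
    if n < 2:
        return {word: 0.0 for word in neighbors}
    return {word: len(adj) / (n - 1) for word, adj in neighbors.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_synonyms(target, candidates, embeddings, threshold=0.8):
    """Keep candidate synonyms whose embedding is close to the target's.

    `candidates` stands in for words retrieved from synset information;
    the threshold value is an assumption made for this sketch.
    """
    target_vec = embeddings[target]
    return [
        word
        for word in candidates
        if word in embeddings
        and cosine_similarity(target_vec, embeddings[word]) >= threshold
    ]


centrality = degree_centrality(DEPENDENCY_EDGES)
synonyms = filter_synonyms("motor", ["engine", "drive"], EMBEDDINGS)
```

In this toy graph "motor" touches three of the five other words, so it scores highest on centrality, and only "engine" survives the similarity filter as a domain-appropriate synonym candidate.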

Original language: English
Article number: 114042
Journal: Expert Systems with Applications
Volume: 164
State: Published - Feb 2021

Keywords

  • Lexical analysis
  • Natural language processing
  • Patent mining
  • Text mining
  • WordNet
