TY - JOUR
T1 - A data type inference method based on long short-term memory by improved feature for weakness analysis in binary code
AU - Jeong, Junho
AU - Lim, Joong Yeon
AU - Son, Yunsik
N1 - Publisher Copyright: © 2019
PY - 2019/11
Y1 - 2019/11
N2 - As software is used in various areas today, software security has become a crucial issue. Third-party libraries, which play a major role in software development, make it difficult to analyze and test software security. To identify the major weaknesses in software, it is essential to know the variables it uses and the data type of each variable. However, because third-party libraries are generally in binary code form, the variables, their data types, the program syntax, and the semantic information present in the source code are removed. Therefore, reconstructing the variables and their data type information from binary code is the most important step in weakness analysis. Traditionally, this reconstruction step is based on pattern matching; however, its ability to infer data types is limited. We propose a method that uses deep learning to infer the data types of variables identified by pattern matching in binary code, and we analyze its performance. The proposed method improves feature generation to resolve the inconsistencies in the features generated in previous studies. As a result, the prediction accuracy for float and double improves by 7.2% on average compared with the previous study, and the overall accuracy increases by 5.1%.
AB - As software is used in various areas today, software security has become a crucial issue. Third-party libraries, which play a major role in software development, make it difficult to analyze and test software security. To identify the major weaknesses in software, it is essential to know the variables it uses and the data type of each variable. However, because third-party libraries are generally in binary code form, the variables, their data types, the program syntax, and the semantic information present in the source code are removed. Therefore, reconstructing the variables and their data type information from binary code is the most important step in weakness analysis. Traditionally, this reconstruction step is based on pattern matching; however, its ability to infer data types is limited. We propose a method that uses deep learning to infer the data types of variables identified by pattern matching in binary code, and we analyze its performance. The proposed method improves feature generation to resolve the inconsistencies in the features generated in previous studies. As a result, the prediction accuracy for float and double improves by 7.2% on average compared with the previous study, and the overall accuracy increases by 5.1%.
KW - Binary code
KW - Data type inference
KW - Long short-term memory
KW - Reconstruction data information
KW - Software weakness
UR - http://www.scopus.com/inward/record.url?scp=85067021627&partnerID=8YFLogxK
U2 - 10.1016/j.future.2019.05.013
DO - 10.1016/j.future.2019.05.013
M3 - Article
AN - SCOPUS:85067021627
SN - 0167-739X
VL - 100
SP - 1044
EP - 1052
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
ER -