Chemically Augmented String Kernel for Extraction and Classification of Chemical Compounds from Text

Abstract

Chemical compounds carry key information in text documents from the materials science domain. To extract and classify chemical compounds in text, we present a novel kernel, the Chemically Augmented String Kernel (CASK), which incorporates periodic table properties of chemical elements. To evaluate CASK, we performed experiments on a corpus of 14,656 journal paper abstracts in the field of Thin Film Solar Cells (TFSCs) published between 1990 and 2014. TFSCs are devices that typically contain four layers, each composed of different materials. Each abstract was treated as a bag of words, and each word was classified into one of five classes: four classes corresponding to the four layers of TFSCs, and one class for unrelated words that do not belong to any layer. CASK was found to outperform the string subsequence kernel as well as other kernels such as the radial basis function, polynomial, and linear kernels.
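
The abstract does not spell out how CASK is defined, so the following is only a minimal sketch of the general idea, not the authors' formulation. It implements the standard gap-weighted string subsequence kernel (the baseline named above) by brute-force enumeration, which is adequate for short words such as chemical formulas. It then shows one hypothetical way to add periodic table information: the same kernel is applied to the sequence of periodic-table group numbers of the elements in a word, and the two kernels are summed. The names `GROUP`, `element_groups`, `augmented_kernel`, and the parameters `lam` and `weight` are assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations


def subsequence_features(seq, n, lam):
    """Map a sequence to {length-n subsequence: weight}.

    Each occurrence contributes lam ** span, where span is the distance
    from the first to the last matched position (inclusive). Assumes n >= 1.
    """
    feats = defaultdict(float)
    for idx in combinations(range(len(seq)), n):
        span = idx[-1] - idx[0] + 1
        feats[tuple(seq[i] for i in idx)] += lam ** span
    return feats


def ssk(s, t, n=2, lam=0.5):
    """Gap-weighted string subsequence kernel: dot product of feature maps."""
    fs = subsequence_features(s, n, lam)
    ft = subsequence_features(t, n, lam)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)


# Periodic-table group numbers for a few elements (illustrative subset only).
GROUP = {"Cu": 11, "Zn": 12, "Cd": 12, "Ga": 13, "In": 13,
         "O": 16, "S": 16, "Se": 16, "Te": 16}


def element_groups(word):
    """Hypothetical helper: element symbols in a word -> their group numbers."""
    symbols = re.findall(r"[A-Z][a-z]?", word)
    return [GROUP[x] for x in symbols if x in GROUP]


def augmented_kernel(s, t, n=2, lam=0.5, weight=1.0):
    """Assumed augmentation: character-level SSK plus SSK over group numbers.

    A non-negative weighted sum of two kernels is itself a valid kernel.
    """
    chem = ssk(element_groups(s), element_groups(t), n, lam)
    return ssk(s, t, n, lam) + weight * chem


if __name__ == "__main__":
    print(ssk("CdS", "ZnS"))               # character-level similarity
    print(augmented_kernel("CdS", "ZnS"))  # with group-number similarity added
```

Under these assumptions, "CdS" and "ZnS" share no two-character subsequence, so the plain string kernel gives them zero similarity, while the augmented kernel gives a positive value because Cd and Zn are both in group 12 and S appears in both words. A kernel of this kind could be supplied to a kernel classifier as a precomputed Gram matrix.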
