Statistics and Its Interface

Volume 10 (2017)

Number 2

Word segmentation in Chinese language processing

Pages: 165 – 173

DOI: https://dx.doi.org/10.4310/SII.2017.v10.n2.a1

Authors

Xinxin Shu (Department of Biostatistics and Research Decision Sciences, Merck Research Laboratories, U.S.A.)

Junhui Wang (Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong)

Xiaotong Shen (School of Statistics, University of Minnesota, Minneapolis, Minn., U.S.A.)

Annie Qu (Department of Statistics, University of Illinois at Urbana-Champaign, Illinois, U.S.A.)

Abstract

This paper proposes a new statistical learning method for word segmentation in Chinese language processing. Word segmentation is the crucial first step towards natural language processing. Despite progress, segmentation remains under-studied, particularly for the Chinese language, the second most popular language among all internet users. One major difficulty is that the Chinese language is highly context-dependent and ambiguous in terms of word representations. To overcome this difficulty, we cast the problem of segmentation into a framework of sequence classification, where an instance (observation) is a sequence of characters, and a class label is a sequence determining how each character is segmented. Given the class label, each character sequence can be segmented into linguistically meaningful words. The proposed method is evaluated on the Peking University corpus of Chinese documents. Our numerical study shows that the proposed method compares favorably with the state-of-the-art segmentation methods in the literature.
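
The relationship between a character sequence and its label sequence can be illustrated with a minimal Python sketch. It assumes the common four-tag BMES scheme (B = begins a word, M = middle of a word, E = ends a word, S = single-character word); the abstract does not say which label encoding the paper uses, so the scheme, the function names `segment` and `words_to_tags`, and the example sentence are all illustrative assumptions rather than the authors' method. The sketch covers only the decoding step, that is, how a given label sequence determines the words, not how the labels are learned.

```python
from typing import List


def segment(chars: str, tags: List[str]) -> List[str]:
    """Split a character sequence into words using per-character tags.

    Assumes the BMES scheme: a word is closed after every 'E' or 'S' tag.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # this character ends a word
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def words_to_tags(words: List[str]) -> List[str]:
    """Inverse mapping: derive the BMES label sequence from a segmentation."""
    tags: List[str] = []
    for word in words:
        if not word:
            raise ValueError("words must be non-empty")
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Illustrative sentence, not taken from the paper's corpus.
sentence = "我爱北京大学"
labels = ["S", "S", "B", "M", "M", "E"]
print(segment(sentence, labels))
```

Under this encoding the label sequence and the segmentation carry the same information, which is why predicting one label per character is enough to recover the words.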

Keywords

cutting-plane algorithm, language processing, support vector machines, word segmentation

2010 Mathematics Subject Classification

Primary 62H30. Secondary 68T50.

Published 31 October 2016