As most of us learned in grammar school, language follows rules; yet, as we also know, language is subject to infinite variation, invention and evolution that no textbook or set of rules could ever cover. High-quality Natural Language Processing is a balancing act between the application of universal linguistic rules and the many probabilistic methods available today that can mine, learn and discover regularities in Big Data. It is a craft that requires technological know-how, linguistic experience and access to large quantities of relevant language data.
Based on our collection of data from the Chinese web, we apply unsupervised machine learning methods, including deep learning, topic modelling and pattern mining, to extract new lexical, grammatical and semantic information. This process runs continuously, so the newest data from the Chinese web is integrated quickly.
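To make the pattern-mining step concrete, here is a minimal sketch of one such technique: extracting frequent character n-grams from raw Chinese text as candidate lexical units. The corpus, threshold and function name are illustrative assumptions, not the production pipeline.

```python
from collections import Counter

def frequent_ngrams(texts, n=2, min_count=2):
    """Count character n-grams and keep those at or above a frequency threshold."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus: two short Chinese sentences sharing recurring units.
corpus = ["机器学习是人工智能的分支", "深度学习是机器学习的分支"]
candidates = frequent_ngrams(corpus, n=2, min_count=2)
# Recurring bigrams such as 机器, 学习 and 分支 survive the threshold.
```

In a real system the surviving n-grams would be filtered further (e.g. by statistical association measures) before entering the lexicon, but the counting-and-thresholding core is the same.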
All linguistic information is stored in a knowledge base, which comprises various lexica, ontologies and grammars. It is structured according to universal linguistic principles, such as lexical relations, phrase structure and semantic compositionality, which provide a sound foundation for integrating new linguistic information.
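As an illustration of how a knowledge base organised around lexical relations could be stored and queried, here is a small sketch using relation triples. The class name, relation labels and example entries are hypothetical, not the actual knowledge-base schema.

```python
from collections import defaultdict

class LexicalKB:
    """A toy store of (head word, relation, tail word) triples."""

    def __init__(self):
        self._by_head = defaultdict(list)  # head word -> list of (relation, tail)

    def add(self, head, relation, tail):
        self._by_head[head].append((relation, tail))

    def related(self, head, relation=None):
        """Return tail words linked to head, optionally filtered by relation."""
        pairs = self._by_head.get(head, [])
        if relation is None:
            return [t for _, t in pairs]
        return [t for r, t in pairs if r == relation]

kb = LexicalKB()
kb.add("猫", "hypernym", "动物")   # a cat is an animal
kb.add("猫", "synonym", "猫咪")    # colloquial synonym
print(kb.related("猫", "hypernym"))  # ['动物']
```

Indexing triples by head word keeps lookups cheap; a production knowledge base would add reverse indices and ontology-level constraints on top of this basic shape.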
We apply supervised and semi-supervised machine learning to annotated data sets and use the resulting models to improve the recall and confidence of our output. Our model optimisation methods can easily be reused to train models for new domains and projects.
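One common semi-supervised scheme of the kind mentioned above is self-training: a model trained on the labelled data labels its most confident unlabelled examples and is retrained on the enlarged set. The following is a hedged sketch with a toy one-dimensional nearest-centroid classifier; the data and function names are illustrative only.

```python
def centroids(points, labels):
    """Mean position of the points in each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Assign x to the class with the nearest centroid."""
    return min(cents, key=lambda y: abs(cents[y] - x))

def self_train(labelled, unlabelled, rounds=3):
    """Iteratively absorb the unlabelled point closest to any centroid."""
    points = [x for x, _ in labelled]
    labels = [y for _, y in labelled]
    pool = list(unlabelled)
    for _ in range(rounds):
        cents = centroids(points, labels)
        if not pool:
            break
        # highest-confidence unlabelled point = smallest distance to a centroid
        best = min(pool, key=lambda x: min(abs(c - x) for c in cents.values()))
        pool.remove(best)
        points.append(best)
        labels.append(predict(cents, best))
    return centroids(points, labels)

cents = self_train([(0.0, "A"), (10.0, "B")], [1.0, 9.0, 5.2])
print(predict(cents, 4.0))  # prints "A"
```

Real pipelines use far richer models and confidence estimates, but the loop structure, labelling only what the current model is most sure about, is the essence of the approach.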