Thursday, January 23, 2014

General steps of pre-processing in NLP and common problems met in dealing with French texts

We know that in NLP (Natural Language Processing), if we use the "bag-of-words" assumption (the order of words doesn't matter), a few things have to be done before we start any analysis (e.g. clustering, classification, topic extraction, etc.) on a collection of documents; let's call this collection a "corpus". The general steps are:

  1. Lowercase all the text in each document.
  2. Remove punctuation. 
  3. Remove stop words, such as "a", "is", "that"... in English, or "un", "que", "suis"... in French. These words don't carry much meaning but appear very often.
  4. Tokenization. It simply cuts "cats are walking" into "cats", "are", "walking".
  5. Stemming/Lemmatisation. In most cases you do one of them, but not both. Stemming and lemmatisation are different: stemming can convert "walking"/"walked" into "walk", but it can't do some things lemmatisation can, like converting "women" to "woman".
  6. So far, each document is a bag of words. If we call these pre-processed words "terms", we can assign an ID to each term.
  7. Represent each text/document in the Vector Space Model, so each text/document is written as a vector. Each entry of the vector is a (term ID, weight) pair. The weight can be the number of occurrences of the term in the document, or a tf-idf weight. So "cats are walking" will be represented as [(0,1),(1,1)]: 0 is the ID of the term "cat" (after stemming, "cats" becomes "cat"), and the 1 after it says the term "cat" shows up once; "are" has been removed as a stop word; "walking" turns into "walk" after being stemmed, "walk" has been assigned ID 1, and since "walk" also appears only once, its count is 1 as well. (A small runnable sketch of all seven steps follows this list.)
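To make the above concrete, here is a minimal sketch of all seven steps in Python with nltk. The toy corpus, variable names, and helper function are my own illustrations; it assumes the nltk 'punkt' and 'stopwords' data have been downloaded, and it tokenizes before filtering stop words, since that is the natural order in code:

```python
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

corpus = ["Cats are walking.", "The woman walked her cat."]
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(doc):
    doc = doc.lower()                                                # step 1: lowercase
    doc = doc.translate(str.maketrans('', '', string.punctuation))   # step 2: remove punctuation
    tokens = word_tokenize(doc)                                      # step 4: tokenization
    tokens = [t for t in tokens if t not in stop_words]              # step 3: remove stop words
    return [stemmer.stem(t) for t in tokens]                         # step 5: stemming

processed = [preprocess(doc) for doc in corpus]

# Steps 6 and 7: assign an ID to each term, then write each document
# as a sparse vector of (term ID, occurrence count) pairs.
vocab = {}
vectors = []
for tokens in processed:
    counts = {}
    for t in tokens:
        term_id = vocab.setdefault(t, len(vocab))
        counts[term_id] = counts.get(term_id, 0) + 1
    vectors.append(sorted(counts.items()))

print(vectors)  # [[(0, 1), (1, 1)], [(0, 1), (1, 1), (2, 1)]]
```

The first document comes out exactly as the [(0,1),(1,1)] example above.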
Very well. After the above 7 steps, you can basically turn a corpus into a big matrix, and with this matrix you can do whatever you want.

If you program in Python, there are a few very good libraries, for example nltk, scikit-learn, gensim and so on. nltk specializes in NLP with Python; it contains a lot of modules you can use to achieve any of the above steps with only a few lines of code. scikit-learn is a more general Machine Learning (ML) tool for Python; it contains many ML algorithms, though you may need a bit more work to apply them to texts. gensim is even more specialized: it is used to extract topics from a corpus.
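As an example, here is a hedged sketch of how gensim can do steps 6 and 7 for you and then extract topics. The token lists are the output of the preprocessing sketch above, and num_topics=2 is an arbitrary choice for illustration:

```python
from gensim import corpora, models

processed = [['cat', 'walk'], ['woman', 'walk', 'cat']]       # pre-processed documents
dictionary = corpora.Dictionary(processed)                    # step 6: term -> ID
bow_corpus = [dictionary.doc2bow(doc) for doc in processed]   # step 7: (ID, count) vectors
print(bow_corpus)  # e.g. [[(0, 1), (1, 1)], [(0, 1), (1, 1), (2, 1)]]

# Topic extraction from the bag-of-words corpus:
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```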

Dealing with English texts is much easier than working with other languages. I have been working on French texts, and French words containing the letters é, ç, è, à indeed gave me a hard time. Sometimes they just can't be displayed properly in my SQL database, or they can't be correctly encoded or decoded when I parse an XML file. It can be tricky with Notepad++, too: if you open your text with Notepad++, it is much easier to handle when it is encoded as "UTF-8 without BOM", but most of the time files are originally in "ANSI" encoding if they were produced from Microsoft Word.
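For the encoding problem, here is a minimal sketch of the conversion I have in mind, assuming the source file really was saved in ANSI (Windows-1252, what Word typically produces) and we want UTF-8 without BOM; the file names are hypothetical:

```python
# Convert a Windows-1252 ("ANSI") text file to UTF-8 without BOM.
# Assumption: the input really is cp1252; adjust the codec if it isn't.
with open('input_ansi.txt', 'rb') as f:
    raw = f.read()

text = raw.decode('cp1252')   # é, ç, è, à decode correctly from cp1252

# Python's 'utf-8' codec writes no BOM ('utf-8-sig' would add one).
with open('output_utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```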

Bigger problems occur when it comes to stemming French texts. If you are using Python's nltk, it has a Porter stemmer module; for English texts you can just use this module and, bingo, everything is done. It just can't be that easy with French texts, as French words have many more different suffixes, and the Porter stemmer doesn't support French. Lemmatizing French texts is also more difficult.
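One thing worth noting, though: while the Porter stemmer is English-only, nltk does ship a Snowball-family stemmer that covers French. A minimal sketch (the example words are my own, and the stems it produces can be rough):

```python
# French stemming with nltk's Snowball stemmer (Porter handles only English).
from nltk.stem.snowball import FrenchStemmer  # equivalently: SnowballStemmer('french')

stemmer = FrenchStemmer()
for word in ['marchais', 'marchions', 'continuité']:
    print(word, '->', stemmer.stem(word))
```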

Part-of-speech (POS) tagging of French sentences is tricky too, and unfortunately, so far nltk doesn't have a well-implemented POS tagging tool for French. The Stanford NLP group has updated its POS tagger with an extension for French that can be used from Python, but I didn't find it very easy to use. After some exploration on Google, I luckily found TreeTagger, a language-independent POS tagger. It is language-independent because it is based on Markov models.

In my next blog post, I will explain how to use TreeTagger to POS tag French texts.
