Sunday, January 26, 2014

How to POS tag French texts?

In my last post, I talked about the general steps of pre-processing in NLP and some problems encountered when dealing with French texts. A question raised at the end was how to POS tag French texts. I found that an easy way is to use TreeTagger. As there is a lot to read on the website, here is a simplified version of the installation on a Windows 7 (64-bit) system.

1) Go to http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
2) Download the Windows version of TreeTagger,
unzip it,
and place the TreeTagger folder under C:\ (so that you have C:\TreeTagger),
then add the path "C:\TreeTagger\bin" to the PATH environment variable.
3) Download the language parameter file,
unzip it to get a <language>.par file,
and move this <language>.par file to the "lib" folder under "C:\TreeTagger".
4) Install a Perl interpreter (if you don't have one), which can be downloaded at http://www.activestate.com/activeperl/
Then open "cmd" and type:
set PATH=C:\TreeTagger\bin;%PATH%
Next, change directory to C:\TreeTagger
and test the tagger by typing:
tag-french INSTALL.txt

At this point, TreeTagger should be installed. Afterwards, I installed the TreeTagger interface:

5) Download the interface from http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm
unzip it,
put it into C:\TreeTagger\bin,
and double-click it to run it.

In the interface, I just choose the language I want to tag, indicate an input file and an output file, and click "Run".
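
To use the output file in a program, the tagged lines can be read back in. Here is a minimal Python sketch, assuming the output has one token per line with three tab-separated columns (word, POS tag, lemma). The names TaggedToken and parse_tagged are made up for this example, and the sample string is hand-written in the assumed format, not real tagger output.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> List[TaggedToken]:
    """Parse tagger output with one 'word<TAB>tag<TAB>lemma' entry per line."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        tokens.append(TaggedToken(*parts))
    return tokens


# Hand-written sample in the assumed three-column format (illustrative only)
sample = "Le\tDET:ART\tle\nchat\tNOM\tchat\ndort\tVER:pres\tdormir\n"

for tok in parse_tagged(sample):
    print(tok.word, tok.pos, tok.lemma)
```

If your output file has a different column layout, only the length check and the TaggedToken fields need to change.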

It turns out that the input text should not be stemmed beforehand. Removing punctuation and lowercasing the text also affect the results. Since this application is based on a Markov model, which relies on the context of each word, the best choice when using this tool is to take the raw text as input and then do the other pre-processing steps afterwards if they are necessary.
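
To see why a Markov-model tagger needs the words in their original context, here is a toy sketch of Viterbi decoding for a hidden Markov model in Python. It is not TreeTagger's actual implementation: the tag set is cut down to four tags, and every probability below is invented purely for illustration.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under a toy HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# All numbers below are made up for the example
tags = ["DET", "NOM", "PRO", "VER"]
start_p = {"DET": 0.5, "NOM": 0.1, "PRO": 0.3, "VER": 0.1}
trans_p = {
    "DET": {"DET": 0.025, "NOM": 0.9, "PRO": 0.025, "VER": 0.05},
    "NOM": {"DET": 0.2, "NOM": 0.1, "PRO": 0.2, "VER": 0.5},
    "PRO": {"DET": 0.05, "NOM": 0.05, "PRO": 0.3, "VER": 0.6},
    "VER": {"DET": 0.5, "NOM": 0.2, "PRO": 0.2, "VER": 0.1},
}
emit_p = {
    "DET": {"la": 0.5},
    "NOM": {"porte": 0.3},
    "PRO": {"il": 0.4, "la": 0.2},
    "VER": {"porte": 0.3},
}

print(viterbi(["la", "porte"], tags, start_p, trans_p, emit_p))
print(viterbi(["il", "la", "porte"], tags, start_p, trans_p, emit_p))
```

With these toy numbers, "la porte" is tagged as determiner + noun, while in "il la porte" the same two words come out as pronoun + verb. The tag of each word depends on its neighbours, which is why changing the text before tagging can change the result.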

A minor problem with the output:
The output file can't display accented French letters properly. To see them correctly, you can manually find and replace in Notepad++:
Ã© = é
Ã¨ = è
Ãª = ê
Ã  = à (the character after Ã is a non-breaking space)
Ã§ = ç
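
The same repair can be done in one go with a few lines of Python. This is only a sketch, assuming the garbling comes from UTF-8 text that was displayed as Windows-1252 ("ANSI"); the function name fix_mojibake is made up for this example.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of garbling, so leave the text unchanged


# Build a garbled string the same way the problem arises (assumption above)
garbled = "français été".encode("utf-8").decode("cp1252")

print(garbled)
print(fix_mojibake(garbled))
```

Text that is already correct is returned unchanged, so it is safe to run the function on a whole file.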

A small tip in case you have trouble replacing white space with line breaks in Notepad++:

1. Replace all white space with 111111 (any placeholder that does not occur in your text will do).
2. Replace 111111 with \n (in Notepad++, make sure "Extended" is checked).
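
If you prefer a script to Notepad++, the same result takes one line of Python. A minimal sketch (the function name one_token_per_line is made up for this example):

```python
def one_token_per_line(text: str) -> str:
    """Replace every run of white space with a single line break."""
    return "\n".join(text.split())


print(one_token_per_line("le chat  dort"))
```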




Thursday, January 23, 2014

General steps of pre-processing in NLP and common problems met in dealing with French texts

We know that in NLP (Natural Language Processing), under the "bag-of-words" assumption (the order of words doesn't matter), a few things have to be done before we start any analysis (e.g. clustering, classification, topic extraction, etc.) on a collection of documents; let's call this collection a "corpus". The general steps are:

  1. Lowercase all the text in each document.
  2. Remove punctuation. 
  3. Remove stop words, such as "a", "is", "that"... in English, or "un", "que", "suis"... in French. These words carry little meaning of their own but appear very often.
  4. Tokenization. It simply cuts "cats are walking" into "cats" "are" "walking".
  5. Stemming/Lemmatisation. In most cases you do one of them, but not both. Stemming and lemmatisation are different: stemming can convert "walking/walked" into "walk", but it can't do something lemmatisation can do, like converting "women" to "woman".
  6. At this point, each document is a collection of words. If we call the pre-processed words "terms", we can assign an ID to each term.
  7. Represent the texts (documents) in the Vector Space Model, so that each document is written as a vector. The vector contains pairs of a term ID and a weight. The weight can be the number of occurrences of the term in the document, or it can be a tf-idf weight. So "cats are walking" will be represented as [(0,1),(1,1)]: 0 is the ID of the term "cat" (after stemming, "cats" becomes "cat"), and the 1 after it says that "cat" shows up once; "are" has been removed as a stop word; "walking" turns into "walk" after stemming, "walk" has been assigned ID 1, and since "walk" also appears only once, its count is also 1.
Very well. After the above 7 steps, you can basically turn a corpus into a big matrix, and with this matrix you can do whatever you want.
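
The steps above can be sketched in plain Python with only the standard library. This is a toy version: the stop-word list is tiny, and toy_stem is a crude stand-in for a real stemmer (both are made up for this example). Note that the code tokenizes before removing stop words, since stop words are easier to drop once the text is split.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list, not a real one
STOP_WORDS = {"a", "is", "are", "that"}


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Steps 1-5: lowercase, strip punctuation, tokenize, drop stop words, stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def to_bow(terms, term_ids):
    """Steps 6-7: assign IDs to new terms and count occurrences per term ID."""
    for t in terms:
        term_ids.setdefault(t, len(term_ids))  # next free ID for unseen terms
    counts = Counter(term_ids[t] for t in terms)
    return sorted(counts.items())


term_ids = {}
vector = to_bow(preprocess("Cats are walking."), term_ids)
print(term_ids)  # {'cat': 0, 'walk': 1}
print(vector)    # [(0, 1), (1, 1)]
```

Passing the same term_ids dictionary to to_bow for every document keeps the IDs consistent across the corpus, so the vectors can be stacked into one matrix.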

If you program in Python, you have a few very good libraries at hand, for example nltk, scikit-learn, gensim and so on. nltk is specialized in NLP with Python; it contains a lot of modules, which you can use to achieve most of the above steps with only a few lines of code. scikit-learn is a more general tool for Machine Learning (ML) with Python; it contains many ML algorithms, but you may need a bit more work to apply them to texts. gensim is even more specialized: it is used to extract topics from a corpus.

Dealing with English texts is much easier than working with other languages. I have been working on French texts, and French words that contain letters such as é, ç, è, à gave me a hard time. Sometimes they can't be displayed properly in my SQL database, or they can't be correctly encoded or decoded when I parse an XML file. I found that Notepad++ can be tricky, too. If you open your text with Notepad++, it is much easier to handle when it is set to "Encode in UTF-8 without BOM". But most of the time files are originally in "Encode in ANSI" if they were produced from Microsoft Word.
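
Converting a file from "ANSI" to UTF-8 without BOM can also be scripted. Here is a minimal Python sketch, assuming that "ANSI" means Windows-1252 (the usual case on a Western European Windows system); the function name convert_to_utf8 and the file names are made up for this example.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file as UTF-8 (Python's "utf-8" codec writes no BOM)."""
    # newline="" keeps the original line endings untouched
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Small demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "ansi.txt")
    dst = os.path.join(tmp, "utf8.txt")
    with open(src, "wb") as f:
        f.write("é, ç, è, à".encode("cp1252"))
    convert_to_utf8(src, dst)
    with open(dst, "r", encoding="utf-8") as f:
        print(f.read())
```

If the source file uses another encoding, pass it as src_encoding instead of relying on the default.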

Bigger problems occur when it comes to stemming French texts. If you are using Python's nltk, it has a Porter stemming module; for English texts, you can just use this module and "Bingo", everything is done. But it isn't that easy with French texts, as French words have a greater variety of suffixes, and the Porter stemmer doesn't support French. It is also more difficult to lemmatize French texts.

Part-of-Speech (POS) tagging of French sentences is tricky too. Unfortunately, so far nltk doesn't have a well-implemented POS tagging tool for French. The Stanford NLP group has updated its POS tagger with an extension to French, which can be used from Python, but I didn't find it very easy to use. After some exploration on Google, I luckily found TreeTagger, a language-independent POS tagger. It is language independent because it is based on Markov models, which are trained from data for each language rather than built on hand-written, language-specific rules.

In my next post, I will explain how to use TreeTagger to POS tag French texts.