Sunday, January 26, 2014

How to POS tag French texts?

In my last post, I talked about the general steps of pre-processing in NLP and some problems encountered when dealing with French texts. A question raised at the end was how to POS tag French texts. I found that an easy way is to use TreeTagger. As there is a lot to read on the website, here is a simplified version of the installation on a Windows 7 (64-bit) system.

1) Go to http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
2) Download the Windows version of TreeTagger,
unzip it,
and place the TreeTagger folder under C:\
Add the path "C:\TreeTagger\bin" to the PATH environment variable.
3) Download the parameter file for your language,
unzip it to get a <language>.par file,
and move this <language>.par file to the "lib" folder under "C:\TreeTagger".
4) Install a Perl interpreter (if you don't have one), which can be downloaded at http://www.activestate.com/activeperl/
Open "cmd" and type:
set PATH=C:\TreeTagger\bin;%PATH%
Then change directory to C:\TreeTagger
and test the tagger by typing:
tag-french INSTALL.txt
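
As a quick sanity check of the folder layout described in the steps above, a small Python sketch like the following could list what is missing. It only relies on the folder names used in this post (C:\TreeTagger, bin, lib); the parameter file name french.par and the function name missing_parts are illustrative guesses, so adjust them to whatever file you actually unzipped.

```python
from pathlib import Path


def missing_parts(root, par_file="french.par"):
    """Return the expected TreeTagger paths that do not exist under root."""
    root = Path(root)
    expected = [
        root / "bin",             # tagger programs (this folder is added to PATH)
        root / "lib",             # parameter files go here
        root / "lib" / par_file,  # assumed name of the unzipped parameter file
    ]
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    # Root folder as used in this post; prints nothing if everything is in place.
    for path in missing_parts(r"C:\TreeTagger"):
        print("missing:", path)
```

An empty result only means the files are where the steps put them; running tag-french is still the real test.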

At this point, TreeTagger should be installed. Afterwards, I installed the TreeTagger interface:

5) Download the interface from http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm
unzip it,
put it into C:\TreeTagger\bin
and double-click it to run it.

In the interface, I just choose the language I want to tag, specify an input file and an output file, and click "Run".

Trying it out shows that the input text should not be stemmed first. Removing punctuation and lowercasing the text also affect the results. Since this application is based on a Markov model, which relies on the sequence of surrounding words and tags, the best choice when using this tool is to give it the raw text as input and to do the other steps afterwards, if they are necessary.
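
To illustrate doing the other steps after tagging rather than before, here is a minimal Python sketch. It assumes the output file has one token per line with three tab-separated columns (word, POS tag, lemma) and that punctuation carries the tags PUN, PUN:cit or SENT in the French tagset; check both against your own output. The function names and the sample sentence are made up for illustration.

```python
def parse_tagged(text):
    """Split tagger output into (word, tag, lemma) tuples, skipping odd lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # assumed layout: word<TAB>tag<TAB>lemma
            rows.append(tuple(parts))
    return rows


def clean(rows, punct_tags=("PUN", "PUN:cit", "SENT")):
    """Post-process after tagging: drop punctuation, lowercase the words."""
    return [(w.lower(), t, l) for w, t, l in rows if t not in punct_tags]


# Hand-written sample in the assumed format, not real tagger output.
sample = "Le\tDET:ART\tle\nchat\tNOM\tchat\ndort\tVER:pres\tdormir\n.\tSENT\t.\n"
print(clean(parse_tagged(sample)))
```

Because the tags are already attached, lowercasing and punctuation removal at this stage no longer influence the tagging itself.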

A minor problem with the output:
The output file doesn't display accented French letters correctly. To see them properly, you can manually find and replace in Notepad++:
Ã© = é
Ã¨ = è
Ãª = ê
Ã  = à (Ã followed by a non-breaking space)
Ã§ = ç
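
The same repair can be scripted instead of done by hand. Accented letters turning into two odd characters often means that UTF-8 text is being displayed as Windows-1252; that diagnosis is an assumption here, not something confirmed for every setup. If it holds, re-encoding and decoding reverses the damage. The function name fix_mojibake is made up for this sketch:

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the assumed kind of damage (or already correct): leave unchanged.
        return text


# "\u00c3\u00a9" is the two-character garbled form of "é".
print(fix_mojibake("\u00c3\u00a9t\u00c3\u00a9"))
```

Text that is already correct comes back unchanged, so it is safe to run over a whole file line by line.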

A small tip that may help if you have trouble replacing white space with line breaks in Notepad++:

1. Replace all white space with 111111 (any placeholder that does not occur in the text will do, not only 111111).
2. Replace 111111 with \n (in Notepad++, make sure the "Extended" search mode is selected).
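
If the text goes through a script anyway, the same one-token-per-line result can be had without a placeholder. A minimal Python sketch (the function name is made up here; unlike the Notepad++ steps, it collapses several spaces in a row into one line break):

```python
def one_token_per_line(text):
    """Replace every run of white space with a single line break."""
    return "\n".join(text.split())


print(one_token_per_line("Le chat  dort ."))
```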



