1) Go to http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
2) Download the Windows version of TreeTagger,
unzip it,
and place the folder under C:\ so that you get C:\TreeTagger.
Add the path "C:\TreeTagger\bin" to the PATH environment variable.
3) Download the parameter file for your language
and unzip it; you get a <language>.par file.
Move this <language>.par file to the "lib" folder under "C:\TreeTagger".
4) Install a Perl interpreter if you don't already have one; ActivePerl can be downloaded from http://www.activestate.com/activeperl/
Open "cmd" and type:
set PATH=C:\TreeTagger\bin;%PATH%
Then change directory to C:\TreeTagger
and test the tagger by typing:
tag-french INSTALL.txt
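If the test command fails, it can help to check that the folders and files from steps 2 and 3 are where the tagger expects them. Here is a minimal Python sketch of such a check. The function name is my own, and "french.par" in the usage note is only an assumed file name; use whatever <language>.par file you actually downloaded.

```python
from pathlib import Path


def missing_treetagger_files(root, par_name):
    """Return the expected TreeTagger paths that do not exist under root."""
    root = Path(root)
    expected = [
        root / "bin",             # folder added to PATH in step 2
        root / "lib" / par_name,  # parameter file moved in step 3
    ]
    return [str(path) for path in expected if not path.exists()]
```

For example, missing_treetagger_files(r"C:\TreeTagger", "french.par") returns an empty list when both the "bin" folder and the parameter file are in place, and otherwise lists whatever is missing.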
At this point TreeTagger should be installed. Next, I installed the TreeTagger interface:
5) Download the interface from http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm
unzip it,
put it into C:\TreeTagger\bin,
and double-click it to run it.
In the interface, I just choose the language I want to tag, specify an input file and an output file, and click "Run".
The results show that the input text should not be stemmed beforehand. Removing punctuation and lowercasing the text also affect the results. Since the tagger is based on a Markov model, it relies on the context of each word, so the best choice is to give it the raw text as input and do any other processing steps afterwards, if they are necessary.
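To illustrate "tag first, clean up afterwards", here is a rough Python sketch that reads already-tagged output and only then drops punctuation and lowercases the lemmas. It assumes the output has one token per line in the form word, tag, lemma separated by tabs, and that punctuation carries the tags PUN, PUN:cit or SENT (my assumption for the French tagset; adjust for other languages). The function names are my own.

```python
# Tags treated as punctuation; assumed for the French tagset, adjust per language.
PUNCTUATION_TAGS = {"PUN", "PUN:cit", "SENT"}


def parse_tagged(text):
    """Parse tagger output with one token per line: word<TAB>tag<TAB>lemma."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            rows.append(tuple(parts))
    return rows


def clean_lemmas(rows, punctuation_tags=PUNCTUATION_TAGS):
    """Drop punctuation tokens and lowercase the lemmas, after tagging."""
    return [lemma.lower() for _word, tag, lemma in rows
            if tag not in punctuation_tags]
```

This way the tagger still sees the capitalisation and punctuation of the raw text, and the clean-up is applied only to its output.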
A minor problem with the output:
The output file doesn't display accented French letters correctly. To see them properly, you can find and replace them manually in Notepad++:
Ã© = é
Ã¨ = è
Ãª = ê
Ã  = à
Ã§ = ç
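If the garbling comes from UTF-8 text being displayed as Windows-1252 (this is my assumption about the cause; I have not confirmed it for every setup), the replacements can also be done in one pass instead of letter by letter. A minimal Python sketch, with a function name of my own choosing:

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was decoded as Windows-1252 (assumed cause)."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of garbling, or already correct: leave unchanged.
        return text
```

The function converts the whole string or nothing, so for a file that mixes garbled and correct lines it is safer to apply it line by line.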
A small tip, in case you have trouble replacing white space with a line break in Notepad++:
1. Replace all white space with 111111 (any placeholder that does not occur in your text will do, not only 111111)
2. Replace 111111 with \n (in Notepad++, make sure the "Extended" search mode is selected)
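The same replacement can be done outside Notepad++ without a placeholder. Here is a small Python sketch (the function name is mine) that puts every white-space-separated item on its own line:

```python
import re


def one_token_per_line(text):
    """Replace every run of white space with a single newline."""
    return re.sub(r"\s+", "\n", text.strip())
```

Unlike the two-step replacement above, this collapses several consecutive spaces into one line break, so it does not produce empty lines.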