Open-Tran.eu: Articles in various languages

Saturday, August 9, 2008

Articles in various languages

I need your advice again. In an effort to further reduce the size of the database and provide more accurate results, I am trying to exclude various types of "words" (lexemes or tokens, to be precise). First, I removed all format specifiers, like %s. Next, I decided to reduce the number of words that don't carry any meaning. An obvious example from the English language is the article "the". Looking up "the" in open-tran won't display any results, because the engine considers this phrase empty (no words). However, I speak only English and German. Together with my girlfriend and Wikipedia, we prepared the following list of articles in several European languages:

Code	Language	List of ignored lexemes
de	German	das, dem, den, der, deren, des, dessen, die, ein, eine, einem, einen
en	English	a, an, the
es	Spanish	el, la, las, los, una, uno, unas, unos
fr	French	la, le, les, un, une
it	Italian	i, il, lo, gli, la, le, un, uno, una
pl	Polish	by
pt	Portuguese	o, os, a, as, um, uns, uma, umas

I am planning to add Dutch articles (de, een, het) in the future, too. As you can see, my table covers only 7 languages and open-tran supports more than 90. So I hope that you could help me assemble similar lists for the remaining languages.
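
To make the idea concrete, here is a minimal sketch in Python of the filtering described above. It is not the actual open-tran code: the names `IGNORED`, `FORMAT_SPEC` and `significant_words` are made up for illustration, the format-specifier pattern only assumes printf-style specifiers such as %s or %1$s, and the word splitting is a simple regular expression rather than whatever the engine really uses.

```python
import re

# Ignore lists copied from the table above (language code -> lexemes).
IGNORED = {
    "de": {"das", "dem", "den", "der", "deren", "des", "dessen",
           "die", "ein", "eine", "einem", "einen"},
    "en": {"a", "an", "the"},
    "es": {"el", "la", "las", "los", "una", "uno", "unas", "unos"},
    "fr": {"la", "le", "les", "un", "une"},
    "it": {"i", "il", "lo", "gli", "la", "le", "un", "uno", "una"},
    "pl": {"by"},
    "pt": {"o", "os", "a", "as", "um", "uns", "uma", "umas"},
}

# Assumption: printf-style specifiers only (%s, %d, %1$s, %-5.2f, ...).
FORMAT_SPEC = re.compile(r"%(?:\d+\$)?[-+#0]*\d*(?:\.\d+)?[a-zA-Z]")

# Assumption: a "word" is any run of letters, digits or underscores.
WORD = re.compile(r"\w+")


def significant_words(phrase, lang):
    """Return the lowercased words of `phrase` that are worth indexing."""
    # Drop format specifiers first so that "%s" does not leave a stray "s".
    text = FORMAT_SPEC.sub(" ", phrase)
    ignored = IGNORED.get(lang, set())
    words = (w.lower() for w in WORD.findall(text))
    return [w for w in words if w not in ignored]
```

With this sketch, `significant_words("the", "en")` returns an empty list, which corresponds to the "empty phrase" case mentioned above, and a language without an ignore list is simply left unfiltered.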

I am aware that suffixes may be used as articles in some languages (e.g. Romanian). If you have any idea how to tackle this issue without integrating expensive, language-specific dictionaries, let me know.

If you have other ideas on how to improve the accuracy and/or limit the number of records stored in the database, I would appreciate your feedback. Leave a comment here, or send an e-mail to open-tran@groups.google.com. Thanks!

6 comments:

Anonymous said...: Adpositions are good candidates for being ignored.; August 13, 2008 at 5:10 AM
Anonymous said...: I'm not familiar with the algorithms used for matching, but was wondering if stemming might help. You can look at snowball for that. As for Afrikaans (af), you can use " 'n, die " as our two articles. Some of the other South African languages are harder, since there are no articles used in this way.; August 15, 2008 at 12:25 PM
elchevive said...: hi...

in pt and pt_BR you can add the following too:
"de, da, das, do, dos -> means 'of'; em, na, nas, no, nos -> means 'in'". Thinking better, the "nos" you can leave, because of the word "nós" which means 'us' in pt.

Regards,

Luiz; August 18, 2008 at 7:41 PM
Anonymous said...: For French:
(un, une) plural = "des"
of = "de" or "d'" (before a vowel)
le = "l'" before a vowel

Do you need My, His, Her, Their etc. ?; August 19, 2008 at 3:33 PM
Anonymous said...: You can add «l'» and «un'» to Italian articles; they are used before vowels without spaces before the following word. See http://it.wikipedia.org/wiki/Articolo_(grammatica); September 22, 2008 at 9:57 PM
Kevin Brubeck Unhammer said...: Norwegian Bokmål:
en, ei, et, den, det

Norwegian Nynorsk:
ein, ei, eit, den, det

Georgian (Kartuli) has no articles (no definiteness marking).

BTW, have you tried looking at frequency lists for the various languages? If you find words of <= 3 characters in the top 10, you could likely cull them out. Actually, for a large enough corpus, the top 10 will probably be quite useless for your purposes; or?; September 25, 2009 at 1:44 PM
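
The frequency-list suggestion in the last comment could be tried with something like the following Python sketch. It is only an illustration under assumptions: `stopword_candidates` is a hypothetical helper, the thresholds (top 10, at most 3 characters) are taken from the comment, and the result is a list of candidates to review by hand, not a ready-made ignore list.

```python
import re
from collections import Counter


def stopword_candidates(phrases, top_n=10, max_len=3):
    """Return short words found among the `top_n` most frequent words."""
    # Count every lowercased word across all phrases of one language.
    counts = Counter(
        word.lower()
        for phrase in phrases
        for word in re.findall(r"\w+", phrase)
    )
    # Keep only the most frequent words that are also short.
    return [w for w, _ in counts.most_common(top_n) if len(w) <= max_len]


# Tiny made-up sample; a real run would use all phrases of one language.
sample = ["Open the file", "Save the file", "Close the window"]
candidates = stopword_candidates(sample, top_n=2)
```

Short but meaningful words can also rank high in such a list, so the output would still need a check by someone who speaks the language.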
