Saturday, August 9, 2008

Articles in various languages

I need your advice again. In an effort to further reduce the size of the database and provide more accurate results, I am trying to exclude various types of "words" (lexemes or tokens, to be precise). First, I removed all format specifiers, like %s. Next, I decided to exclude words that don't carry any meaning on their own. An obvious example from the English language is the article "the". Looking up "the" in open-tran won't display any results, because the engine considers this phrase empty (no words). However, I speak only English and German. With the help of my girlfriend and Wikipedia, I prepared the following list of articles in several European languages:
Code  Language    List of ignored lexemes
de    German      das, dem, den, der, deren, des, dessen, die, ein, eine, einem, einen
en    English     a, an, the
es    Spanish     el, la, las, los, una, uno, unas, unos
fr    French      la, le, les, un, une
it    Italian     i, il, lo, gli, la, le, un, uno, una
pl    Polish      by
pt    Portuguese  o, os, a, as, um, uns, uma, umas
I am planning to add Dutch articles (de, een, het) in the future, too. As you can see, my table covers only 7 languages, while open-tran supports more than 90. So I hope that you can help me assemble similar lists for the remaining languages.
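
To make the idea concrete, here is a minimal sketch of the filtering step in Python. It is not open-tran's actual code: the names IGNORED_LEXEMES and significant_words are made up for illustration, and splitting on whitespace is an assumption about tokenization. Only the word lists come from the table above.

```python
# Word lists copied from the table above. The dictionary and function
# names are hypothetical and not part of open-tran.
IGNORED_LEXEMES = {
    "de": {"das", "dem", "den", "der", "deren", "des", "dessen",
           "die", "ein", "eine", "einem", "einen"},
    "en": {"a", "an", "the"},
    "es": {"el", "la", "las", "los", "una", "uno", "unas", "unos"},
    "fr": {"la", "le", "les", "un", "une"},
    "it": {"i", "il", "lo", "gli", "la", "le", "un", "uno", "una"},
    "pl": {"by"},
    "pt": {"o", "os", "a", "as", "um", "uns", "uma", "umas"},
}


def significant_words(phrase, lang):
    """Return the lowercased words of `phrase` that are not ignored for `lang`.

    Assumption: words are separated by whitespace. Languages without a
    list are left unfiltered.
    """
    ignored = IGNORED_LEXEMES.get(lang, set())
    return [word for word in phrase.lower().split() if word not in ignored]
```

With these lists, significant_words("the", "en") returns an empty list, which matches the "empty phrase" behaviour described above, while a language code without a list (for example Dutch, for now) passes through unchanged.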

I am aware that some languages (e.g. Romanian) attach articles to words as suffixes. If you have any idea how to tackle this issue without integrating expensive, language-specific dictionaries, let me know.

If you have other ideas on how to improve the accuracy and/or limit the number of records stored in the database, I would appreciate your feedback. Leave a comment here, or send an e-mail to open-tran@groups.google.com. Thanks!

6 comments:

Anonymous said...

Adpositions are good candidates for being ignored.

Anonymous said...

I'm not familiar with the algorithms used for matching, but was wondering if stemming might help. You can look at snowball for that. As for Afrikaans (af), you can use " 'n, die " as our two articles. Some of the other South African languages are harder, since there are no articles used in this way.

elchevive said...

hi...

in pt and pt_BR you can add the following too:
"de, da, das, do, dos -> means 'of'; em, na, nas, no, nos -> means 'in'". On second thought, you may want to leave "nos" out of the list, because of the word "nós", which means 'us' in pt.

Regards,

Luiz

Anonymous said...

For French:
(un, une) plural = "des"
of = "de" or "d'" (before a vowel)
le = "l'" before a vowel

Do you need My, His, Her, Their etc. ?

Anonymous said...

You can add «l'» and «un'» to Italian articles; they are used before vowels without spaces before the following word. See http://it.wikipedia.org/wiki/Articolo_(grammatica)
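
Because elided articles such as "l'" and "un'" are glued to the following word, a plain word-list lookup will not catch them. Below is a rough Python sketch of one way to strip them. The names ELIDED_PREFIXES and strip_elided_article are hypothetical, the prefixes are only the ones mentioned in the comments above, and nothing here is taken from open-tran itself.

```python
# Elided forms mentioned in the comments above. Names are hypothetical.
ELIDED_PREFIXES = {
    "fr": ("l'", "d'"),
    "it": ("l'", "un'"),
}


def strip_elided_article(word, lang):
    """Remove a leading elided article (e.g. "l'") from `word`, if present.

    Assumption: a typographic apostrophe (U+2019) is treated the same
    as the ASCII one. A word that consists only of the prefix is kept.
    """
    normalized = word.replace("\u2019", "'")
    lowered = normalized.lower()
    for prefix in ELIDED_PREFIXES.get(lang, ()):
        if lowered.startswith(prefix) and len(lowered) > len(prefix):
            return normalized[len(prefix):]
    return normalized
```

For example, strip_elided_article("l'amico", "it") gives "amico". The caveat is that this is purely mechanical: any word that merely starts with the same characters would be cut as well, so the prefix lists should stay short.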

Kevin Brubeck Unhammer said...

Norwegian Bokmål:
en, ei, et, den, det

Norwegian Nynorsk:
ein, ei, eit, den, det

Georgian (Kartuli) has no articles (no definiteness marking).

BTW, have you tried looking at frequency lists for the various languages? If you find words of <= 3 characters in the top 10, you could likely cull them out. Actually, for a large enough corpus, the top 10 will probably be quite useless for your purposes; or?
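
The frequency-list idea can be sketched in a few lines of Python. This is only an illustration: the function name short_frequent_words is made up, the defaults of 10 and 3 simply mirror the numbers in the comment above, and splitting on whitespace is an assumption about tokenization.

```python
from collections import Counter


def short_frequent_words(texts, top_n=10, max_len=3):
    """Return words from the `top_n` most frequent that have at most `max_len` characters.

    Assumption: `texts` is an iterable of strings in one language and
    words are separated by whitespace.
    """
    counts = Counter(
        word for text in texts for word in text.lower().split()
    )
    return [word for word, _ in counts.most_common(top_n) if len(word) <= max_len]
```

The result is a list of candidates only. It would still need a manual review, since short, frequent words are not necessarily meaningless in every language.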