Next, I decided to reduce the number of words that don't carry any meaning of their own. An obvious example from English is the article "the". Looking up "the" in open-tran won't display any results, because the engine treats this phrase as empty (no words). However, I speak only English and German, so together with my girlfriend and Wikipedia we prepared the following list of articles in several European languages:
Code | Language | List of ignored lexemes |
---|---|---|
de | German | das, dem, den, der, deren, des, dessen, die, ein, eine, einem, einen |
en | English | a, an, the |
es | Spanish | el, la, las, los, una, uno, unas, unos |
fr | French | la, le, les, un, une |
it | Italian | i, il, lo, gli, la, le, un, uno, una |
pl | Polish | by |
pt | Portuguese | o, os, a, as, um, uns, uma, umas |
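As a rough illustration of how a list like this can be applied, here is a minimal Python sketch. It is not open-tran's actual code: the `IGNORED` dictionary and the `strip_ignored` function are hypothetical names, and the word lists are simply copied from the table above.

```python
# Hypothetical ignore lists, copied from the table above.
IGNORED = {
    "de": {"das", "dem", "den", "der", "deren", "des", "dessen",
           "die", "ein", "eine", "einem", "einen"},
    "en": {"a", "an", "the"},
    "es": {"el", "la", "las", "los", "una", "uno", "unas", "unos"},
    "fr": {"la", "le", "les", "un", "une"},
    "it": {"i", "il", "lo", "gli", "la", "le", "un", "uno", "una"},
    "pl": {"by"},
    "pt": {"o", "os", "a", "as", "um", "uns", "uma", "umas"},
}


def strip_ignored(phrase, lang):
    """Return the lowercased words of `phrase` that are not ignored for `lang`."""
    # Languages without an entry get an empty ignore list.
    ignored = IGNORED.get(lang, set())
    return [word for word in phrase.lower().split() if word not in ignored]
```

With this sketch, `strip_ignored("the", "en")` returns an empty list, which corresponds to the "empty phrase" case described above.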
I am aware that suffixes may be used as articles in some languages (e.g. Romanian). If you have any idea how to tackle this issue without integrating expensive, language-specific dictionaries, let me know.
If you have other ideas on how to improve the accuracy and/or limit the number of records stored in the database, I would appreciate your feedback. Leave a comment here, or send an e-mail to open-tran@groups.google.com. Thanks!
6 comments:
Adpositions are good candidates for being ignored.
I'm not familiar with the algorithms used for matching, but was wondering if stemming might help. You can look at Snowball for that. As for Afrikaans (af), you can use " 'n, die " as our two articles. Some of the other South African languages are harder, since they don't use articles in this way.
hi...
In pt and pt_BR you can also add the following:
"de, da, das, do, dos", which mean 'of', and "em, na, nas, no, nos", which mean 'in'. On second thought, you can leave "nos" out of the list, because of the word "nós", which means 'us' in pt.
Regards,
Luiz
For French:
(un, une) plural = "des"
of = "de" or "d'" (before a vowel)
le = "l'" before a vowel
Do you need my, his, her, their, etc.?
You can add «l'» and «un'» to Italian articles; they are used before vowels without spaces before the following word. See http://it.wikipedia.org/wiki/Articolo_(grammatica)
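The elided forms mentioned in the French and Italian comments above attach directly to the following word, so a plain word-list lookup would miss them. One possible way to handle them is sketched below in Python. The `ELIDED` table and the `strip_elided` function are hypothetical, and the prefixes are only the ones named in these comments.

```python
# Hypothetical elided article/preposition prefixes, taken from the comments
# above: French "l'" and "d'", Italian "l'" and "un'".
ELIDED = {
    "fr": ("l", "d"),
    "it": ("l", "un"),
}


def strip_elided(word, lang):
    """Drop a leading elided form such as "l'" from `word`, if present."""
    for prefix in ELIDED.get(lang, ()):
        # Accept both the ASCII apostrophe and the typographic one.
        for apostrophe in ("'", "\u2019"):
            head = prefix + apostrophe
            if word.lower().startswith(head) and len(word) > len(head):
                return word[len(head):]
    return word
```

Words that merely start with the same letter, such as "lune", are left alone because the apostrophe is part of the match.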
Norwegian Bokmål:
en, ei, et, den, det
Norwegian Nynorsk:
ein, ei, eit, den, det
Georgian (Kartuli) has no articles (no definiteness marking).
BTW, have you tried looking at frequency lists for the various languages? If you find words of <= 3 characters in the top 10, you could likely cull them out. Actually, for a large enough corpus, the top 10 will probably be quite useless for your purposes; or?
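The frequency-list idea above could be tried out with something like the following Python sketch. It assumes the corpus is available as one plain string and that splitting on whitespace is good enough. The function name `short_frequent_words` and its default thresholds (top 10, at most 3 characters) are illustrative only.

```python
from collections import Counter


def short_frequent_words(corpus, top_n=10, max_len=3):
    """Return words from the `top_n` most frequent that have at most `max_len` characters."""
    counts = Counter(corpus.lower().split())
    # most_common() sorts by descending count.
    return [word for word, _ in counts.most_common(top_n) if len(word) <= max_len]
```

The result is only a list of candidates. Short words that do carry meaning would still have to be reviewed by hand before being added to an ignore list.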