Language detection for Android: Given a string of text, identify what language the text is written in.
This project is a fork of an excellent Java language detection library (language-detection) written by Nakatani Shuyo. The original git version control history and commit messages are retained in this project.
I've made two significant changes to the original code:
- Speed enhancements. As an alternative to using JSON-based text files for storing language profiles, a Python script is used to convert language profiles into Java code that can be bundled with an app. With the resulting performance improvement, language detection is fast enough to run acceptably on Android devices.
- Additional language profiles:
- Aragonese
- Asturian
- Basque
- Belarusian
- Breton
- Catalan
- Galician
- Haitian
- Icelandic
- Irish
- Malay
- Maltese
- Occitan
- Serbian (Cyrillic alphabet)
- Welsh
- Yiddish
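As a rough illustration of the speed enhancement described above, the sketch below shows what a language profile stored as Java code could look like. It is not the actual output of the Python conversion script: the class names, the field layout, and the n-gram counts are all invented for this example. The idea it illustrates is only that the profile data is compiled into the app, so there is no JSON text file to open and parse at run time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: shows how code might read a profile that is
// compiled into the app. Not part of the real library.
class ProfileSketchDemo {
    // Looks up an n-gram count, treating unseen n-grams as zero.
    static int count(String ngram) {
        Integer n = AnProfileSketch.FREQ.get(ngram);
        return n == null ? 0 : n;
    }

    public static void main(String[] args) {
        System.out.println(AnProfileSketch.NAME + " de=" + count("de"));
    }
}

// Hypothetical profile class: the name, fields, and counts are made up
// and do not come from a real Wikipedia-trained profile.
final class AnProfileSketch {
    static final String NAME = "an";

    // The counts are filled in by a static initializer when the class is
    // first used, so no profile file has to be read from storage.
    static final Map<String, Integer> FREQ = new HashMap<String, Integer>();
    static {
        FREQ.put("a", 1200);
        FREQ.put("de", 310);
        FREQ.put("os ", 95);
    }

    private AnProfileSketch() {}
}
```

A real profile would hold many more n-grams than this; the sketch only shows the shape of the approach.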
git clone git@github.com:rmtheis/language-detection.git
See the original project on Google Code.
Set up the language profile list in DetectorFactory.java.
To generate a language profile, download a Wikipedia abstract file to use as a training data set.
For example, for Aragonese (language code an), download anwiki-20121227-abstract.xml from the anwiki dump into
language-detection/abstracts/ and run:
cd language-detection
mkdir abstracts/profiles
java -jar lib/langdetect.jar --genprofile -d abstracts an
python scripts/genprofile.py -i abstracts/profiles/an > AN.java