MTEB: Massive Text Embedding Benchmark
[EMNLP 2023] 💬 Language Identification with Support for More Than 2000 Labels
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
NLP pipelines for Tagalog using spaCy
AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African Languages: https://afrisenti-semeval.github.io/
A Scandinavian Benchmark for sentence embeddings
Code and datasets for the ACL 2021 paper "OntoED: Low-resource Event Detection with Ontology Embedding"
[SIGIR 2023] Schema-aware Reference as Prompt Improves Data-Efficient Knowledge Graph Construction
Repository for NaijaSenti, a Lacuna-funded project to develop a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.
[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
Materials for AACL-IJCNLP-2022 tutorial: Efficient and Robust Knowledge Graph Construction
[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly
Official codebase for the ACL 2025 Findings paper: Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval.
A curated list of awesome sentiment analysis studies in which an attitude is the position that a Subject conveys in the text towards another Object mentioned there, such as an entity or an event.
Awesome Lao Natural Language Processing
Code for paper "ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models"
This repository contains the code, data, and associated models of the paper titled "BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset", accepted in Proceedings of the Asia-Pacific Chapter of the Association for Computational Linguistics: AACL 2022.
Pashto Natural Language Processing Toolkit
Find the best datasets for intermediate fine-tuning