Bengali Harmful Language Dataset for AI Moderation & Safety Research
An open-source Bengali (Bangla) slang and harmful-language dataset built by Amirul Sizan to support AI moderation, NLP safety research, and Bangla content filtering systems.

Artificial Intelligence is evolving fast.
But for Bangla language moderation and safety research, there’s still a massive gap.
Most existing moderation datasets focus heavily on English. Meanwhile, Bengali internet culture continues to grow across Facebook, YouTube, TikTok, gaming communities, and public forums, where slang, toxic language, harassment, coded insults, and culturally contextual harmful phrases are common but rarely documented in structured, AI-ready datasets.
That gap inspired me to build Deshi Slang.
What is Deshi Slang?
Deshi Slang is an open Bengali (Bangla) slang and harmful-language dataset designed for:
- AI moderation systems
- NLP research
- Toxicity detection
- Bangla language safety tools
- Content filtering research
- LLM alignment and moderation experiments
The goal of the project is simple:
make Bangla internet language more understandable for machines.
Because moderation in Bangla is not just about detecting direct abusive words. It also involves:
- regional slang
- phonetic typing
- meme culture
- context-driven insults
- coded harassment
- transliterated and code-mixed Bangla-English text
- evolving internet expressions
Many AI systems, especially those trained mainly on English data, handle these nuances poorly.
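To make the point about phonetic typing and obfuscated spellings concrete, here is a minimal Python sketch comparing an exact-match filter with a lightly normalized one. Everything in it is an assumption for illustration: the single blocklist word is a mild placeholder and is not taken from the Deshi Slang dataset, and the substitution table and the `normalize` helper are hypothetical, not part of the project.

```python
import re
import unicodedata

# Assumed character substitutions people use to dodge filters (illustrative only).
SUBSTITUTIONS = str.maketrans({"@": "a", "0": "o", "1": "i", "$": "s"})


def normalize(token: str) -> str:
    """Reduce a romanized token to a rough canonical key."""
    token = unicodedata.normalize("NFKC", token).lower()
    token = token.translate(SUBSTITUTIONS)
    token = re.sub(r"[^a-z]", "", token)     # drop leftover punctuation and digits
    token = re.sub(r"(.)\1+", r"\1", token)  # collapse stretched letters: "paaagol" -> "pagol"
    return token


# Hypothetical placeholder entry, NOT from the Deshi Slang dataset.
BLOCKLIST = {normalize(word) for word in ["pagol"]}


def naive_match(text: str) -> bool:
    """Exact-match filter: flags text only if a token is literally in the blocklist."""
    return any(tok in BLOCKLIST for tok in text.lower().split())


def normalized_match(text: str) -> bool:
    """Same filter, but each token is normalized before lookup."""
    return any(normalize(tok) in BLOCKLIST for tok in text.split())


if __name__ == "__main__":
    sample = "tui ekta p@gol"
    print(naive_match(sample), normalized_match(sample))
```

Even this helps only with simple obfuscation. Vowel variants such as "pagal", Bengali-script forms, and context-dependent meaning still slip through, which is why a curated dataset of variants is more useful than rules alone.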
Why This Project Matters
Bangla is one of the most spoken languages in the world, yet open moderation-focused resources for Bengali remain extremely limited.
Many AI systems can detect harmful English content fairly well.
But when it comes to Bangla, moderation accuracy tends to drop noticeably because of:
- lack of datasets
- limited labeled data
- informal internet language variations
- spelling inconsistencies
- romanized Bangla usage
As someone deeply involved in digital media, online communities, and AI-driven content ecosystems, I noticed this problem repeatedly.
Platforms can moderate English content at scale.
But Bangla harmful language often bypasses filters completely.
Deshi Slang was created to contribute toward solving that problem.
Project Goals
The project focuses on building an openly accessible dataset that researchers and developers can use to:
Build Safer AI Systems
Train moderation pipelines capable of understanding Bengali harmful expressions and slang patterns.
Improve Bangla NLP Research
Support toxicity classification, hate speech detection, abusive-language analysis, and sentiment-related research.
Encourage Open Research
Provide a public resource for students, developers, researchers, and startups working on Bangla AI tools.
Preserve Contextual Internet Language
Document evolving Bengali internet slang and culturally contextual expressions that are usually ignored in formal datasets.
Key Challenges
Building a harmful-language dataset in Bangla is not straightforward.
Some of the major challenges include:
Language Variability
The same slang term can appear in multiple spellings, both in Bengali script and in romanized Bangla.
Context Sensitivity
Certain words can be humorous in one context and abusive in another.
Regional Differences
Internet communities in Bangladesh and West Bengal often use different slang variants.
Rapid Evolution
Online slang evolves incredibly fast, especially through meme pages and short-form content platforms.
These factors make Bangla moderation research considerably harder than work built on well-standardized English datasets.
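One way to think about these challenges is as fields on a dataset record. The sketch below is a hypothetical schema, not the actual Deshi Slang format, which this post does not describe. The `SlangEntry` class, its field names, the region codes, and the example entry are all assumptions chosen to show how spelling variants, regional usage, and context sensitivity could sit side by side.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SlangEntry:
    """Hypothetical record layout; not the real Deshi Slang schema."""
    canonical: str                                                # Bengali-script form
    romanized_variants: list[str] = field(default_factory=list)   # language variability
    regions: list[str] = field(default_factory=list)              # regional differences
    context_dependent: bool = False                                # humorous vs. abusive by context


def build_variant_index(entries: list[SlangEntry]) -> dict[str, SlangEntry]:
    """Map every known spelling (script or romanized) to its entry."""
    index: dict[str, SlangEntry] = {}
    for entry in entries:
        index[entry.canonical] = entry
        for variant in entry.romanized_variants:
            index[variant.lower()] = entry
    return index


def lookup(index: dict[str, SlangEntry], token: str) -> SlangEntry | None:
    """Return the entry for a token, or None if the spelling is unknown."""
    return index.get(token.lower())


# Mild placeholder entry for illustration only, not taken from the dataset.
ENTRIES = [
    SlangEntry(
        canonical="পাগল",
        romanized_variants=["pagol", "pagal", "paagol"],
        regions=["BD", "WB"],
        context_dependent=True,
    )
]
INDEX = build_variant_index(ENTRIES)

if __name__ == "__main__":
    hit = lookup(INDEX, "Pagal")
    print(hit.canonical if hit else "no match")
```

An entry with `context_dependent` set would call for a contextual classifier or human review rather than an automatic block. Rapid evolution is the one challenge a static record cannot capture, so a real dataset would also need versioning or timestamps.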
Open Source Contribution
I believe language technology should not be limited to large corporations.
That’s why Deshi Slang is fully open-source and publicly accessible to the developer and research community.
The repository is intended to evolve continuously through:
- community contributions
- dataset expansion
- better categorization
- multilingual mapping
- moderation experimentation
Tech & Research Relevance
This project can be useful for:
- AI startups
- moderation platforms
- LLM fine-tuning
- academic NLP research
- Bangla chatbot development
- social media monitoring systems
- safety-focused AI products
It also highlights an important reality:
AI safety cannot become globally effective if low-resource languages are ignored.
Repository
Explore the project here:
Final Thoughts
The future of AI moderation should not be English-only.
As Bangla digital communities continue to grow, we need better datasets, better research, and better language-aware safety systems.
Deshi Slang is a small contribution toward that future.
