Amirul Sizan
Open Source

Bengali Harmful Language Dataset for AI Moderation & Safety Research

An open-source Bengali (Bangla) slang and harmful-language dataset built by Amirul Sizan to support AI moderation, NLP safety research, and Bangla content filtering systems.

Client: N/A
Type: Other
Timeline: 3 Days
Artificial Intelligence is evolving fast.

But for Bangla language moderation and safety research, there’s still a massive gap.

Most existing moderation datasets focus heavily on English. Meanwhile, Bengali internet culture continues to grow across Facebook, YouTube, TikTok, gaming communities, and public forums, where slang, toxic language, harassment, coded insults, and culturally contextual harmful phrases are common but rarely documented in structured, AI-ready datasets.

That gap inspired me to build Deshi Slang.

What is Deshi Slang?

Deshi Slang GitHub Repository

Deshi Slang is an open Bengali (Bangla) slang and harmful-language dataset designed for:

  • AI moderation systems
  • NLP research
  • Toxicity detection
  • Bangla language safety tools
  • Content filtering research
  • LLM alignment and moderation experiments

The goal of the project is simple:

make Bangla internet language more understandable for machines.

Because moderation in Bangla is not just about detecting direct abusive words. It also involves:

  • regional slang
  • phonetic typing
  • meme culture
  • context-driven insults
  • coded harassment
  • transliterated Bangla-English mixed language
  • evolving internet expressions

Most AI systems fail to understand these nuances properly.

Why This Project Matters

Bangla is one of the most spoken languages in the world, yet open moderation-focused resources for Bengali remain extremely limited.

Many AI systems can detect harmful English content fairly well.

But when it comes to Bangla, moderation accuracy drops significantly because of:

  • lack of datasets
  • limited labeled data
  • informal internet language variations
  • spelling inconsistencies
  • romanized Bangla usage
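
Because the same message may arrive in Bengali script, in romanized form, or as a mix of both, a pipeline usually has to identify the script before it can do anything else. The following is a minimal sketch of that step, not code from the Deshi Slang repository. The function name detect_script is hypothetical, and the only fact it relies on is that the Unicode Bengali block spans U+0980 to U+09FF.

```python
def detect_script(text: str) -> str:
    """Classify text as 'bengali', 'latin', 'mixed', or 'other'.

    Hypothetical helper for illustration only.
    """
    # Count characters that fall inside the Unicode Bengali block.
    bengali = sum(1 for ch in text if "\u0980" <= ch <= "\u09ff")
    # Count ASCII letters, which is how romanized Bangla is typed.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())

    if bengali and latin:
        return "mixed"
    if bengali:
        return "bengali"
    if latin:
        return "latin"
    return "other"


if __name__ == "__main__":
    # Harmless sample sentence ("I am fine") in three forms.
    for sample in ["আমি ভালো আছি", "ami bhalo achi", "ami ভালো achi"]:
        print(sample, "->", detect_script(sample))
```

A real system would likely route each class to a different normalizer or model; this sketch only shows the routing decision itself.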

As someone deeply involved in digital media, online communities, and AI-driven content ecosystems, I noticed this problem repeatedly.

Platforms can moderate English content at scale.

But harmful Bangla language often slips past those filters entirely.

Deshi Slang was created to contribute toward solving that problem.

Project Goals

The project focuses on building an openly accessible dataset that researchers and developers can use to:

Build Safer AI Systems

Train moderation pipelines capable of understanding Bengali harmful expressions and slang patterns.
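
As a rough illustration of where a term list could sit in such a pipeline, the sketch below runs a first-pass lexicon match over incoming text. It is an assumption-laden example, not the project's method: the dataset's file format is not described here, so the lexicon is a hard-coded set of placeholder strings, and the names FLAGGED_TERMS, TOKEN_RE, and flag_terms are all hypothetical.

```python
import re

# Placeholder entries only. Real terms would be loaded from the dataset,
# whose actual file format is not described in this write-up.
FLAGGED_TERMS = {"placeholder_term_a", "placeholder_term_b"}

# A token is a run of Bengali-block characters or a run of ASCII
# letters, digits, and underscores.
TOKEN_RE = re.compile(r"[\u0980-\u09FF]+|[A-Za-z0-9_]+")


def flag_terms(text: str, lexicon: set[str]) -> list[str]:
    """Return the tokens of `text` that appear in `lexicon`, in order."""
    # Lowercasing affects only the Latin tokens; Bengali has no case.
    tokens = (tok.lower() for tok in TOKEN_RE.findall(text))
    return [tok for tok in tokens if tok in lexicon]


if __name__ == "__main__":
    print(flag_terms("hello Placeholder_Term_A world", FLAGGED_TERMS))
```

Exact matching like this is only a baseline. It misses spelling variants and ignores context, which is why a lexicon is typically paired with normalization and a trained classifier.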

Improve Bangla NLP Research

Support toxicity classification, hate speech detection, abusive-language analysis, and sentiment-related research.

Encourage Open Research

Provide a public resource for students, developers, researchers, and startups working on Bangla AI tools.

Preserve Contextual Internet Language

Document evolving Bengali internet slang and culturally contextual expressions that are usually ignored in formal datasets.

Key Challenges

Building a harmful-language dataset in Bangla is not straightforward.

Some of the major challenges include:

Language Variability

The same slang can appear in multiple spellings across Bengali script and Romanized Bangla.
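
One common way to handle this is to map every known spelling to a single canonical key before matching. The sketch below shows the idea with a harmless word, "ভালো" ("good"), and a few of its romanized spellings. The variant table is invented for illustration and is not taken from the dataset, and the names RAW_VARIANTS, VARIANTS, and normalize_token are hypothetical.

```python
import re
import unicodedata

# Hypothetical variant table built around a harmless word ("good").
# Keys are spellings seen in the wild; values are one canonical key.
RAW_VARIANTS = {
    "ভালো": "bhalo",
    "ভাল": "bhalo",
    "valo": "bhalo",
    "vhalo": "bhalo",
    "bhalo": "bhalo",
}

# Normalize the keys to NFC so lookups do not depend on how the
# Bengali vowel signs happened to be encoded in the source text.
VARIANTS = {unicodedata.normalize("NFC", k): v for k, v in RAW_VARIANTS.items()}


def normalize_token(token: str) -> str:
    """Map a token to its canonical key, or return it cleaned if unknown."""
    token = unicodedata.normalize("NFC", token).lower()
    # Collapse runs of three or more identical characters to one,
    # so stretched spellings like "bhaloooo" become "bhalo".
    token = re.sub(r"(.)\1{2,}", r"\1", token)
    return VARIANTS.get(token, token)


if __name__ == "__main__":
    for raw in ["Valo", "bhaloooo", "ভালো", "xyz"]:
        print(raw, "->", normalize_token(raw))
```

A lookup table only covers spellings someone has already recorded, so in practice it would probably be combined with fuzzy matching or a learned transliteration model.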

Context Sensitivity

Certain words can be humorous in one context and abusive in another.

Regional Differences

Bangladesh and West Bengal internet cultures often use different slang variations.

Rapid Evolution

Online slang evolves incredibly fast, especially through meme pages and short-form content platforms.

These factors make Bangla moderation research significantly harder than comparable work built on well-standardized English datasets.

Open Source Contribution

I believe language technology should not be limited to large corporations.

That’s why Deshi Slang is fully open-source and publicly accessible for the developer and research community.

The repository is intended to evolve continuously through:

  • community contributions
  • dataset expansion
  • better categorization
  • multilingual mapping
  • moderation experimentation

Tech & Research Relevance

This project can be useful for:

  • AI startups
  • moderation platforms
  • LLM fine-tuning
  • academic NLP research
  • Bangla chatbot development
  • social media monitoring systems
  • safety-focused AI products

It also highlights an important reality:

AI safety cannot become globally effective if low-resource languages are ignored.

Repository

Explore the project here:

GitHub — Deshi Slang Dataset

Final Thoughts

The future of AI moderation should not be English-only.

As Bangla digital communities continue to grow, we need better datasets, better research, and better language-aware safety systems.

Deshi Slang is a small contribution toward that future.

Amirul Sizan

Digital Creator & Designer

Play with Words, Play with Design, Play with Algorithm, Haha!

Client Feedback

What they say about this project

"Team player, very creative in design and visual storytelling."
Nazmul Razu, Founder, CEO - CHS Education

"He is a problem solver, through his design, idea, leadership."
Mahiuddin Sohel, Founder - Priyo Shikkhaloy

"Working with Sizan bhai is fun!"
Shah Jalal Alif, Co-founder - Maroon Inc