Qusaii
1 min read · Apr 29, 2021

Kaggle Toxic Comment Classification.

In a recent Kaggle competition, participants were invited to explore statistical techniques that could improve the detection abilities of the Perspective tool. The dataset consisted of numerous comments taken from Wikipedia talk page edits, labelled by human raters for the following types of toxicity: ‘toxic’, ‘severe toxic’, ‘obscene’, ‘threat’, ‘insult’ and ‘identity hate’. In a supervised learning framework, participants were asked to build a model that estimates, for each comment, the probability of it belonging to each type of toxicity.

Our project aims to help improve online conversation by exploring various Machine Learning, Deep Learning and NLP methods to build an accurate model that’s capable of detecting diverse types of negative online comments perceived as toxic.

The project will proceed through the following steps.

  1. Data Preprocessing:
  • Noise Removal
  • Lexicon Normalization
  • Lemmatization
  • Stemming
  2. Feature Extraction:
  • Statistical features
  • TF-IDF
  3. Methods of Model Building:
  • Machine Learning: Logistic Regression, Naive Bayes, Random Forest, XGBoost
  • Deep Learning: LSTM
  4. Model Evaluation:
  • ROC curves
  • Hamming score for Multilabel Evaluation
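To make the preprocessing step concrete, here is a minimal noise-removal sketch. The function name `clean_comment` and the exact cleaning rules (strip URLs, keep letters only, collapse whitespace) are my own illustrative choices, not the project's final pipeline; lexicon normalization (lemmatization or stemming) would typically be layered on top with a library such as NLTK.

```python
import re

def clean_comment(text: str) -> str:
    """Basic noise removal for a raw comment (illustrative rules only)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_comment("Check this out!!! http://spam.com NOW"))
# → check this out now
```

After cleaning, each comment is a space-separated string of lowercase tokens, ready for tokenization and vectorization.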
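For the feature-extraction step, the TF-IDF weighting can be sketched from first principles: a term's weight in a document is its term frequency times the log of (number of documents / number of documents containing the term). This is a bare-bones illustration of the formula; in practice a library vectorizer (e.g. scikit-learn's `TfidfVectorizer`, which uses a smoothed variant of the IDF) would be used instead.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["you", "are", "toxic"], ["you", "are", "kind"]]
print(tfidf(docs)[0])
```

Note that a term appearing in every document ("you", "are") gets weight 0, while discriminative terms ("toxic", "kind") keep positive weight, which is exactly the behaviour TF-IDF is chosen for.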
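For the evaluation step, the Hamming score for multilabel problems can be computed per sample as the size of the intersection of predicted and true label sets divided by the size of their union, averaged over all samples (one common convention; some texts instead report 1 minus the Hamming loss). A minimal sketch under that assumption:

```python
def hamming_score(y_true, y_pred):
    """y_true, y_pred: lists of label sets, one set per sample.

    Per-sample score = |intersection| / |union|, averaged over samples.
    A sample with no true and no predicted labels scores 1.0.
    """
    total = 0.0
    for true, pred in zip(y_true, y_pred):
        if not true and not pred:
            total += 1.0
        else:
            total += len(true & pred) / len(true | pred)
    return total / len(y_true)

y_true = [{"toxic", "insult"}, {"obscene"}]
y_pred = [{"toxic"}, {"obscene"}]
print(hamming_score(y_true, y_pred))
# → 0.75
```

Unlike plain subset accuracy, this metric gives partial credit when only some of a comment's toxicity labels are predicted correctly, which suits the six-label setup of this competition.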