Projects

Author: Abebawu Yigezu

Experience: Machine Learning, Natural Language Processing (NLP), and Data Science, with a focus on AdTech, data analytics, and data engineering. I have led and contributed to projects involving real-time data processing, campaign optimization, and AI-driven solutions in the advertising technology space.

Transformer (BERT, RoBERTa, Transformer-XL, DistilBERT, XLNet, XLM) for Text Classification

The Transformer was proposed in the paper Attention Is All You Need. A TensorFlow implementation of it is available as part of the Tensor2Tensor package. The motivation behind the Transformer is to deal with practical problems of the popular sequence-to-sequence (Seq2Seq) models, RNNs and their variants such as GRU and LSTM. Despite the strong contributions of Seq2Seq models, they have certain limitations:

- Computation is inherently sequential, token by token, which prevents parallelization within a sequence during training.
- Long-range dependencies are hard to capture, since information must pass through many recurrent steps and gradients tend to vanish.
- The encoder compresses the whole input into fixed-size hidden states, which becomes a bottleneck for long sequences.

In addition to the original paper, I recommend reading https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/

Objective of this notebook

I will use the dataset from: Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." Proceedings of the 11th International AAAI Conference on Web and Social Media (https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/15665)

STEPS

import pandas as pd
import numpy as np
import pickle

# Load the labelled tweets (CSV version of the Davidson et al. dataset)
tweet = pd.read_csv("../data/labeled_data.csv")
tweet.head()
Unnamed: 0 count hate_speech offensive_language neither class tweet
0 0 3 0 0 3 2 !!! RT @mayasolovely: As a woman you shouldn't...
1 1 3 0 3 0 1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2 2 3 0 3 0 1 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3 3 3 0 2 1 1 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4 4 6 0 6 0 1 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...

The data are stored as a CSV and as a pickled pandas DataFrame. Each data file contains 6 columns:

- count: total number of CrowdFlower annotators who rated the tweet
- hate_speech: number of annotators who judged the tweet to be hate speech
- offensive_language: number of annotators who judged it to be offensive but not hate speech
- neither: number of annotators who judged it to be neither
- class: majority label (0 = hate speech, 1 = offensive language, 2 = neither)
- tweet: the raw tweet text

Since the aim of the project is to classify hate speech, I will use only the last two columns (class and tweet).

# Keep only the tweet text and its majority-vote class label
text = tweet['tweet']
label = tweet['class']
print(text.shape, label.shape)
(24783,) (24783,)
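
Before modelling, it is worth checking the label distribution (an added sanity check, not part of the original flow; this dataset is known to be heavily skewed toward class 1, offensive language):

# Count how many tweets fall into each class
print(label.value_counts())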

Defining a Transformer basically requires four components: embedding, positional encoding, masking, and a prediction layer (usually a feed-forward NN). However, instead of creating the architecture from scratch, we can build on an existing implementation, as sketched below.
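
As a quick illustration of one of these components, here is a minimal NumPy sketch of the sinusoidal positional encoding from Attention Is All You Need (the function name and shapes are my own choices, not from this project):

import numpy as np

def positional_encoding(max_len, d_model):
    # pos / 10000^(2i/d_model): one angle per (position, dimension) pair
    pos = np.arange(max_len)[:, np.newaxis]    # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]      # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions use cosine
    return pe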

I will create a class that uses the simplerepresentations wrapper library to extract transformer representations.

import logging
logging.basicConfig(level=logging.INFO)
from tensorflow.python.keras.utils.data_utils import Sequence
from simplerepresentations import RepresentationModel

class DataGenerator(Sequence):
    """Keras Sequence that turns raw tweets into transformer representations batch by batch."""
    def __init__(self, representation_model, tweet, labels, batch_size, token_level=True):
        self.representation_model = representation_model
        self.tweet = tweet
        self.labels = labels
        self.batch_size = batch_size
        self.token_level = token_level

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.tweet) / float(self.batch_size)))

    def tweet_generator(self, idx, tweet):
        # Slice out the idx-th batch of raw tweets
        tweet_batch = np.array(tweet[idx * self.batch_size:(idx + 1) * self.batch_size])

        # simplerepresentations returns sentence-level and token-level representations
        tweet_sen_batch, tweet_tok_batch = self.representation_model(tweet_batch)

        if self.token_level:
            tweet_batch = tweet_tok_batch   # (batch, max_seq_length, hidden_size)
        else:
            tweet_batch = tweet_sen_batch   # (batch, hidden_size)

        return tweet_batch

    def __getitem__(self, idx):
        tweet_batch = self.tweet_generator(idx, self.tweet)
        labels_batch = np.array(self.labels[idx * self.batch_size:(idx + 1) * self.batch_size])
        return tweet_batch, labels_batch

# Choose one supported model type ('bert', 'xlnet', 'xlm', 'roberta', or 'distilbert')
# and a matching pretrained checkpoint; 'bert-base-uncased' is used here as an example.
model_name = 'bert-base-uncased'
representation_model = RepresentationModel(
    model_type='bert',
    model_name=model_name,
    batch_size=64,
    max_seq_length=60,        # truncate tweets to at most 60 tokens
    combination_method='cat', # concatenate the last `last_hidden_to_use` hidden states
    last_hidden_to_use=1,     # use only the final hidden state to build token representations
    verbose=0
)
train_generator = DataGenerator(representation_model, text, label, 64)

INFO:transformers.file_utils:PyTorch version 1.1.0 available.
INFO:transformers.file_utils:TensorFlow version 2.0.0-alpha0 available.
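
With the generator ready, a simple downstream classifier can consume it. The following is a minimal sketch of such a head (the architecture and hyperparameters are illustrative choices, not from the original notebook); with token_level=True each batch has shape (batch, max_seq_length, hidden_size), so the token axis is pooled before the dense layers:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling1D(),       # average over the token axis
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation='softmax')  # 0 = hate, 1 = offensive, 2 = neither
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_generator, epochs=3)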