Data Science Projects

Abebawu Eshetu

Author: Abebawu Eshetu

Research Interests: Natural Language Processing, Machine Learning, and Computer Vision for Social Good.

Airline Sentiment Analysis Project

Project Objective

Data Analysis

import pandas as pd  # for reading and understanding data
import matplotlib.pyplot as plt  # for plotting data
import seaborn as sns  # another library for visualizing data features
import numpy as np  # for numerical array processing
# read the data
data = pd.read_csv('twitter-airline/Tweets.csv')
data[['airline_sentiment', 'negativereason', 'airline', 'retweet_count', 'tweet_created']].head()

| | airline_sentiment | negativereason | airline | retweet_count | tweet_created |
|---|---|---|---|---|---|
| 0 | neutral | NaN | Virgin America | 0 | 2015-02-24 11:35:52 -0800 |
| 1 | positive | NaN | Virgin America | 0 | 2015-02-24 11:15:59 -0800 |
| 2 | neutral | NaN | Virgin America | 0 | 2015-02-24 11:15:48 -0800 |
| 3 | negative | Bad Flight | Virgin America | 0 | 2015-02-24 11:15:36 -0800 |
| 4 | negative | Can't Tell | Virgin America | 0 | 2015-02-24 11:14:45 -0800 |

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 8 columns):
tweet_id                        14640 non-null int64
text                            14640 non-null object
airline_sentiment               14640 non-null object
airline_sentiment_confidence    14640 non-null float64
negativereason                  9178 non-null object
airline                         14640 non-null object
retweet_count                   14640 non-null int64
tweet_created                   14640 non-null object
dtypes: float64(1), int64(2), object(5)
memory usage: 915.1+ KB
sentiments = pd.crosstab(data.airline, data.airline_sentiment)  # tweet counts per airline and sentiment
sentiments

| airline | negative | neutral | positive |
|---|---|---|---|
| American | 1960 | 463 | 336 |
| Delta | 955 | 723 | 544 |
| Southwest | 1186 | 664 | 570 |
| US Airways | 2263 | 381 | 269 |
| United | 2633 | 697 | 492 |
| Virgin America | 181 | 171 | 152 |

negative_tweet=data[(data['airline_sentiment']=='negative')]
negative_tweet[['airline','negativereason','text']].head()

| | airline | negativereason | text |
|---|---|---|---|
| 3 | Virgin America | Bad Flight | @VirginAmerica it's really aggressive to blast... |
| 4 | Virgin America | Can't Tell | @VirginAmerica and it's a really big bad thing... |
| 5 | Virgin America | Can't Tell | @VirginAmerica seriously would pay $30 a fligh... |
| 15 | Virgin America | Late Flight | @VirginAmerica SFO-PDX schedule is still MIA. |
| 17 | Virgin America | Bad Flight | @VirginAmerica I flew from NYC to SFO last we... |

negative_tweet.airline.value_counts()  # count negative tweets per airline to identify the worst-rated airline of 2015
United            2633
US Airways        2263
American          1960
Southwest         1186
Delta              955
Virgin America     181
Name: airline, dtype: int64
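
Raw counts favor airlines that receive more tweets overall, so it may help to compare the share of each airline's tweets that are negative. The following is a minimal sketch, assuming the counts from the airline-by-sentiment crosstab are re-entered by hand; the names `counts`, `shares`, and `negative_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the airline-by-sentiment crosstab shown earlier.
counts = pd.DataFrame(
    {
        "negative": [1960, 955, 1186, 2263, 2633, 181],
        "neutral": [463, 723, 664, 381, 697, 171],
        "positive": [336, 544, 570, 269, 492, 152],
    },
    index=["American", "Delta", "Southwest", "US Airways", "United", "Virgin America"],
)

# Divide each row by its total so every airline's shares sum to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

# Rank airlines by the fraction of their tweets that are negative.
negative_share = shares["negative"].sort_values(ascending=False)
print(negative_share.round(3))
```

With these counts, US Airways has the highest negative share (about 78%), ahead of American (about 71%) and United (about 69%), even though United has the most negative tweets in absolute terms. On the full data, `pd.crosstab(data.airline, data.airline_sentiment, normalize='index')` should give the same shares directly.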

Most common words in negative tweets

from wordcloud import WordCloud
def plotWords(words):
    wordcloud=WordCloud(width=1200, height=600, random_state=21,max_font_size=110).generate(words)
    plt.figure(figsize=(10,7))
    plt.imshow(wordcloud,interpolation="bilinear")
    plt.axis('off')
    plt.show()
neg_tweet_words=negative_tweet.text.values.tolist()
neg_words=' '.join([text for text in neg_tweet_words])
plotWords(neg_words)

[Figure: word cloud of the most common words in negative tweets]

The word cloud shows which airlines are mentioned most often in negative tweets, along with words that hint at the reasons for the complaints.

Let's look at the positive tweets to understand which services customers are most satisfied with.

positive_tweet = data[(data['airline_sentiment'] == 'positive')]
pos_tweet_words = positive_tweet.text.values.tolist()
pos_words=' '.join([text for text in pos_tweet_words])
plotWords(pos_words)

[Figure: word cloud of the most common words in positive tweets]

Words such as appreciate, good, thanks, really, great, amazing, best, nice, and happy dominate the positive tweets, indicating the aspects of service that customers are satisfied with.
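
A word cloud is hard to read precisely, and airline handles such as `@united` tend to dominate it. A plain frequency count can complement it. The sketch below is one possible approach, not the notebook's method: the sample tweets, the tiny stop-word list, and the helper `top_words` are all hypothetical stand-ins for the real tweet text.

```python
import re
from collections import Counter

# Hypothetical sample tweets standing in for negative_tweet.text.
sample_tweets = [
    "@united flight delayed again, no customer service",
    "@USAirways cancelled flight and no help from customer service",
    "@united lost my bag and the flight was late",
]

# Small illustrative stop-word list; a real analysis would use a fuller one.
stop_words = {"and", "the", "my", "was", "no", "from", "again"}


def top_words(tweets, n=5):
    """Count the most frequent words, skipping @handles and stop words."""
    counter = Counter()
    for tweet in tweets:
        text = re.sub(r"@\w+", " ", tweet.lower())  # drop airline handles
        words = re.findall(r"[a-z']+", text)
        counter.update(w for w in words if w not in stop_words)
    return counter.most_common(n)


print(top_words(sample_tweets, n=3))
```

The same helper could be called on `negative_tweet.text` or `positive_tweet.text` to list the leading words behind each word cloud.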

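def plot_bar(title, x_label, y_label, data):
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.tick_params(axis='x', labelsize=12)
    ax.tick_params(axis='y', labelsize=12)
    ax.set_title(title, fontsize=15, fontweight='bold')
    data.plot(kind='bar', ax=ax)  # draw on the axes created above
    # set the axis labels after plotting so they are not overwritten
    ax.set_xlabel(x_label, fontsize=12)
    ax.set_ylabel(y_label, fontsize=12)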
reason_count=negative_tweet['negativereason'].value_counts()
_=reason_count.plot(kind='bar')

[Figure: bar chart of negative-reason counts]

airline_neg_reason=negative_tweet.groupby('airline')['negativereason'].value_counts()
def plot_sns(x, y, data):
    sns.set(rc={'figure.figsize': (10, 10)})
    ax = sns.countplot(y=y, hue=x, data=data)
    # countplot(y=...) draws horizontal bars, so the count is each bar's width
    for p in ax.patches:
        count = p.get_width()
        if np.isnan(count):
            count = 0
        ax.annotate('{}'.format(int(count)), (count, p.get_y() + p.get_height() / 2), ha='left', va='center', xytext=(3, 0), textcoords='offset points')
    plt.title("Distribution of negative reasons for each airline")
    plt.show()
plot_sns('negativereason','airline',negative_tweet)


![png](/neg_service.png)


The plot and table above show that United, US Airways, and American appear to have worse service than Delta, Virgin America, and Southwest, judged by the number of negative tweets. Apart from Delta and Virgin America, the other four airlines are frequently criticized for customer service, and United and US Airways are also often reported as late. Comparatively, Virgin America fares better than the others, with Delta the next best choice.
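
To compare airlines with very different tweet volumes, the reasons can also be expressed as a share of each airline's negative tweets. This is a minimal sketch on a made-up frame; the `sample` data and the name `reason_share` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows standing in for negative_tweet[['airline', 'negativereason']].
sample = pd.DataFrame(
    {
        "airline": ["United", "United", "United", "United", "Delta", "Delta"],
        "negativereason": [
            "Late Flight", "Customer Service Issue", "Customer Service Issue",
            "Lost Luggage", "Late Flight", "Late Flight",
        ],
    }
)

# Share of each reason within an airline: one row per airline, one column per reason.
reason_share = (
    sample.groupby("airline")["negativereason"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(reason_share.round(2))
```

Replacing `sample` with `negative_tweet` should produce the per-airline breakdown that the count plot shows, but as proportions instead of raw counts.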

# Does flight time relate to the negative reason?

We will focus on the top three airlines by negative sentiment.


```python
# time-based analysis
data['tweet_created'] = data['tweet_created'].astype('datetime64[ns]')  # convert the column to datetime
data[['airline', 'airline_sentiment', 'tweet_created', 'negativereason']].tail()
```

| | airline | airline_sentiment | tweet_created | negativereason |
|---|---|---|---|---|
| 14635 | American | positive | 2015-02-22 20:01:01 | NaN |
| 14636 | American | negative | 2015-02-22 19:59:46 | Customer Service Issue |
| 14637 | American | neutral | 2015-02-22 19:59:15 | NaN |
| 14638 | American | negative | 2015-02-22 19:59:02 | Customer Service Issue |
| 14639 | American | neutral | 2015-02-22 19:58:51 | NaN |

```python
data['tweet_created_date'] = data.tweet_created.dt.date
data['tweet_created_weekday_name'] = data.tweet_created.dt.day_name()  # dt.weekday_name was removed in pandas 1.0
data['tweet_created_hour'] = data.tweet_created.dt.hour
data[['airline', 'airline_sentiment', 'tweet_created_weekday_name', 'tweet_created_hour']].tail()
```

| | airline | airline_sentiment | tweet_created_weekday_name | tweet_created_hour |
|---|---|---|---|---|
| 14635 | American | positive | Sunday | 20 |
| 14636 | American | negative | Sunday | 19 |
| 14637 | American | neutral | Sunday | 19 |
| 14638 | American | negative | Sunday | 19 |
| 14639 | American | neutral | Sunday | 19 |
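
Because the raw `tweet_created` strings carry a UTC offset (for example `-0800`), the weekday and hour that get extracted can depend on which time zone the timestamps end up in. The sketch below assumes US Pacific time is the zone of interest, which the original notebook does not state, and uses a made-up second timestamp to show a case where the weekday differs.

```python
import pandas as pd

# First value is in the tweet_created format shown earlier; the second is made up.
raw = pd.Series(["2015-02-24 11:35:52 -0800", "2015-02-22 23:30:00 -0800"])

# Parse with the offset honored; utc=True yields timezone-aware UTC timestamps.
utc_times = pd.to_datetime(raw, utc=True)

# Convert to the zone of interest before extracting calendar features.
local_times = utc_times.dt.tz_convert("America/Los_Angeles")

features = pd.DataFrame(
    {
        "utc_weekday": utc_times.dt.day_name(),
        "utc_hour": utc_times.dt.hour,
        "local_weekday": local_times.dt.day_name(),
        "local_hour": local_times.dt.hour,
    }
)
print(features)
```

For the second timestamp the UTC view falls on Monday morning while the local view is still Sunday night, so the choice of zone can move tweets between weekday and hour buckets.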

Negative reasons by day of the week: which day's flights draw the most negative tweets?

negative_tweet=data[(data['airline_sentiment']=='negative')]
neg_by_wkday = negative_tweet.groupby(['tweet_created_weekday_name']).negativereason.value_counts()
neg_by_wkday = neg_by_wkday.unstack().plot(kind='line',figsize=(10,5),rot=0,title="Negative Reasons by Day of Week")
neg_by_wkday.set_xlabel("Day of Week")
neg_by_wkday.set_ylabel("Negative Reason")
Text(0, 0.5, 'Negative Reason')

[Figure: line plot of negative reasons by day of week]

Going by when the tweets were posted, the plot suggests that Wednesday, Thursday, Friday, and Saturday flights are comparatively good. Monday, Sunday, and Tuesday flights draw the most complaints about customer service and late flights (the green line also shows that complaints about cancelled flights are highest on Monday, Sunday, and Tuesday).
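
Grouping by weekday name sorts the days alphabetically, so the x axis of such a plot runs Friday, Monday, Saturday, and so on unless it is reordered. A minimal sketch of putting the days in calendar order, using made-up counts (the `counts` values and the name `weekday_order` are illustrative):

```python
import pandas as pd

# Hypothetical counts indexed by weekday name, in the alphabetical order
# that groupby produces by default.
counts = pd.Series(
    [12, 30, 9, 25, 14, 28, 11],
    index=["Friday", "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"],
)

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Reindex so the rows, and therefore a plot's x axis, follow the calendar.
ordered = counts.reindex(weekday_order)
print(ordered)
```

The same `reindex(weekday_order)` call could be applied to the unstacked weekday table before calling `.plot(...)`.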

neg_by_time = negative_tweet.groupby(['tweet_created_hour']).negativereason.value_counts()

neg_by_time = neg_by_time.unstack().plot(kind='line',figsize=(10, 5),title="Negative Reasons by Hour")
neg_by_time.set_xlabel("Time")
neg_by_time.set_ylabel("Negative Reason")
Text(0, 0.5, 'Negative Reason')

[Figure: line plot of negative reasons by hour]

The time-based analysis offers a useful view for optimizing airline service.

Flights in the 12:00 AM to 3:00 AM and 4:00 PM to 6:00 PM ranges, judged by tweet time, show the highest customer dissatisfaction.
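
Hourly counts of negative tweets also rise whenever more people are tweeting in general, so a per-hour negative rate may be a fairer measure of dissatisfaction. The following sketch uses a made-up frame; `sample` and `negative_rate` are hypothetical names, and the numbers are not results from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for data[['tweet_created_hour', 'airline_sentiment']].
sample = pd.DataFrame(
    {
        "tweet_created_hour": [1, 1, 1, 1, 17, 17, 9, 9, 9, 9],
        "airline_sentiment": [
            "negative", "negative", "negative", "neutral",
            "negative", "positive",
            "negative", "neutral", "positive", "positive",
        ],
    }
)

# Fraction of tweets in each hour that are negative.
is_negative = sample["airline_sentiment"].eq("negative")
negative_rate = is_negative.groupby(sample["tweet_created_hour"]).mean()
print(negative_rate)
```

Applied to the full `data` frame, the same two lines would give the share of negative tweets for each hour of the day.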