Airline Sentiment Analysis Project

Project Objective

Analysing data to visualize airline trends:
- What do the most positive or negative tweets about each airline's service talk about?
- Does the time of the flight affect service quality?
- Which reasons do customers most commonly tweet about for bad service?
- Counting retweets of negative tweets to show which service issues have the most impact.

Classifying whether the sentiment of each tweet is positive, neutral, or negative using machine learning techniques, then categorizing negative tweets by their reason.

Data Analysis

import pandas as pd ## for reading and understanding data
import matplotlib.pyplot as plt ## for plotting data
import seaborn as sns ## another library to visualize data features
import numpy as np ## for numerical array processing

##reading data
data=pd.read_csv('twitter-airline/Tweets.csv')

data[['airline_sentiment','negativereason','airline','tweet_created']].head()

| | airline_sentiment | negativereason | airline | tweet_created |
|---|---|---|---|---|
| 0 | neutral | NaN | Virgin America | 2015-02-24 11:35:52 -0800 |
| 1 | positive | NaN | Virgin America | 2015-02-24 11:15:59 -0800 |
| 2 | neutral | NaN | Virgin America | 2015-02-24 11:15:48 -0800 |
| 3 | negative | Bad Flight | Virgin America | 2015-02-24 11:15:36 -0800 |
| 4 | negative | Can't Tell | Virgin America | 2015-02-24 11:14:45 -0800 |

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 8 columns):
tweet_id                        14640 non-null int64
text                            14640 non-null object
airline_sentiment               14640 non-null object
airline_sentiment_confidence    14640 non-null float64
negativereason                  9178 non-null object
airline                         14640 non-null object
retweet_count                   14640 non-null int64
tweet_created                   14640 non-null object
dtypes: float64(1), int64(2), object(5)
memory usage: 915.1+ KB

sentiments=pd.crosstab(data.airline, data.airline_sentiment)
sentiments

| airline | negative | neutral | positive |
|---|---|---|---|
| American | 1960 | 463 | 336 |
| Delta | 955 | 723 | 544 |
| Southwest | 1186 | 664 | 570 |
| US Airways | 2263 | 381 | 269 |
| United | 2633 | 697 | 492 |
| Virgin America | 181 | 171 | 152 |
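
Raw counts favour airlines that receive fewer tweets overall, so it can help to compare the share of each sentiment per airline as well. The sketch below rebuilds the table by hand (the `counts` frame and the other names are introduced for illustration and are not part of the original notebook) and divides each row by its total. On the full dataset, `pd.crosstab(data.airline, data.airline_sentiment, normalize='index')` should give the same proportions.

```python
import pandas as pd

# Counts copied from the crosstab above; `counts` is a hypothetical name for this sketch
counts = pd.DataFrame(
    {
        "negative": [1960, 955, 1186, 2263, 2633, 181],
        "neutral": [463, 723, 664, 381, 697, 171],
        "positive": [336, 544, 570, 269, 492, 152],
    },
    index=["American", "Delta", "Southwest", "US Airways", "United", "Virgin America"],
)

# Divide each row by its row total to get per-airline sentiment shares
shares = counts.div(counts.sum(axis=1), axis=0)

# Rank airlines by the share of their tweets that are negative
negative_share = shares["negative"].sort_values(ascending=False).round(3)
print(negative_share)
```

By share, US Airways (about 78%) and American (about 71%) rank above United (about 69%), even though United has the most negative tweets in absolute terms.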

negative_tweet=data[(data['airline_sentiment']=='negative')]
negative_tweet[['airline','negativereason','text']].head()

| | airline | negativereason | text |
|---|---|---|---|
| 3 | Virgin America | Bad Flight | @VirginAmerica it's really aggressive to blast... |
| 4 | Virgin America | Can't Tell | @VirginAmerica and it's a really big bad thing... |
| 5 | Virgin America | Can't Tell | @VirginAmerica seriously would pay $30 a fligh... |
| 15 | Virgin America | Late Flight | @VirginAmerica SFO-PDX schedule is still MIA. |
| 17 | Virgin America | Bad Flight | @VirginAmerica I flew from NYC to SFO last we... |

negative_tweet.airline.value_counts() # count negative tweets per airline to identify the worst-rated airline in the 2015 data

United            2633
US Airways        2263
American          1960
Southwest         1186
Delta              955
Virgin America     181
Name: airline, dtype: int64

Most common words in negative tweets

from wordcloud import WordCloud
def plotWords(words):
    wordcloud=WordCloud(width=1200, height=600, random_state=21,max_font_size=110).generate(words)
    plt.figure(figsize=(10,7))
    plt.imshow(wordcloud,interpolation="bilinear")
    plt.axis('off')
    plt.show()

neg_tweet_words=negative_tweet.text.values.tolist()
neg_words=' '.join(neg_tweet_words) ## join all negative tweets into one string
plotWords(neg_words)

[Figure: word cloud of the most frequent words in negative tweets]

The word cloud shows which airlines are mentioned most often in negative tweets, along with the words that point to the reasons for the negativity.
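
A word cloud is hard to read precisely, so a plain frequency count can back it up. The following is a minimal sketch using only the standard library. The helper `top_words`, the tiny stop-word set, and the sample tweets are illustrative assumptions; in the notebook the input would be `neg_tweet_words`.

```python
import re
from collections import Counter

# Small illustrative stop-word set; a real analysis would use a fuller list
STOP_WORDS = {"the", "a", "to", "is", "and", "my", "for", "on", "in", "of", "it", "was", "i", "you"}

def top_words(tweets, n=10):
    """Return the n most common words, ignoring @mentions, URLs, and stop words."""
    counter = Counter()
    for tweet in tweets:
        # Drop @handles and links so airline names in mentions do not dominate
        cleaned = re.sub(r"@\w+|https?://\S+", " ", tweet.lower())
        words = re.findall(r"[a-z']+", cleaned)
        # Skip stop words and very short tokens
        counter.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counter.most_common(n)

# Hypothetical sample tweets, for demonstration only
sample = [
    "@united my flight was cancelled and the bag is lost",
    "@USAirways flight delayed again, no help on the phone",
    "@AmericanAir cancelled flight, still on hold",
]
print(top_words(sample, n=3))
```

Removing the @mentions first keeps the airline handles from crowding out the words that describe the actual complaint.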

Let's look at positive comments to understand which services customers are most satisfied with.

positive_tweet=data[(data['airline_sentiment']=='positive')]
pos_tweet_words=positive_tweet.text.values.tolist()
pos_words=' '.join(pos_tweet_words) ## join all positive tweets into one string
plotWords(pos_words)

[Figure: word cloud of the most frequent words in positive tweets]

Frequent words such as appreciate, good, thanks, really, great, amazing, best, nice, and happy reflect the services that customers are satisfied with.

def plot_bar(title,x_label,y_label,data):
    fig, ax = plt.subplots(figsize=(10, 3))
    data.plot(kind='bar', ax=ax) ## draw on the axes created above
    ax.tick_params(axis='x', labelsize=12)
    ax.tick_params(axis='y', labelsize=12)
    ax.set_xlabel(x_label, fontsize=12) ## x_label was previously unused
    ax.set_ylabel(y_label, fontsize=12)
    ax.set_title(title, fontsize=15, fontweight='bold')

reason_count=negative_tweet['negativereason'].value_counts()
_=reason_count.plot(kind='bar')

[Figure: bar chart of negative tweet counts by reason]

airline_neg_reason=negative_tweet.groupby('airline')['negativereason'].value_counts()

def plot_sns(x,y,data):
    sns.set(rc={'figure.figsize':(10,10)})
    ax=sns.countplot(y=y,hue=x,data=data)
    ## bars are horizontal (categories on the y axis), so the count is the bar width
    for p in ax.patches:
        count = p.get_width()
        if np.isnan(count):
            count = 0
        ## place the label just to the right of the bar, centred vertically on it
        ax.annotate('{}'.format(int(count)), (count, p.get_y()+p.get_height()/2),
                    ha='left', va='center', xytext=(3, 0), textcoords='offset points')
    plt.title("Distribution of negative reasons for each airline")
    plt.show()
plot_sns('negativereason','airline',negative_tweet)


![png](/neg_service.png)


The plot and table above show that United, US Airways, and American receive far more negative tweets than Delta, Virgin America, and Southwest. Apart from Delta and Virgin America, the remaining four airlines draw many customer service complaints, and United and US Airways are also frequently reported for late flights. Comparatively, Virgin America fares best, with Delta the next choice.

# Is flight time related to the negative reason?

We will focus on the three airlines with the most negative sentiment.
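
The selection of those three airlines is not shown in the code that follows, so here is one way it could be done, as a sketch. The toy `demo` frame stands in for `data`, and `top3` and `top3_negative` are names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the tweets table (hypothetical rows, for illustration only)
demo = pd.DataFrame({
    "airline": ["United", "United", "United", "US Airways", "US Airways",
                "American", "American", "Delta", "Delta", "Virgin America"],
    "airline_sentiment": ["negative", "negative", "negative", "negative", "negative",
                          "negative", "negative", "negative", "positive", "neutral"],
})

negative = demo[demo["airline_sentiment"] == "negative"]

# value_counts() sorts in descending order, so the first three index labels
# are the airlines with the most negative tweets
top3 = negative["airline"].value_counts().head(3).index.tolist()

# Keep only the negative tweets that belong to those airlines
top3_negative = negative[negative["airline"].isin(top3)]
print(top3)
```

Applied to the real `negative_tweet` frame, the counts listed earlier suggest this would select United, US Airways, and American.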


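```python
# time-based analysis
# Parse the timestamps: utc=True converts the "-0800" offsets to UTC, and
# tz_localize(None) drops the timezone so the values are naive datetimes
data['tweet_created'] = pd.to_datetime(data['tweet_created'], utc=True).dt.tz_localize(None)
data[['airline','airline_sentiment','tweet_created','negativereason']].tail()
```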

| | airline | airline_sentiment | tweet_created | negativereason |
|---|---|---|---|---|
| 14635 | American | positive | 2015-02-22 20:01:01 | NaN |
| 14636 | American | negative | 2015-02-22 19:59:46 | Customer Service Issue |
| 14637 | American | neutral | 2015-02-22 19:59:15 | NaN |
| 14638 | American | negative | 2015-02-22 19:59:02 | Customer Service Issue |
| 14639 | American | neutral | 2015-02-22 19:58:51 | NaN |

data['tweet_created_date']=data.tweet_created.dt.date
data['tweet_created_weekday_name']=data.tweet_created.dt.day_name() ## dt.weekday_name was removed in pandas 1.0
data['tweet_created_hour']=data.tweet_created.dt.hour
data[['airline','airline_sentiment','tweet_created_weekday_name','tweet_created_hour']].tail()

| | airline | airline_sentiment | tweet_created_weekday_name | tweet_created_hour |
|---|---|---|---|---|
| 14635 | American | positive | Sunday | 20 |
| 14636 | American | negative | Sunday | 19 |
| 14637 | American | neutral | Sunday | 19 |
| 14638 | American | negative | Sunday | 19 |
| 14639 | American | neutral | Sunday | 19 |
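
The raw `tweet_created` strings carry a `-0800` offset, whereas the converted timestamps shown above carry none, which suggests the derived hours are in UTC rather than in the recorded offset. A minimal sketch, assuming the intent is to read the hour and weekday in the offset stored with each tweet (the names `raw`, `utc`, and `local` are illustrative):

```python
from datetime import timedelta, timezone

import pandas as pd

# Two timestamps in the same format as the raw column (values from the first table)
raw = pd.Series(["2015-02-24 11:35:52 -0800", "2015-02-24 11:15:59 -0800"])

# Parse with an explicit format; utc=True gives timezone-aware UTC values
utc = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z", utc=True)

# Convert back to the fixed UTC-08:00 offset recorded in the data
local = utc.dt.tz_convert(timezone(timedelta(hours=-8)))

print(utc.dt.hour.tolist())          # hour in UTC
print(local.dt.hour.tolist())        # hour in the recorded offset
print(local.dt.day_name().tolist())  # weekday name in the recorded offset
```

Whether the recorded offset matches the traveller's local time is a separate question that the data shown here does not settle.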

Negative reasons by day of the week: which day's flights attract the most negative tweets?

negative_tweet=data[(data['airline_sentiment']=='negative')]
neg_by_wkday = negative_tweet.groupby(['tweet_created_weekday_name']).negativereason.value_counts()

## one line per negative reason, one point per weekday
ax = neg_by_wkday.unstack().plot(kind='line',figsize=(10,5),rot=0,title="Negative Reasons by Day of Week")
ax.set_xlabel("Day of Week")
_ = ax.set_ylabel("Number of negative tweets")

[Figure: line plot of negative reasons by day of week]

The plot shows that Wednesday, Thursday, Friday, and Saturday flights are comparatively good. Sunday, Monday, and Tuesday flights draw the most customer service complaints and late-flight reports (the green line also shows that cancelled flights are reported most often on Sunday, Monday, and Tuesday).
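
Grouping by weekday name sorts the days alphabetically, which makes a line plot harder to read. One possible fix, sketched here with made-up counts, is to reindex the unstacked table in calendar order before plotting (`WEEKDAYS`, `by_day`, and `ordered` are names introduced for this example).

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical counts, indexed alphabetically as groupby would return them
by_day = pd.DataFrame(
    {
        "Late Flight": [5, 9, 4, 8, 3, 7, 2],
        "Customer Service Issue": [6, 12, 5, 11, 4, 10, 3],
    },
    index=sorted(WEEKDAYS),
)

# Reorder the rows from Monday to Sunday; plotting would then follow as before
ordered = by_day.reindex(WEEKDAYS)
print(ordered.index.tolist())
```

In the notebook the same idea would read `neg_by_wkday.unstack().reindex(WEEKDAYS)` before the call to `.plot(...)`.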

neg_by_time = negative_tweet.groupby(['tweet_created_hour']).negativereason.value_counts()

## one line per negative reason, one point per hour of the day
ax = neg_by_time.unstack().plot(kind='line',figsize=(10, 5),title="Negative Reasons by Hour")
ax.set_xlabel("Hour of day")
_ = ax.set_ylabel("Number of negative tweets")

[Figure: line plot of negative reasons by hour of day]

The time-based analysis reveals patterns that could help optimize airline service.

- 6 AM to 10 AM: relatively good customer satisfaction
- 7 AM to 9 AM: fewer customer service issues
- 1 PM to 7 PM: almost no late flights
- 9 AM: no cancelled flights

Flights in the 12 AM to 3 AM and 4 PM to 6 PM ranges show high customer dissatisfaction.
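
Hourly counts of negative tweets also reflect how many tweets are posted at each hour, so a normalised view may be a useful check on the observations above. A sketch with a toy frame (`demo` and `neg_share_by_hour` are illustrative names, and the rows are invented):

```python
import pandas as pd

# Toy data: hour of tweet and sentiment (hypothetical values)
demo = pd.DataFrame({
    "tweet_created_hour": [8, 8, 8, 8, 17, 17, 17, 17],
    "airline_sentiment": ["negative", "positive", "neutral", "positive",
                          "negative", "negative", "negative", "neutral"],
})

# Fraction of tweets in each hour that are negative:
# the mean of a boolean column is the share of True values
neg_share_by_hour = (
    demo["airline_sentiment"].eq("negative")
    .groupby(demo["tweet_created_hour"])
    .mean()
)
print(neg_share_by_hour)
```

An hour with few negative tweets but a high negative share would point to a quiet period rather than a well-served one.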