Airline Sentiment Analysis Project
Project Objective
- Analysing data to visualize airline trends
- What do the most positive or negative tweets for each airline talk about?
- Does the time of a flight affect service quality?
- Which reasons do customers most commonly tweet about for bad service?
- Counting retweets of negative tweets to show which service issues have the greatest impact.
- Classifying whether the sentiment of the tweets is positive, neutral, or negative using Machine Learning Techniques, then categorizing negative tweets for their reason.
Data Analysis
import pandas as pd ## for reading and understanding data
import matplotlib.pyplot as plt ## for plotting data
import seaborn as sns ## another library to visualize data features
import numpy as np ## for numerical array processing
##reading data
data=pd.read_csv('twitter-airline/Tweets.csv')
data[['airline_sentiment','negativereason','airline','retweet_count','tweet_created']].head()
| | airline_sentiment | negativereason | airline | retweet_count | tweet_created |
---|---|---|---|---|---|
0 | neutral | NaN | Virgin America | 0 | 2015-02-24 11:35:52 -0800 |
1 | positive | NaN | Virgin America | 0 | 2015-02-24 11:15:59 -0800 |
2 | neutral | NaN | Virgin America | 0 | 2015-02-24 11:15:48 -0800 |
3 | negative | Bad Flight | Virgin America | 0 | 2015-02-24 11:15:36 -0800 |
4 | negative | Can't Tell | Virgin America | 0 | 2015-02-24 11:14:45 -0800 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 8 columns):
tweet_id 14640 non-null int64
text 14640 non-null object
airline_sentiment 14640 non-null object
airline_sentiment_confidence 14640 non-null float64
negativereason 9178 non-null object
airline 14640 non-null object
retweet_count 14640 non-null int64
tweet_created 14640 non-null object
dtypes: float64(1), int64(2), object(5)
memory usage: 915.1+ KB
sentiments=pd.crosstab(data.airline, data.airline_sentiment)
sentiments
airline_sentiment | negative | neutral | positive |
---|---|---|---|
airline | |||
American | 1960 | 463 | 336 |
Delta | 955 | 723 | 544 |
Southwest | 1186 | 664 | 570 |
US Airways | 2263 | 381 | 269 |
United | 2633 | 697 | 492 |
Virgin America | 181 | 171 | 152 |
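Raw counts favor airlines that are simply tweeted about less, so it can help to compare the share of each sentiment per airline. The sketch below rebuilds the crosstab above by hand (the `counts` and `shares` names are illustrative, not from the original notebook) and normalizes each row. With the full data loaded, `pd.crosstab(data.airline, data.airline_sentiment, normalize='index')` should give the same shares.

```python
import pandas as pd

# Counts copied from the crosstab above (illustrative reconstruction)
counts = pd.DataFrame(
    {
        "negative": [1960, 955, 1186, 2263, 2633, 181],
        "neutral": [463, 723, 664, 381, 697, 171],
        "positive": [336, 544, 570, 269, 492, 152],
    },
    index=["American", "Delta", "Southwest", "US Airways", "United", "Virgin America"],
)

# Divide each row by its total so every airline's shares sum to 1
shares = counts.div(counts.sum(axis=1), axis=0)

# Airlines ordered from highest to lowest share of negative tweets
print(shares["negative"].sort_values(ascending=False).round(3))
```

By this measure US Airways, American, and United still rank worst, but US Airways moves ahead of United once tweet volume is accounted for, and Virgin America has the lowest negative share.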
negative_tweet=data[(data['airline_sentiment']=='negative')]
negative_tweet[['airline','negativereason','text']].head()
| | airline | negativereason | text |
---|---|---|---|
3 | Virgin America | Bad Flight | @VirginAmerica it's really aggressive to blast... |
4 | Virgin America | Can't Tell | @VirginAmerica and it's a really big bad thing... |
5 | Virgin America | Can't Tell | @VirginAmerica seriously would pay $30 a fligh... |
15 | Virgin America | Late Flight | @VirginAmerica SFO-PDX schedule is still MIA. |
17 | Virgin America | Bad Flight | @VirginAmerica I flew from NYC to SFO last we... |
negative_tweet.airline.value_counts() # count negative tweets per airline to find the airline with the most negative tweets in this 2015 sample
United 2633
US Airways 2263
American 1960
Southwest 1186
Delta 955
Virgin America 181
Name: airline, dtype: int64
Most common words in negative tweets
from wordcloud import WordCloud
def plotWords(words):
    wordcloud=WordCloud(width=1200, height=600, random_state=21,max_font_size=110).generate(words)
    plt.figure(figsize=(10,7))
    plt.imshow(wordcloud,interpolation="bilinear")
    plt.axis('off')
    plt.show()
neg_tweet_words=negative_tweet.text.values.tolist()
neg_words=' '.join([text for text in neg_tweet_words])
plotWords(neg_words)
The plot shows which airlines are tweeted about most with negative sentiment, along with the reasons for the negativity.
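A word cloud built from raw tweet text tends to be dominated by airline handles and filler words. As a complement, the sketch below counts words after stripping @mentions, URLs, and a small stopword list. The `top_words` helper, the `STOPWORDS` set, and the sample tweets are all hypothetical and not part of the original notebook.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; swap in a fuller list as needed)
STOPWORDS = {"the", "a", "an", "to", "and", "is", "it", "my", "i",
             "for", "on", "of", "in", "you", "was", "that", "this"}

def top_words(texts, n=10):
    """Return the n most frequent words after dropping mentions, URLs, and stopwords."""
    counter = Counter()
    for text in texts:
        # Remove @handles and links before tokenizing
        cleaned = re.sub(r"@\w+|https?://\S+", " ", text.lower())
        words = re.findall(r"[a-z']+", cleaned)
        counter.update(w for w in words if w not in STOPWORDS and len(w) > 1)
    return counter.most_common(n)

# Hypothetical sample tweets, not taken from the dataset
sample = [
    "@united my flight was delayed again http://t.co/abc",
    "@USAirways delayed flight and no customer service",
    "@AmericanAir flight cancelled, terrible service",
]
print(top_words(sample, n=3))
```

On the real data this could be called as `top_words(negative_tweet.text, n=20)`.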
Let's look at positive comments to understand which services customers are most satisfied with.
positive_tweet=data[(data['airline_sentiment']=='positive')]
pos_tweet_words=positive_tweet.text.values.tolist()
pos_words=' '.join([text for text in pos_tweet_words])
plotWords(pos_words)
Words such as appreciate, good, thanks, really, great, amazing, best, nice, and happy dominate, pointing to the services customers are satisfied with.
def plot_bar(title,x_label,y_label,data):
    fig, ax = plt.subplots(figsize=(10, 3))
    data.plot(kind='bar', ax=ax)  # draw on the axes created above
    ax.tick_params(axis='x', labelsize=12)
    ax.tick_params(axis='y', labelsize=12)
    ax.set_xlabel(x_label, fontsize=12)
    ax.set_ylabel(y_label, fontsize=12)
    ax.set_title(title, fontsize=15, fontweight='bold')
reason_count=negative_tweet['negativereason'].value_counts()
_=reason_count.plot(kind='bar')
airline_neg_reason=negative_tweet.groupby('airline')['negativereason'].value_counts()
def plot_sns(x,y,data):
    sns.set(rc={'figure.figsize':(10,10)})
    ax=sns.countplot(y=y,hue=x,data=data)
    # Bars are horizontal (y= is set), so the count is the bar width, not its height
    for p in ax.patches:
        count = p.get_width()
        if np.isnan(count):
            count = 0
        # Place the label just past the end of the bar, centered on it vertically
        ax.annotate('{}'.format(int(count)), (count, p.get_y()+p.get_height()/2),
                    ha='left', va='center', xytext=(3, 0), textcoords='offset points')
    plt.title("Distribution of negative reason for each airline")
    plt.show()
plot_sns('negativereason','airline',negative_tweet)
![png](/neg_service.png)
The plot and table above show that United, US Airways, and American receive more negative tweets about their service than Delta, Virgin America, and Southwest. Apart from Delta and Virgin America, the other four airlines draw many customer service complaints, and United and US Airways are also frequently reported for late flights. Comparatively, Virgin America fares best, with Delta the next best choice.
# Is flight time related to the negative reason?
We will focus on the top three airlines by negative sentiment.
```python
# time-based analysis
# Parse the offset-aware strings as UTC, then drop the timezone so the column holds naive UTC timestamps
data['tweet_created']=pd.to_datetime(data['tweet_created'], utc=True).dt.tz_localize(None)
data[['airline','airline_sentiment','tweet_created','negativereason']].tail()
```
| | airline | airline_sentiment | tweet_created | negativereason |
---|---|---|---|---|
14635 | American | positive | 2015-02-22 20:01:01 | NaN |
14636 | American | negative | 2015-02-22 19:59:46 | Customer Service Issue |
14637 | American | neutral | 2015-02-22 19:59:15 | NaN |
14638 | American | negative | 2015-02-22 19:59:02 | Customer Service Issue |
14639 | American | neutral | 2015-02-22 19:58:51 | NaN |
data['tweet_created_date']=data.tweet_created.dt.date
data['tweet_created_weekday_name']=data.tweet_created.dt.day_name() ## dt.weekday_name was removed in pandas 1.0
data['tweet_created_hour']=data.tweet_created.dt.hour
data[['airline','airline_sentiment','tweet_created_weekday_name','tweet_created_hour']].tail()
| | airline | airline_sentiment | tweet_created_weekday_name | tweet_created_hour |
---|---|---|---|---|
14635 | American | positive | Sunday | 20 |
14636 | American | negative | Sunday | 19 |
14637 | American | neutral | Sunday | 19 |
14638 | American | negative | Sunday | 19 |
14639 | American | neutral | Sunday | 19 |
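The raw `tweet_created` strings carry a `-0800` offset, while the converted values shown above carry none, so the derived hour depends on which time zone the timestamps end up in. A minimal sketch on a single made-up timestamp, assuming US Pacific time (`America/Los_Angeles`) as the local zone of interest:

```python
import pandas as pd

# One hypothetical timestamp in the same format as the tweet_created column
raw = pd.Series(["2015-02-22 12:01:01 -0800"])

# Parse to timezone-aware UTC, then convert to a local zone
utc = pd.to_datetime(raw, utc=True)
local = utc.dt.tz_convert("America/Los_Angeles")

# Hour of day as seen in UTC versus Pacific time
print(utc.dt.hour.iloc[0], local.dt.hour.iloc[0])
print(local.dt.day_name().iloc[0])
```

If the hours used in the analysis are in UTC, any time-of-day finding shifts by the UTC offset when read as local time, which is worth keeping in mind when interpreting the hourly plot.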
Negative reason vs. day of the week: which day's flights draw the most negative tweets?
negative_tweet=data[(data['airline_sentiment']=='negative')]
neg_by_wkday = negative_tweet.groupby(['tweet_created_weekday_name']).negativereason.value_counts()
neg_by_wkday = neg_by_wkday.unstack().plot(kind='line',figsize=(10,5),rot=0,title="Negative Reasons by Day of Week")
neg_by_wkday.set_xlabel("Day of Week")
neg_by_wkday.set_ylabel("Negative Reason")
Text(0, 0.5, 'Negative Reason')
The plot shows that Friday, Saturday, Thursday, and Wednesday flights are comparatively good. Monday, Sunday, and Tuesday flights have more customer service problems and are more often late (the green line also shows that flight cancellations are more likely on Monday, Sunday, and Tuesday).
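`groupby` sorts the weekday names alphabetically, so the x-axis of the plot does not follow the calendar. A sketch of restoring calendar order with `reindex` before plotting, using a small hypothetical count table in place of the unstacked counts (the `WEEKDAYS`, `counts`, and `ordered` names and all numbers are made up for illustration):

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical counts, indexed alphabetically as groupby would return them
counts = pd.DataFrame(
    {
        "Late Flight": [3, 9, 2, 8, 4, 7, 5],
        "Cancelled Flight": [1, 6, 1, 5, 2, 4, 2],
    },
    index=sorted(WEEKDAYS),
)

# Reorder rows to calendar order; a day missing from the data would become a NaN row
ordered = counts.reindex(WEEKDAYS)
print(ordered)
```

Calling `ordered.plot(kind='line')` then draws Monday through Sunday from left to right, which makes weekly patterns easier to read.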
neg_by_time = negative_tweet.groupby(['tweet_created_hour']).negativereason.value_counts()
neg_by_time = neg_by_time.unstack().plot(kind='line',figsize=(10, 5),title="Negative Reasons by Hour")
neg_by_time.set_xlabel("Time")
neg_by_time.set_ylabel("Negative Reason")
Text(0, 0.5, 'Negative Reason')
The time-based analysis suggests where airline service could be optimized.
- Relatively good customer satisfaction period: 6 AM to 10 AM
- 7 AM to 9 AM: fewer customer service issues
- 1 PM to 7 PM: almost no late flights
- 9 AM: no cancelled flights
Flights between 12:00 AM and 3:00 AM and between 4:00 PM and 6:00 PM show high customer dissatisfaction.
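Hourly counts of negative tweets also rise and fall with overall tweet volume, so a rate can be easier to interpret than a count. A sketch computing the share of negative tweets per hour on a small made-up frame that mimics the `airline_sentiment` and `tweet_created_hour` columns (the `sample` rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows; the real values would come from the `data` frame
sample = pd.DataFrame(
    {
        "tweet_created_hour": [8, 8, 8, 8, 17, 17, 17, 17],
        "airline_sentiment": ["negative", "positive", "neutral", "positive",
                              "negative", "negative", "negative", "neutral"],
    }
)

# The mean of a boolean column is the fraction of True values in each group
is_negative = sample["airline_sentiment"].eq("negative")
neg_share = is_negative.groupby(sample["tweet_created_hour"]).mean()
print(neg_share)
```

Applied to the full data, hours with a high negative share but few tweets would stand out differently than they do in the raw counts.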