Use Your Computer to Make Informed Decisions in Stock Trading: Practical Introduction — Part 3: Sentiment Analysis of Financial News

13 min read · Jul 28, 2020

Sunset in Courtmacsherry, south of Ireland

Is there a relation between all that hype in the media regarding a certain company and how the company is actually doing? Can you actually rely on the headlines to make a decision on whether to buy, sell or hold the stock? In this article, we’ll try to answer this question: can the stock market be influenced by the news? In particular, we will try to automatically get the list of news using a news API, apply sentiment analysis, and compare the results with the stock prices. Moreover, we will scale the approach: get daily news about stocks and compare its sentiment versus S&P 500 index performance.

This is the third part in the series on how to take advantage of computer technologies to make informed decisions in stock trading. Part 1 guided you through setting up the working environment needed to follow along with the examples in the rest of the series. Part 2 then explored several well-known finance APIs that let you obtain and analyse stock data programmatically.

Sentiment Analysis of News

If we are talking about a well-known company, the media usually cover any significant event in its life (a new contract, a new business line, a strong executive hire, mergers, acquisitions, partnerships, etc.) as well as its financial results (quarterly and annual earnings, profits, earnings per share, etc.). The idea is the following: with the help of a news API, we can automatically retrieve news articles about a certain company that were published within a specified interval, say, on the day before the company’s annual general meeting, on the day of the event, and on the next day.

Then, with the help of natural language processing (NLP) techniques, such as sentiment analysis, we can programmatically figure out what emotions prevail (positive, negative or neutral) in those articles. Since sentiment analysis provides a way to represent emotions numerically, you’ll be able to compare the overall sentiment for a certain company for a specified period with the stock’s price performance.

News coverage is far more than just a source of facts. Actually, news can shape our views of many things around us and finance is no exception. There is a possible connection between a stock’s price jump and the news: either the news can cause the stock jump, or they can explain it afterwards. While it is hard to tell which news in particular had a strong influence on a stock’s price, we suggest there are a lot of “emotional” traders who make a judgement based on the polarity of news coverage.

News API

For the purpose of our project in this article, you can use the News API, which lets you get the most relevant news articles about stocks in general or about a certain company in particular. A request to the API is similar to a web search but allows you to narrow down the results by specifying the publication interval for the required articles. The News API is easy to use (through a direct HTTP request or the Python wrapper library), although the free plan limits the number of calls (250 requests every 12 hours) and provides only one month of historical data.

Before you can use News API, you’ll need to obtain an API key. This can be done for free at https://newsapi.org/. After that, you can start sending requests to the API. So, before going any further, let’s make sure that the API works as expected.

As usual, create a new notebook in Google Colab. If you forgot how to do it, see part 1. Then, install the newsapi-python wrapper for the News API in the notebook.

!pip install newsapi-python

After that, insert and run the following code in a code cell:

from newsapi import NewsApiClient
from datetime import date, timedelta

phrase = 'Apple stock'
newsapi = NewsApiClient(api_key='your_news_api_key_here')
my_date = date.today() - timedelta(days=7)

# Fetch the 5 most relevant English-language articles from the last 7 days
articles = newsapi.get_everything(q=phrase,
                                  from_param=my_date.isoformat(),
                                  language="en",
                                  sort_by="relevancy",
                                  page_size=5)

for article in articles['articles']:
  print(article['title'] + ' | ' + article['publishedAt'] + ' | ' + article['url'])

This should give you the titles, publication dates, and links for 5 news articles about Apple stock, published in the last 7 days. There is also an article description which we don’t print for now, but it will be used for the sentiment analysis.

So the output might look as follows:

Daily Crunch: Apple commits to carbon neutrality | 2020-07-21T22:10:52Z | http://techcrunch.com/2020/07/21/daily-crunch-apple-commits-to-carbon-neutrality/
Daily Crunch: Slack files antitrust complaint against Microsoft | 2020-07-22T22:16:02Z | http://techcrunch.com/2020/07/22/daily-crunch-slack-microsoft-antitrust/
Jamf ups its IPO range, now targets a valuation of up to $2.7B | 2020-07-20T17:04:25Z | http://techcrunch.com/2020/07/20/jamf-ups-its-ipo-range-now-targets-a-valuation-of-up-to-2-7b/
S&P 500 turns positive for 2020, but most stocks are missing the party - Reuters | 2020-07-21T19:45:00Z | https://www.reuters.com/article/us-usa-stocks-performance-idUSKCN24M2RD
Avoid Apple stock as uncertainties from coronavirus weigh on iPhone launch, Goldman Sachs says | 2020-07-23T13:50:13Z | https://www.businessinsider.com/apple-stock-price-rally-risk-coronavirus-iphone-delay-earnings-goldman-2020-7

As you can see, only two of the articles are directly about Apple; the other three merely mention the company name in the text and remain somewhat relevant.

You can also search within article titles only by passing the qInTitle parameter instead of q (see the News API documentation). The caveat is that, at the time of writing, the Python wrapper library does not implement it, so you need to send an HTTP request to the API directly instead of using the simpler wrapper method.
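Such a direct request can be sketched with the standard library alone. The endpoint and the parameter names other than qInTitle (from, language, sortBy, pageSize, apiKey) are, to my knowledge, the ones used in the News API documentation; the helper name build_everything_url is hypothetical. The snippet only builds the URL without sending it, so treat it as a starting point rather than a tested client.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# News API endpoint for article search (assumed from the public documentation)
EVERYTHING_ENDPOINT = "https://newsapi.org/v2/everything"


def build_everything_url(title_phrase, from_date, api_key, page_size=5):
    """Build a request URL that searches article titles only (hypothetical helper)."""
    params = {
        "qInTitle": title_phrase,       # search in titles instead of the full text
        "from": from_date.isoformat(),  # oldest publication date to include
        "language": "en",
        "sortBy": "relevancy",
        "pageSize": page_size,
        "apiKey": api_key,
    }
    return EVERYTHING_ENDPOINT + "?" + urlencode(params)


url = build_everything_url("Apple stock", date.today() - timedelta(days=7), "your_news_api_key_here")
print(url)
```

You could then fetch this URL with, for example, urllib.request.urlopen(url) and parse the JSON body, which should contain the same list of articles that the wrapper returns.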

Still, one question remains open: which articles should be selected for the analysis, and from what sources?

VADER Sentiment Analysis

The crucial step is to understand what sentiment analysis actually is and why it works here. The best source is the original paper “VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text”.

I will summarise its main principles here:

Polarity of a text is summarised from the polarity of individual words (e.g. positive: love, nice, good, great — 406 words; negative: hurt, ugly, sad, bad, worse — 499 words).
Sentiment intensity is taken into account (e.g. degree modifiers: “extremely good” is stronger than “good”).
Human-curated gold-standard resources: 20 people were hired to evaluate the predictions on the different types of text (tweets, reviews, tech, and news). Opinion news articles: included 5190 sentence-level snippets from 500 New York Times opinion editorials. VADER showed the highest correlation with human scores among all tested approaches on all types of text.
Context awareness: e.g. the word catch has negative sentiment in “At first glance the contract looks good, but there’s a catch”, but is neutral in “The fisherman plans to sell his catch at the market”.
Punctuation: e.g. the exclamation point (!) and CAPITALISATION increase the magnitude of the intensity without modifying the semantic orientation. For example, “The food here is good!!!” is more intense than “The food here is good.”.
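No machine learning is needed at scoring time: VADER is a lexicon- and rule-based model. In the paper, machine-learning classifiers (such as Naive Bayes) serve as benchmarks that VADER is compared against.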

After applying these rules to the text, a single sentiment score is calculated: a value between -1 (strongly negative) and +1 (strongly positive).
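To make the -1 to +1 range concrete: as far as I recall the reference implementation, the summed word valences x are squashed into that range with x / sqrt(x^2 + alpha), where alpha = 15, and the project’s README suggests +0.05 and -0.05 as thresholds for calling a text positive or negative. The sketch below reproduces only that last step; the function names and the sample sums are illustrative and are not part of NLTK’s API.

```python
import math


def normalize(valence_sum, alpha=15):
    """Map an unbounded sum of word valences into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    return max(-1.0, min(1.0, score))


def label(compound, threshold=0.05):
    """Turn a compound score into a coarse positive / negative / neutral label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative valence sums (made-up numbers, not taken from the VADER lexicon)
for valence_sum in (-6.0, -0.1, 0.0, 1.9, 6.0):
    compound = normalize(valence_sum)
    print(f"{valence_sum:+.1f} -> {compound:+.4f} ({label(compound)})")
```

The square root in the denominator is what keeps long, emotionally loaded texts from exceeding the bounds: the larger the raw sum, the closer the score gets to -1 or +1 without ever passing them.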

Performing Sentiment Analysis for News

Let’s create a new notebook for this project (a single script actually). To improve readability, we’ll place the code within several code cells. In the first one, install the newsapi-python library in the notebook, just as you did for the test discussed in the previous section:

!pip install newsapi-python

You will also need to install the yfinance library to access the Yahoo Finance API covered in part 2:

!pip install yfinance

Next comes the import section with all the required libraries:

import sys
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from newsapi import NewsApiClient
from datetime import date, timedelta, datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import yfinance as yf

Let’s download the lexicon needed to run the VADER sentiment analysis and then create the analyser, which loads that lexicon when it is constructed:

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

In another cell, make sure to set the following option for pandas to see the full output in Colab:

pd.set_option('display.max_colwidth', 1000)

To start with, let’s download the first 100 English-language articles matching a search keyword for a specific date, sorted by relevancy.

We’ll use the following function to call the sources endpoint and filter for suitable sources. Put it into another cell:

def get_sources(category=None):
  newsapi = NewsApiClient(api_key='your_api_key_here')
  sources = newsapi.get_sources()
  # Keep English-language sources only, optionally restricted to one category
  if category is not None:
    rez = [source['id'] for source in sources['sources'] if source['category'] == category and source['language'] == 'en']
  else:
    rez = [source['id'] for source in sources['sources'] if source['language'] == 'en']

  return rez
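The snippets in this article hardcode the API key for brevity. If you share your notebook, one option is to read the key from an environment variable instead. A minimal sketch follows; the variable name NEWS_API_KEY and the helper get_api_key are my own choices, not something the News API or its wrapper prescribes:

```python
import os


def get_api_key(env_var="NEWS_API_KEY"):
    """Read the News API key from the environment (hypothetical variable name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your News API key")
    return key


# In Colab you could set the variable once per session before running this cell.
# The line below only supplies a placeholder so that the demo runs end to end.
os.environ.setdefault("NEWS_API_KEY", "your_api_key_here")
api_key = get_api_key()
```

You would then pass the result as NewsApiClient(api_key=api_key) wherever a client is created.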

Let’s now check out how many English (en) sources are available:

len(get_sources())
81

And here are the business sources:

#Get the list of the business news sources
get_sources('business')
['australian-financial-review',
'bloomberg',
'business-insider',
'business-insider-uk',
'financial-post',
'fortune',
'the-wall-street-journal']

Next, define the function that implements the algorithm and calculates the sentiment figures:

def get_articles_sentiments(keywrd, startd, sources_list=None, show_all_articles=False):
  # Note: show_all_articles is accepted by the calls below but is not used here
  newsapi = NewsApiClient(api_key='your_api_key_here')

  # Accept the date either as a 'dd-Mon-YYYY' string or as a date object
  if isinstance(startd, str):
    my_date = datetime.strptime(startd, '%d-%b-%Y')
  else:
    my_date = startd

  # If the sources list is provided - use it
  if sources_list:
    articles = newsapi.get_everything(q=keywrd, from_param=my_date.isoformat(), to=(my_date + timedelta(days=1)).isoformat(), language="en", sources=",".join(sources_list), sort_by="relevancy", page_size=100)
  else:
    articles = newsapi.get_everything(q=keywrd, from_param=my_date.isoformat(), to=(my_date + timedelta(days=1)).isoformat(), language="en", sort_by="relevancy", page_size=100)

  date_sentiments_list = []
  seen = set()

  for article in articles['articles']:
    # Skip articles whose title has already been processed
    if str(article['title']) in seen:
      continue
    seen.add(str(article['title']))

    article_content = str(article['title']) + '. ' + str(article['description'])

    # Get the sentiment score
    sentiment = sia.polarity_scores(article_content)['compound']
    date_sentiments_list.append((sentiment, article['url'], article['title'], article['description']))

  # Return a dataframe with all sentiment scores and articles (after the loop has finished)
  return pd.DataFrame(date_sentiments_list, columns=['Sentiment', 'URL', 'Title', 'Description'])
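One detail worth noting: the News API can return a null description for an article, and str(None) would then add the literal word “None” to the text being scored. Here is a small sketch of the de-duplication and text-building steps that skips missing fields. The helper names and the sample articles are made up for illustration:

```python
def article_text(article):
    """Join title and description, skipping fields that are missing or None."""
    parts = [article.get("title"), article.get("description")]
    return ". ".join(p.strip() for p in parts if p)


def unique_by_title(articles):
    """Keep the first article for each title, preserving the original order."""
    seen = set()
    unique = []
    for article in articles:
        title = str(article.get("title"))
        if title not in seen:
            seen.add(title)
            unique.append(article)
    return unique


# Made-up sample in the shape of the 'articles' list of an API response
sample = [
    {"title": "Stocks rally", "description": "Markets closed higher."},
    {"title": "Stocks rally", "description": "Duplicate headline from another feed."},
    {"title": "Oil slips", "description": None},
]
texts = [article_text(a) for a in unique_by_title(sample)]
print(texts)
```

Each resulting string could then be passed to sia.polarity_scores() exactly as in the function above.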

You can now perform some tests using the above function. First, we’ll look at how it works for all news found for the keyword ‘stock’, for a certain date, and for ALL ‘en’ sources:

return_articles = get_articles_sentiments(keywrd='stock', startd='21-Jul-2020', sources_list=None, show_all_articles=True)

return_articles.Sentiment.hist(bins=30, grid=False)
print(return_articles.Sentiment.mean())
print(return_articles.Sentiment.count())
print(return_articles.Description)

As a result, you will see 100 articles: many of them have neutral sentiment, and the rest of the distribution is skewed towards very positive scores.

This is what a fragment from the list of found articles might look like (the two most negative articles):

return_articles.sort_values(by='Sentiment', ascending=True)[['Sentiment','URL']].head(2)

    Sentiment  URL
58  -0.9062    https://www.reuters.com/article/india-nepal-palmoil-idUSL3N2ES1Y3
59  -0.8360    https://in.reuters.com/article/volvocars-results-idINKCN24M1D7

If you visit the first link (https://www.reuters.com/article/india-nepal-palmoil-idUSL3N2ES1Y3 ), you’ll find: ‘Nepal stops buying (New Dehli Suspended 39 oil import…)’, which says it all.

You might want to look at the end of the same list, still sorted in ascending order, to see the articles with the highest sentiment scores:

return_articles.sort_values(by='Sentiment', ascending=True)[['Sentiment','URL']].tail(2)

    Sentiment  URL
37  0.9382     https://www.reuters.com/article/japan-stocks-midday-idUSL3N2ES06S
40  0.9559     https://www.marketwatch.com/story/best-buy-says-sales-are-better-during-pandemic-stock-heads-toward-all-time-high-2020-07-21

From the article above: “TOKYO, July 21 (Reuters) — Japanese stocks rose on Tuesday as signs of progress in developing a COVID-19 vaccine boosted investor confidence in the outlook for future economic growth.”

Let’s now look for articles about stocks for the same date, but in business sources only:

sources = get_sources('business')
return_articles = get_articles_sentiments('stock', '21-Jul-2020', sources_list=sources, show_all_articles=True)

return_articles.Sentiment.hist(bins=30, grid=False)
print(return_articles.Sentiment.mean())
print(return_articles.Sentiment.count())
print(return_articles.Description)

This is what the output might look like, starting with the overall sentiment rank:

#Mean sentiment on 67 business articles
0.13
#Articles from the business sources
67
#Articles description examples
0 <ul>\n<li>Tesla CEO Elon Musk appears to have unlocked the second of his compensation goals on Tuesday. </li>\n<li>Despite a slight dip Tuesday, the company’s average market cap has been above $150 billion for long enough to unlock the second tranche of stock a…
1 <ul>\n<li>There’s a lot riding on Tesla’s second-quarter earnings report Wednesday afternoon.</li>\n<li>Analysts expect the company to post a $75 million loss for the three-month period ended June 31.</li>\n<li>Despite factory shutdowns and falling deliveries, t…
2 <ul>\n<li>Tesla reports its highly anticipated second-quarter earnings on Wednesday after market close. </li>\n<li>The report comes after the automaker’s second-quarter vehicle delivery numbers beat Wall Street expectations. </li>\n<li>Investors and analysts wil…
...

21-Jul-2020, sentiment on 67 articles about stocks (business sources)

You can compare the results with the previous day, this time taking all news about stocks:

return_articles = get_articles_sentiments('stock', '20-Jul-2020', show_all_articles=True)

return_articles.Sentiment.hist(bins=30, grid=False)
return_articles.Sentiment.mean()

Since we analyse all sources, you should find similar results from the first 100 articles on stocks:

#Mean sentiment on 100 articles
0.22501616161616164

20-Jul-2020, Sentiment distribution for 1 day of stock news for all sources (top 100 articles sentiment)

There are more articles here (100 vs. 67 from business sources), so the mean sentiment should capture more signals from various sources. The drawback is that it can now include news from smaller newspapers, which don’t have a wide audience among people who trade.

You may try to find the correlation of a stock or index price with other metrics, such as the top negative score, the top positive score, the difference between the two, or the median sentiment over all articles. We will continue using the mean() estimate for the rest of the code.
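These alternative daily metrics are straightforward to compute from the list of compound scores for one day. Here is a sketch using only the standard library. The function name daily_metrics is hypothetical and the scores are made-up placeholders rather than real results:

```python
from statistics import mean, median


def daily_metrics(scores):
    """Summarise one day's compound scores in several alternative ways."""
    top_negative = min(scores)
    top_positive = max(scores)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "top_negative": top_negative,
        "top_positive": top_positive,
        # Distance between the most positive and the most negative article
        "spread": top_positive - top_negative,
    }


# Illustrative input only: a handful of made-up compound scores
example_scores = [-0.5, 0.0, 0.25, 0.75]
for name, value in daily_metrics(example_scores).items():
    print(f"{name}: {value}")
```

With the dataframe returned by get_articles_sentiments, you could pass return_articles.Sentiment.tolist() as the input.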

Now let’s check the whole month: get top daily news and sentiments about the stock market from all sources and business newspapers:

#FREE NewsAPI allows to retrieve only 1 month of news data
end_date = date.today()
# 30 days back; subtracting 1 from the month number would fail in January
start_date = end_date - timedelta(days=30)
print('Start day = ', start_date)
print('End day = ', end_date)

current_day = start_date
business_sources = get_sources('business')
sentiment_all_score = []
sentiment_business_score = []
dates = []

while current_day <= end_date:
  dates.append(current_day)
  sentiments_all = get_articles_sentiments(keywrd='stock', startd=current_day, sources_list=None, show_all_articles=True)
  # Average the Sentiment column only; the other columns are text
  sentiment_all_score.append(sentiments_all['Sentiment'].mean())
  sentiments_business = get_articles_sentiments(keywrd='stock', startd=current_day, sources_list=business_sources, show_all_articles=True)
  sentiment_business_score.append(sentiments_business['Sentiment'].mean())

  current_day = current_day + timedelta(days=1)
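A note on the start date: “the same day one month ago” is not always a valid date (there is no 31 June), and for January the previous month belongs to the previous year. Below is a minimal standard-library sketch of a calendar-aware helper. The name one_month_back is hypothetical, and clamping to the last day of a shorter month is my own design choice:

```python
import calendar
from datetime import date


def one_month_back(day):
    """Return the same day of the previous month, clamped to that month's length."""
    if day.month == 1:
        year, month = day.year - 1, 12
    else:
        year, month = day.year, day.month - 1
    last_day = calendar.monthrange(year, month)[1]  # number of days in the target month
    return date(year, month, min(day.day, last_day))


print(one_month_back(date(2020, 7, 21)))  # 2020-06-21
print(one_month_back(date(2021, 1, 15)))  # 2020-12-15
print(one_month_back(date(2020, 3, 31)))  # 2020-02-29 (clamped, 2020 is a leap year)
```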

You might want to compare the overall sentiment figures for the articles retrieved with and without ‘business’ category filter. For that, we’ll create a pandas dataframe as follows:

sentiments = pd.DataFrame([dates, np.array(sentiment_all_score), np.array(sentiment_business_score)]).transpose()
sentiments.columns = ['Date', 'All_sources_sentiment', 'Business_sources_sentiment']
sentiments['Date'] = pd.to_datetime(sentiments['Date'])
sentiments['All_sources_sentiment'] = sentiments['All_sources_sentiment'].astype(float)
sentiments['Business_sources_sentiment'] = sentiments['Business_sources_sentiment'].astype(float)

Before going any further, let’s look at the structure of the dataframe we finally got:

sentiments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 3 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Date                        31 non-null     datetime64[ns]
 1   All_sources_sentiment       31 non-null     float64
 2   Business_sources_sentiment  31 non-null     float64
dtypes: datetime64[ns](1), float64(2)
memory usage: 872.0 bytes

Now let’s make the Date column the index so that we can join the dataframe with other data sources:

sentiments.set_index("Date", inplace=True)
sentiments.head()

Date        All_sources_sentiment  Business_sources_sentiment
2020-06-21  0.209889               0.111956
2020-06-22  0.219228               0.155876
2020-06-23  0.115508               0.102921
2020-06-24  0.084642               0.017751
2020-06-25  0.155524               0.005206

OK, now that we have daily sentiment figures for 1 month, why don’t we compare them with real market figures for this same period, say, with S&P500 index?

Checking Daily Stock News Sentiment vs. Growth of S&P500 Index

As you might recall from the discussion on S&P500 index in part 2, it can be obtained as follows:

import pandas_datareader.data as pdr

end = date.today()
# 30 days back at midnight; subtracting 1 from the month number would fail in January
start = datetime.combine(end - timedelta(days=30), datetime.min.time())
print(f'Period 1 month until today: {start} to {end} ')

Period 1 month until today: 2020-06-21 00:00:00 to 2020-07-21

Now we can obtain the index daily close prices:

spx_index = pdr.get_data_stooq('^SPX', start, end)
spx_index.index

DatetimeIndex(['2020-07-21', '2020-07-20', '2020-07-17', '2020-07-16',
               '2020-07-15', '2020-07-14', '2020-07-13', '2020-07-10',
               '2020-07-09', '2020-07-08', '2020-07-07', '2020-07-06',
               '2020-07-02', '2020-07-01', '2020-06-30', '2020-06-29',
               '2020-06-26', '2020-06-25', '2020-06-24', '2020-06-23',
               '2020-06-22'],
              dtype='datetime64[ns]', name='Date', freq=None)

In the next step, you might want to make a plot with the S&P500 data:

spx_index['Close'].plot(title='1 month price history for index S&P500 Index')

S&P 500 index history, 21-June-2020 to 21-July-2020

Now let’s join our sentiment data with S&P500 index data:

sentiments_vs_snp = sentiments.join(spx_index['Close']).dropna()
sentiments_vs_snp.rename(columns={'Close':'s&p500_close'}, inplace=True)
sentiments_vs_snp.head()

Date        All_sources_sentiment  Business_sources_sentiment  s&p500_close
2020-06-22  0.219228               0.155876                    3117.86
2020-06-23  0.115508               0.102921                    3131.29
2020-06-24  0.084642               0.017751                    3050.33
2020-06-25  0.155524               0.005206                    3083.76
2020-06-26  0.124339               0.008645                    3009.05
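The join keeps a date only when both tables have a row for it, which is why 2020-06-21 disappears: it was a Sunday, so there is no closing price for it. The same logic in plain Python, using the first rows printed above (an illustration of the idea, not of how pandas implements it):

```python
from datetime import date

# Mean all-sources sentiment per calendar day (first rows of the sentiments table)
sentiment = {
    date(2020, 6, 21): 0.209889,
    date(2020, 6, 22): 0.219228,
    date(2020, 6, 23): 0.115508,
    date(2020, 6, 24): 0.084642,
    date(2020, 6, 25): 0.155524,
}
# S&P500 close per trading day (first rows of the joined table)
close = {
    date(2020, 6, 22): 3117.86,
    date(2020, 6, 23): 3131.29,
    date(2020, 6, 24): 3050.33,
    date(2020, 6, 25): 3083.76,
    date(2020, 6, 26): 3009.05,
}

# Left join on the sentiment dates, then drop the dates without a closing price
joined = {day: (sentiment[day], close[day]) for day in sorted(sentiment) if day in close}

dropped = sorted(set(sentiment) - set(close))
for day in dropped:
    print(day, day.strftime("%A"), "has no closing price")
```

Keep this in mind when interpreting the plots: weekend and holiday news is averaged into the sentiment series but then discarded by the join.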

How would the sentiment from all news sources and the S&P500 data look on the same plot (left axis = S&P500 index, right axis = average news sentiment)?

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(rc={'figure.figsize':(13.0,8.0)})
ax = sns.lineplot(data=sentiments_vs_snp['s&p500_close'], color="b", label='S&P500 Close price')
ax2 = plt.twinx()
sns.lineplot(data=sentiments_vs_snp["All_sources_sentiment"], color="g", ax=ax2, label='All sources sentiment')

S&P500 index data (blue line, left axis) vs. All sources mean news sentiment (green line, right axis)

You might also want to compare the sentiment figures obtained from the articles in the business category with the S&P500 data:

sns.set(rc={'figure.figsize':(13.0,8.0)})
ax = sns.lineplot(data=sentiments_vs_snp['s&p500_close'], color="b", label='S&P500 Close price')
ax2 = plt.twinx()
sns.lineplot(data=sentiments_vs_snp["Business_sources_sentiment"], color="g", ax=ax2, label='Business_sources_sentiment')

S&P500 index data (blue line, left axis) vs. Business sources mean news sentiment (green line, right axis)

As you can see, business sentiment figures look closer to the S&P500 data: they tend to move in the same direction.
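“Move in the same direction” can be quantified instead of judged by eye. Below is a sketch of the Pearson correlation coefficient written out by hand. The five rows are the ones printed in the joined table above, which is far too few for any conclusion, so the point is the calculation rather than the numbers. With the full dataframe you would more likely call sentiments_vs_snp.corr().

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First five rows of the joined table (22 to 26 June 2020)
all_sources = [0.219228, 0.115508, 0.084642, 0.155524, 0.124339]
business = [0.155876, 0.102921, 0.017751, 0.005206, 0.008645]
sp500_close = [3117.86, 3131.29, 3050.33, 3083.76, 3009.05]

print("all sources vs S&P500:", round(pearson(all_sources, sp500_close), 3))
print("business vs S&P500:   ", round(pearson(business, sp500_close), 3))
```

A natural refinement, which I have not tried here, would be to correlate day-to-day changes rather than levels, since two series that merely trend in the same direction can look strongly correlated.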

Conclusion

In this third part of the series, we tried to find out whether the stock market was influenced by the news.

You looked at the code that can get a set of news relevant to the stock market, determine the sentiment of each article and average it across all news, and finally compare the sentiment movement with the direction of the stock index. The same methodology can be applied to an individual company (one that has many mentions in the articles) and its share price. Say you observe a price jump and want to quickly understand the possible reasons for it: you can simply get the most polar articles (negative for a price decline, positive for a price increase) using the code above.

At the whole-market level, sentiment from the business sources appeared to track the S&P 500 index more closely than sentiment from all sources, at least visually over this one-month sample.

In the next part you will see whether the expectations of financial analysts matter a lot to a stock’s price. You will build a simple scraper to obtain the needed dataset from the Web and check the hypothesis of a strong bond between financial performance and the stock market.