Use Your Computer to Make Informed Decisions in Stock Trading: Practical Introduction — Part 5: Developing a Short Term Investment Strategy Based on Earnings-Per-Share (EPS) Data

Ivan Brigida
18 min read · Nov 27, 2020


Abandoned island Inishmurray, Ireland

Introduction

If you follow stock market news, you will encounter financial terms like revenue, net income, and earnings-per-share (EPS). They appear most often just after the quarterly and annual earnings calls. These metrics reflect a company's immediate operational results, and analysts use them to estimate the long-term profitability trend, which is ultimately reflected in a fair stock price.

In this article, we will take a closer look at the historical EPS for ~200 companies, selecting the most active stocks during the last trading day (which normally include the largest "blue chips" and some less popular companies). We get the stock price data around the dates of the quarterly reports and calculate the price increase or decrease just after the results announcement. The main idea is to check the long-term performance of predicted vs. actual EPS and its influence on the stock price.

In the course of this article, we'll try to find answers to the following questions:

  • Do all companies try to report very close to the estimates? How big does the Surprise(%) need to be to cause a shock in short-term stock prices?
  • If the surprise causes a rapid change in the stock price, does it persist for several days, and can we predict the direction of the movement (so that the knowledge can be used to make short-term investments)?
  • Does consistently reporting a slight positive surprise lead to long-term stock growth?
  • Does a gradual increase in EPS increase the valuation of a company?

The ultimate goal is to find a segment of stocks with a high probability of growth during the several days after the announcement of quarterly results.

This article is the fifth part of the series. The previous Part 4 was an introduction to the topic: it showed how to scrape one finance.yahoo.com page on earnings-per-share and concentrated on one reporting period (earnings reports in Aug'20, which covered Q2'20 revenues). This part is much more advanced: it takes the whole available history for the selected set of stocks and aims to find investment opportunities rather than perform simpler exploratory analysis.

Preparation Steps

As in the previous parts of this series, we will use a Google Colab notebook (colab.research.google.com) to implement the Python code needed for the research.

To start with, we'll need a standard set of analytical, scraping, and finance imports:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

Then, we create a set of tickers to be used in our analysis:

ADDITIONAL_TICKERS = ['CAT', 'SNAP', 'TSLA', 'GE', 'GOOG', 'FB', 'AMD', 'ATVI', 'SAP', 'EBAY']

Functions For Scraping

Since we're going to obtain the data for our analysis by scraping, we need some code for this. Reusing the approach from Part 4 on scraping a single page with one table and cleaning it up (we need to clean the missing values), we create the get_scraped_yahoo_finance_page() function that works as follows:

  • Takes URL and URL_PARAMs as input arguments
  • Finds the single table in the HTML page
  • Saves the table column names (there is one hack for the Most-active page, as one column is not visible)
  • Retrieves the values row-by-row
  • Returns a dataframe with all fields as objects (strings)

We also need to create clean_earnings_history_df(), which removes all rows with missing values ('-') and casts the EPS columns to float.

Here is the implementation of these functions:

# Example of a full URL with one stock symbol
# url = "https://finance.yahoo.com/calendar/earnings?symbol=ATVI"
def get_scraped_yahoo_finance_page(url, url_params):
    # url = "https://finance.yahoo.com/calendar/earnings"
    # url_params = {'symbol': ticker}
    r = requests.get(url, params=url_params)
    soup = BeautifulSoup(r.text)
    table = soup.find_all('table')
    # DEBUG: just 1 table found, which is good
    print("We've found tables:", len(table))
    if len(table) != 1:
        return None

    # Get all column names
    if len(soup.table.find_all('thead')) == 0:
        return None
    spans = soup.table.thead.find_all('span')
    columns = []
    for span in spans:
        columns.append(span.text)
    # Hack: for some reason one of the columns on the most-active page is not a <span> tag
    # and is not discovered in the columns list, so we add it manually
    if url.find('https://finance.yahoo.com/most-active') != -1:
        columns.insert(len(columns) - 2, "Market Cap")

    # Read the table row by row
    rows = soup.table.tbody.find_all('tr')
    stocks_df = pd.DataFrame(columns=columns)
    for row in rows:
        elems = row.find_all('td')
        dict_to_add = {}
        for i, elem in enumerate(elems):
            dict_to_add[columns[i]] = elem.text
        stocks_df = stocks_df.append(dict_to_add, ignore_index=True)
    return stocks_df

# The only record per stock that matches this pattern: the next earnings date
def get_next_earnings_records(stocks_df):
    filter1 = stocks_df['EPS Estimate'] != '-'
    filter2 = stocks_df['Surprise(%)'] == '-'
    filter3 = stocks_df['Reported EPS'] == '-'
    rez_df = stocks_df[filter1 & filter2 & filter3]
    return rez_df

# Remove all records with missing stats and cast values to float
def clean_earnings_history_df(stocks_df):
    filter1 = stocks_df['EPS Estimate'] != '-'
    filter2 = stocks_df['Surprise(%)'] != '-'
    filter3 = stocks_df['Reported EPS'] != '-'
    stocks_df_noMissing = stocks_df[filter1 & filter2 & filter3]
    stocks_df_noMissing['EPS Estimate'] = stocks_df_noMissing['EPS Estimate'].astype(float)
    stocks_df_noMissing['Reported EPS'] = stocks_df_noMissing['Reported EPS'].astype(float)
    stocks_df_noMissing['Surprise(%)'] = stocks_df_noMissing['Surprise(%)'].astype(float)
    return stocks_df_noMissing

Let’s now try out the above functions, getting the stats for F (Ford):

ticker = 'F'
f_df = get_scraped_yahoo_finance_page(url="https://finance.yahoo.com/calendar/earnings", url_params={'symbol': ticker})
f_df_clean = clean_earnings_history_df(f_df)

The cleaned result set shows that Ford had strong results recently:

Fig. 1 Ford company (F) last 5 EPS results
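To reproduce a view like Fig. 1, we can simply display the first rows of the cleaned dataframe. This is a minimal sketch; the selected column names are the ones used throughout this article and are assumed to be present in the scraped table:

# Peek at the latest EPS records for Ford
print(f_df_clean[['Symbol', 'Earnings Date', 'EPS Estimate', 'Reported EPS', 'Surprise(%)']].head())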

Getting the Top 200 Traded Stocks

Here we'll look at the 200 most active stocks during the last trading day (mid-Nov 2020). This list (together with the additional tickers selected earlier) will be used to get the historical EPS values and the future returns around the EPS reporting dates (to test the hypothesis that good or bad EPS predicts good returns). We will apply a regex-based approach to convert all text values to numbers (% values and the magnitudes M, B, T).

num_stocks = 200
most_active_stocks = get_scraped_yahoo_finance_page(url="https://finance.yahoo.com/most-active", url_params={'count': num_stocks})
Fig. 2 Top 10 stocks for the last trading day (20 Nov. 2020)

The problem is that all the values in the dataframe are objects (strings), not integers or floats, meaning you can't perform arithmetic operations on them.

Fig. 3 Scraped dataframe fields types
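A quick way to confirm this is to print the column dtypes; a minimal sketch:

# All scraped columns come back as the generic 'object' dtype (strings), not numbers
print(most_active_stocks.dtypes)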
import re

POWERS = {'T': 10 ** 12, 'B': 10 ** 9, 'M': 10 ** 6, '%': 0.01, '1': 1}

def convert_str_to_num(num_str):
    """Read a string (with an M/B/T/% suffix) and return the correct numeric value."""
    match = re.search(r"([0-9\.-]+)(M|B|T|%)?", num_str)
    if match is None:
        return None
    else:
        quantity = match.group(1)
        if match.group(2) is None:
            magnitude = '1'  # no letter at the end -> don't change the value
        else:
            magnitude = match.group(2)
        # print(quantity, magnitude)
        return float(quantity) * POWERS[magnitude]

We apply the above function column-by-column to the dataframe to convert its object values to numbers:

columns_to_apply = ['Price (Intraday)', 'Change', '% Change', 'Volume', 'Avg Vol (3 month)', 'Market Cap',
                    'PE Ratio (TTM)']
for col in columns_to_apply:
    most_active_stocks[col] = most_active_stocks[col].apply(convert_str_to_num)

As a result, we can apply some math to the dataframe:

most_active_stocks["log_market_cap"] = np.log10(most_active_stocks["Market Cap"])

And build a histogram:

Fig.4 Most traded stocks are large on Market Cap: $B (10⁹) to $T (10¹²)
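The histogram in Fig. 4 can be produced with a one-liner like the following (a minimal sketch; the bin count is an arbitrary choice):

# Distribution of log10(Market Cap) across the most active stocks
most_active_stocks["log_market_cap"].hist(bins=30)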

Let's now divide the active stocks into 3 equally sized groups:

most_active_stocks["log_market_cap_binned"] = pd.qcut(most_active_stocks.log_market_cap, 3)

As a result, we have 3 approximately equal groups of 66–67 stocks each:

Fig.5 Three equal size clusters of the most traded stocks
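The bin sizes shown in Fig. 5 can be verified by counting the stocks per bin (a small sketch):

# Roughly 66-67 stocks should fall into each of the three quantile bins
print(most_active_stocks["log_market_cap_binned"].value_counts())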

We can now find average values for these groups:
1) % Change: the largest stocks changed by -0.4% on average, while the smaller stocks gained 1.3%
2) Total volume is comparable across the clusters (the same power of 10), but market cap differs 5–10x between them (6.4*10⁹ vs. 2.7*10¹⁰ vs. 2.17*10¹¹)
3) P/E Ratio (where filled) is high for the largest stocks (78), moderate for the middle group (51), and highest for the smallest group (104)

Fig6. Average values for 3 clusters of stocks by market cap
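The averages in Fig. 6 can be reproduced with a simple groupby; a minimal sketch, assuming we aggregate the columns discussed above:

# Mean % Change, Volume, Market Cap and P/E for each market-cap cluster
cluster_stats = most_active_stocks.groupby("log_market_cap_binned")[
    ["% Change", "Volume", "Market Cap", "PE Ratio (TTM)"]
].mean()
print(cluster_stats)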

You might also want to remove outliers: the top x% from each side. This can be implemented with the following function:

def remove_outliers(df, column_name, quantile_threshold):
    q_low = df[column_name].quantile(quantile_threshold)
    q_hi = df[column_name].quantile(1 - quantile_threshold)
    rez = df[(df[column_name] < q_hi) & (df[column_name] > q_low)]
    return rez

And then put it in use as follows:

tmp = remove_outliers(df=most_active_stocks, column_name="% Change", quantile_threshold=0.02)
tmp["% Change"].hist(bins=100)
Fig.7 One-day % Change is mostly between -5% and 5% (can differ on other trading days)
# The difference vs. the previous graph: here we plot the absolute daily change
# instead of the relative change in %
tmp = remove_outliers(df=most_active_stocks, column_name="Change", quantile_threshold=0.02)
tmp["Change"].hist(bins=100)
Fig.8 Abs. daily Change is concentrated around 0 in normal trading days
# How heavily was the stock traded today vs. its 3-month average volume?
most_active_stocks["relative_volume"] = most_active_stocks["Volume"] / most_active_stocks["Avg Vol (3 month)"]

Let’s now summarise what we have so far, listing our findings:

[[RESULT 1]]: we split the companies into medium, large, and largest (by log_market_cap_binned). In most cases the first-day change lies between -5% and +5%, while medium and small companies can drop or rise even further, to -10%/+10%. It is just one trading day, but it gives a general sense of how much a stock can change its value in one day. On volume, the largest companies are rarely traded at more than 3x their 3-month average, while smaller companies can reach 5–6x volume.

Let’s build a diagram to illustrate the point:

import seaborn as sns

sns.set(rc={'figure.figsize': (10, 6)})
sns.kdeplot(
    data=most_active_stocks, x="% Change", y="relative_volume", hue="log_market_cap_binned", fill=True,
)
Fig.9 One day % Change vs. relative_volume (1 day volume/3-months avg. volume)

The lower the market cap of the stock (light blue colour), the bigger the potential % Change (the spread of the blue shape along the horizontal axis) and the relative volume (its spread along the vertical axis).

[[RESULT 2]] The medium and large companies tend to have a wider range of both negative and positive returns and form a 'wider' bell, with a higher standard deviation from the mean. As an investor, you can earn more with smaller companies, but you also risk more. So the return-per-risk ratio can go either way across the medium, large, and largest stocks.

sns.set(rc={'figure.figsize': (10, 6)})
sns.kdeplot(
    data=most_active_stocks, x="% Change", hue="log_market_cap_binned",
    fill=True, common_norm=False,
    # palette="rocket",
    alpha=.5,
    linewidth=0,
)
Fig.10 Histogram of 1-day %Change for three classes of stocks

Getting a Dataframe with the Whole EPS History for the Most Traded Stocks

We now download the whole available EPS (earnings-per-share) history for the most traded stocks:

Fig.11 200 most active stocks for Friday, 20 Nov. 2020

There are also the tickers from the ADDITIONAL_TICKERS list defined at the beginning of this article; these tickers might not appear in the most traded stocks list:

NEW_TICKERS = [x for x in ADDITIONAL_TICKERS if x not in set(most_active_stocks.Symbol)]
TICKERS_LIST = most_active_stocks.Symbol.append(pd.Series(NEW_TICKERS))

In the following code, we scrape info for each ticker from https://finance.yahoo.com/calendar/earnings

from random import randint
from time import sleep

# Start with an empty dataframe
all_tickers_info = pd.DataFrame({'A': []})
for i, ticker in enumerate(TICKERS_LIST):
    current_ticker_info = get_scraped_yahoo_finance_page(url="https://finance.yahoo.com/calendar/earnings", url_params={'symbol': ticker})
    print(f'Finished with ticker {ticker}, record no {i}')
    if all_tickers_info.empty:
        all_tickers_info = current_ticker_info
    else:
        all_tickers_info = pd.concat([all_tickers_info, current_ticker_info], ignore_index=True)
    # Random sleep of 1-3 seconds between requests
    sleep(randint(1, 3))

Before proceeding, we need to calculate the closest future earnings dates:

next_earnings_dates = get_next_earnings_records(all_tickers_info)

We know when the next earnings date is for each ticker, and we want to know what to expect from the upcoming earnings announcements:

next_earnings_dates.sort_values(by='Earnings Date').tail(30)
Fig.12 Next reporting dates are only in January 2021

In the next step, we remove all cells with missing values (‘-’):

all_tickers_info_clean = clean_earnings_history_df(all_tickers_info)
Fig.13 All tickers all EPS history

Now let’s check what we have at the aggregate level:

Fig.14 Aggregate stats for EPS
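The aggregate stats in Fig. 14 can be obtained with pandas' describe(); a minimal sketch:

# Summary statistics (mean, std, quartiles) for the EPS-related columns
print(all_tickers_info_clean[['EPS Estimate', 'Reported EPS', 'Surprise(%)']].describe())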

[[RESULT 3]]: The average EPS estimate is only 0.52, which is not far from the actual value of 0.50, i.e. only a 1.2% surprise in the average case. The standard deviation of the Surprise is 386%, which tells us there are many outliers in the dataset:
- stocks with small EPS (<0.13) tend to report slightly below the estimate (-1% surprise),
- medium ones (EPS around 0.38) report slightly above the estimate (about 4% higher),
- the highest quantile (EPS>0.75) reports about 13% above the estimate on average.
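The tercile breakdown above can be sanity-checked with a quick grouping. This is only a sketch of one possible way to do it; splitting on Reported EPS with pd.qcut is an assumption, not necessarily how the original numbers were produced:

# Split stocks into three equal-sized groups by Reported EPS and compare the average Surprise(%)
eps_tercile = pd.qcut(all_tickers_info_clean['Reported EPS'], 3)
print(all_tickers_info_clean.groupby(eps_tercile)['Surprise(%)'].mean())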

Fig.15 Actual EPS figures are close to the estimates during the whole history

Probably (to be checked), these "normal" points, where the EPS Estimate is close to the Reported EPS, won't give us outstanding returns, since everything happens as predicted. The box plot below shows that there are many outliers with highly positive or negative EPS Estimate/Reported EPS values.

all_tickers_info_clean[['EPS Estimate', 'Reported EPS']].plot.box()
Fig. 16 EPS Estimate/Reported EPS have many outliers in both positive and negative directions

If we remove the 360 entries (out of 12,636) with extreme values (absolute z-score > 2), we get a better view of the box plot: fewer outliers remain in both directions:

from scipy import stats

# Calculate z-scores of the EPS columns
z_scores = stats.zscore(all_tickers_info_clean[['EPS Estimate', 'Reported EPS']])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 2).all(axis=1)
new_df = all_tickers_info_clean[['EPS Estimate', 'Reported EPS']][filtered_entries]
new_df.plot.box()
Fig.17 Box Plot with removed outliers

Now we need to convert dates in the dataframe to simplify further analysis:

from datetime import datetime
from datetime import timedelta

all_tickers_info_clean['Earnings Date 2'] = all_tickers_info_clean['Earnings Date'].apply(
    lambda x: datetime.strptime(x[:-3], '%b %d, %Y, %H %p'))

Next, we generate the PRIMARY KEY (PK) to be used in merge operations later, built from the string of the earnings date (without the time) and the ticker symbol.

all_tickers_info_clean['PK'] = all_tickers_info_clean.Symbol + "|" + all_tickers_info_clean["Earnings Date 2"].apply(lambda x: x.strftime('%Y-%m-%d'))
Fig.18 Historical EPS for top traded stocks and the Primary Key (PK)

Getting All Available History of Stock Prices for the Selected Tickers

If you’re in Colab, start with installing yfinance in your notebook:

!pip install yfinance
import yfinance as yf

In the following code, we generate a table with daily prices and future returns (in 1–7, 30, 90, and 365 days):

# Start from an empty dataframe
df_stocks_prices = pd.DataFrame({'A': []})

# Download the whole history of stock prices and calculate the future returns for 1-7, 30, 90 and 365 days.
# That is: if we buy a stock at some date (e.g. on a high EPS), is it going to be a profitable decision?
for i, ticker in enumerate(TICKERS_LIST):
    yf_ticker = yf.Ticker(ticker)
    historyPrices = yf_ticker.history(period='max')
    historyPrices['Ticker'] = ticker
    # Sometimes there is a problem with the .index value -> use try
    # https://stackoverflow.com/questions/610883/how-to-know-if-an-object-has-an-attribute-in-python
    try:
        historyPrices['Year'] = historyPrices.index.year
        historyPrices['Month'] = historyPrices.index.month
        historyPrices['Weekday'] = historyPrices.index.weekday
        historyPrices['Date'] = historyPrices.index.date
    except AttributeError:
        print(historyPrices.index)
    # !!! Important: historyPrices['Close'].shift(1) is the Close price 1 day BEFORE the current date
    # !!! Important: historyPrices['Close'].shift(-d) is the Close price d days AFTER the current date
    # Dividing the second by the first gives the return from holding the stock for d days,
    # having bought it the day before the financial reporting occurred
    for d in [1, 2, 3, 4, 5, 6, 7, 30, 90, 365]:
        historyPrices['r_future' + str(d)] = np.log(historyPrices['Close'].shift(-d) / historyPrices['Close'].shift(1))
    historyPrices['years_from_now'] = historyPrices['Year'].max() - historyPrices['Year']
    historyPrices['ln_volume'] = np.log(historyPrices['Volume'])
    if df_stocks_prices.empty:
        df_stocks_prices = historyPrices
    else:
        df_stocks_prices = pd.concat([df_stocks_prices, historyPrices], ignore_index=True)

We generate the same PRIMARY KEY (<Symbol|Date>) to (inner) join with the EPS dataframe:

df_stocks_prices["PK"] = df_stocks_prices.Ticker + "|" + df_stocks_prices["Date"].apply(lambda x: x.strftime('%Y-%m-%d'))

We’ve generated a lot of daily stats (1.2M records for 200 stocks) for the financial performance of the selected stocks.
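A quick sanity check on the size of the price dataset (a minimal sketch):

# Roughly 1.2M daily rows across ~200 tickers are expected
print(df_stocks_prices.shape)
print(df_stocks_prices.Ticker.nunique(), 'tickers')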

Let's now look at an example of how r_future1 and r_future2 are calculated for one stock. For that:
• let's select the second row, for the date 2020-10-02: GE's Close price on 2020-10-01 was 6.24, on 10-05 (the next trading day after 10-02) it was 6.41, and on 10-06 it was 6.17
• r_future1 = log(6.41 / 6.24) = log(1.027) ≈ 0.027, i.e. approximately a 2.7% return from buying GE on Oct 1 and selling it on Oct 5 (1 trading day after the reporting date)
• r_future2 = log(6.17 / 6.24) = log(0.988) ≈ -0.011, i.e. approximately a -1.1% return (a loss) from buying GE on Oct 1 and selling it on Oct 6 (2 trading days after the reporting date)

filter1 = df_stocks_prices.Ticker == 'GE'
filter2 = df_stocks_prices.Year == 2020
filter3 = df_stocks_prices.Month == 10
df_stocks_prices[filter1 & filter2 & filter3].head(2)
Fig 19. Truncated (by rows) dataset of daily stock prices and their future returns
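As a quick sanity check of the arithmetic above, the future returns can be recomputed by hand from the quoted Close prices (a minimal sketch using the GE example):

# Manual check of the GE example: buy at the 2020-10-01 close, sell 1 or 2 trading days after 10-02
r_future1 = np.log(6.41 / 6.24)  # ~0.027, i.e. about +2.7%
r_future2 = np.log(6.17 / 6.24)  # ~-0.011, i.e. about -1.1%
print(r_future1, r_future2)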

We still may have duplicates in the all_tickers_info_clean dataset due to duplicate records on the original website:

all_tickers_info_clean = all_tickers_info_clean.drop_duplicates(subset=['PK'], keep='first')

We also need to remove duplicates from df_stocks_prices in case we have them:

df_stocks_prices = df_stocks_prices.drop_duplicates(subset=['PK'], keep='first')

Now we can try to merge these dataframes. We use one-to-one validation to make sure there are no duplicates:

merged_df = pd.merge(all_tickers_info_clean, df_stocks_prices, on="PK", validate="one_to_one")

The resulting dataframe should have the following structure:

The merged dataframe is back to a manageable size of ~12k rows, because we keep only the records for the financial reporting days, reducing the volume roughly 100x: from 1.2M to 12k rows.
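A quick check that the merge behaved as expected (a minimal sketch):

# The merged dataframe should contain roughly 12k rows: one per (ticker, reporting date) pair
print(merged_df.shape)
print(merged_df[['Symbol', 'EPS Estimate', 'Reported EPS', 'Surprise(%)', 'r_future1', 'r_future7']].head())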

Individual Stocks Examples: Recent Spikes in Q3 and Q2 Reports

In particular, we'll cover the following tickers, taking the respective indicators from the merged_df dataframe:

[GE] shares jump after company posts surprise adjusted third-quarter profit, revenue tops expectations (https://www.cnbc.com/2020/10/28/general-electric-ge-earnings-q3-2020.html)

[MSFT] Microsoft’s stock rises after company reports 15% sales jump and says coronavirus had ‘minimal’ impact on revenue (https://www.cnbc.com/2020/04/29/microsoft-msft-earnings-q3-2020.html#:~:text=Earnings%3A%20%241.40%20per%20share%2C%20adjusted,Revenue%3A%20%2435.02%20billion)

To perform the analysis, we’ll use the following function:

def draw_plot(symbol):
    filter = (merged_df.Symbol == symbol) & (merged_df.Year >= 2010)
    df = merged_df[filter][["EPS Estimate", "Reported EPS", "Surprise(%)", "r_future1", "r_future7", "Date"]]
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified
        print(df.head(15))

    # Graph 1: EPS Estimate vs. Reported EPS
    df[["EPS Estimate", "Reported EPS", "Date"]].plot.line(x="Date", figsize=(20, 6), title="EPS Estimate vs. Reported")

    # Graph 2: Surprise in %
    df[["Surprise(%)", "Date"]].plot.line(x="Date", figsize=(20, 6), title="Surprise % (Reported EPS vs. EPS Estimate)")

    # Graph 3: 1- and 7-day returns
    df[["r_future1", "r_future7", "Date"]].sort_values(by='Date').plot(x="Date", kind='bar', figsize=(20, 6), title="Stock jump")

Let’s take a glimpse at GE stock:

  • In most of the periods the Reported EPS is higher than the predicted EPS
  • Four data points (quarters) show a negative Surprise; the first one did not cause a big shock to the stock price, but the others caused 10–20% dips
  • The last report (Q3'20) showed positive dynamics, moving EPS from negative to positive values, which resulted in a 3.7% rise on the first day and a 13% rise over 7 days
  • The last quarter's results suggest the following investment idea: if the Surprise (actual EPS vs. predicted EPS) is positive and large, the stock can jump in one day (r_future1) and continue growing for the whole week after that (r_future7). An investor can monitor such occasions and buy a stock just after a very successful reporting date, aiming to sell it within a short period of time.
draw_plot("GE")
Fig.20 GE stock: EPS Estimate vs. Reported EPS
Fig.21 GE stock: Surprise %
Fig.22 GE stock: Stock price jump 1 and 7 days after the reporting date
draw_plot("MSFT")

Let’s look at MSFT stock:

  • Microsoft (MSFT) tends to have actual EPS that differs from the estimates by -10% to +20%
  • There was one date at the end of 2017 with a 40% surprise, which did cause a positive spike in returns, although r1 and r7 are not very different (probably because the absolute EPS value was low, <0.01)
  • Over the last 6 quarters the stock showed a gradually rising EPS trend, with r_future7 always higher than r_future1, which means MSFT was a good investment opportunity in every quarter of the last 1.5 years
  • Investment idea: try to find stocks with growing EPS over a 1–2 year period and invest in them around the reporting day.
Fig.23 MSFT stock: EPS Estimate vs. Reported EPS
Fig.24 MSFT stock: Surprise %
Fig.25 MSFT stock: Stock price jump 1 and 7 days after the reporting date

Scaled Analysis

In this section, we’ll examine the aggregated statistics across many years of data and different dimensions to group by.

The first example shows the 1- and 7-day returns for all stocks, grouped by year:

import matplotlib.pyplot as plt

print('Count observations: ', merged_df.groupby(by='Year').count()['r_future1'])
ax = merged_df.groupby(by='Year').mean()[['r_future1', 'r_future7']].plot.line(figsize=(20, 6))
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.1%}'.format(x) for x in vals])
plt.axhline(y=0, color='r', linestyle='-')
plt.title("1 and 7 days returns of stocks after the quarterly earnings results announcement")
Fig.23 Aggregated r_future1 and r_future7

[[RESULT 4]] For many years (but not always!) the expected return after 7 days is higher than after 1 day from the reporting date. This means that the individual trends we've seen earlier for MSFT and GE tend to generalise to the larger dataset, but only within successful "bullish" years (when r_future1 > 0 on average).

Parametrising the Function (with Filtering and Groupby Conditions)

Here we create a parametrised version of the previous graph, in which you can select a feature to group by and a condition to filter on.

from matplotlib.ticker import MaxNLocator

def draw_returns(groupby_factor, filter):
    filter_year = merged_df.Year >= 2000
    print('Count observations: ', merged_df[filter & filter_year].groupby(by=groupby_factor).count()['r_future1'])
    ax = merged_df[filter & filter_year].groupby(by=groupby_factor).mean()[['r_future1', 'r_future7']].plot.line(figsize=(20, 6))
    vals = ax.get_yticks()
    ax.set_yticklabels(['{:,.0%}'.format(x) for x in vals])
    plt.axhline(y=0, color='r', linestyle='-')
    plt.title("1 and 7 days returns of stocks after the quarterly earnings results announcement")
    if groupby_factor == 'Year':
        ax.xaxis.set_major_locator(MaxNLocator(integer=True))

[[RESULT 5]] We try the parametrised approach, plotting the returns r1 and r7 for different classes of stocks (EPS<-1, EPS<0, EPS>0, EPS>1, EPS>2, etc.). We would like to find that one line (r7) is always higher than the other (r1), but unfortunately we cannot show that this holds in general.

draw_returns('Year', merged_df["Reported EPS"] < 0)
draw_returns('Year', merged_df["Reported EPS"] < -1)
draw_returns('Year', merged_df["Reported EPS"] > 0)
draw_returns('Year', merged_df["Reported EPS"] > 1)
draw_returns('Year', merged_df["Reported EPS"] > 2)
Fig.24 EPS less than 0
Fig.25 EPS less than -1
Fig.26 EPS greater than 0
Fig.27 EPS greater than 1
Fig.28 EPS greater than 2

Slicing the Data on the Volume of Trade

You might also want to slice the data by the volume of trade / the size of the company and see if there is some spectacular behaviour in any of the subgroups.

merged_df.ln_volume.replace([np.inf, -np.inf], np.nan).hist(bins=100)
Fig.29 Ln daily volume of trade distribution

We create 10 equal size bins:

merged_df["ln_volume_binned"] = pd.qcut(merged_df["ln_volume"], 10)
merged_df.ln_volume_binned.value_counts()
(17.674, 21.613] 1226
(6.396, 14.344] 1226
(17.111, 17.674] 1225
(16.74, 17.111] 1225
(16.436, 16.74] 1225
(16.157, 16.436] 1225
(15.85, 16.157] 1225
(15.49, 15.85] 1225
(15.037, 15.49] 1225
(14.344, 15.037] 1225
Name: ln_volume_binned, dtype: int64

Now we can use the same parametrised approach, slicing the data on ln_volume_binned and selecting different filters, e.g. merged_df["Year"]>2000 (Fig.30), merged_df["Year"]==2020 (Fig.31), and (merged_df["Year"]==2020) & (merged_df["Reported EPS"]<0) (Fig.32):

draw_returns('ln_volume_binned', merged_df["Year"] > 2000)
draw_returns('ln_volume_binned', merged_df["Year"] == 2020)
draw_returns('ln_volume_binned', (merged_df["Year"] == 2020) & (merged_df["Reported EPS"] < 0))
Fig 30. r1_future and r7_future returns split by volume of trade for year>2000
Fig 31. R1 and R7 returns split by volume of trade for year==2020
Fig 32. R1 and R7 returns split by volume of trade, year==2020, and reported_EPS<0

The first graph shows that stocks with a large volume of trade had negative returns on average over the period 2000–2020. In 2020 the trend reversed: the heavily traded stocks are positive in returns and r7 is higher than r1. If we look deeper into 2020 and add the condition EPS<0, we see an even larger positive gap between the r7 and r1 returns (which is good if you buy on day 1 and sell on day 7).

[[RESULT 6]] We tried to split the stocks by the volume of trade, year, and EPS. There are no universal trends (where the r7 line always lies above the r1 line), but a few (weaker) observations persist. In 2020, stocks with a high volume of trade tend to have positive short-term returns (contrary to the years before), and stocks with a negative EPS tend to return more quickly to their previous (pre-reporting) prices and grow beyond them.

Analysing the Surprise % Value

The Surprise is largely concentrated around 0, as many companies aim to report a value very close to the estimate:

merged_df["surprise_%_binned"] = pd.qcut(merged_df["Surprise(%)"], 10)
merged_df["surprise_%_binned"].sort_values().value_counts()
(-31360.231, -17.356] 1226
(-17.356, -3.39] 1226
(-3.39, 0.27] 1229
(0.27, 1.87] 1221
(1.87, 3.73] 1226
(3.73, 6.446] 1223
(6.446, 10.677] 1225
(10.677, 17.52] 1226
(17.52, 36.469] 1224
(36.469, 6900.0] 1226
Name: surprise_%_binned, dtype: int64

Let’s draw the returns:

draw_returns('surprise_%_binned', True)
Fig.33 R1 and R7 returns split by the Surprise%

[[RESULT 7]] The general rule is that if the Surprise(%) is negative, then r1 and r7 are negative too; if the surprise is positive, r1 and r7 are positive too. It is hard to use the Surprise(%) factor alone as an investment idea, as the r1 and r7 lines lie close to each other. They seem to move apart only for highly positive Surprise(%) > 17. Such cases make up about 20% of observations (the two highest bins out of ten), which means an investor needs to monitor a lot of earnings report dates to catch the most positive cases.

Conclusion

We've used different free data sources to create a dataframe with stats about the top 200 largest (by size and volume of trade) companies traded on the US stock market.

The article demonstrates a set of techniques to scrape data from the web, clean and transform it, join different datasets on a primary key, generate new features, and visualise the results in a parametrised manner.

The research findings (from the [[RESULT 1]]–[[RESULT 7]] paragraphs above) are:

  • The 1-day change in a stock's price around the reporting date typically lies in the interval [-5%, +5%]. Smaller stocks tend to have larger jumps (and potential profits for an investor), but they bear larger risk as well;
  • The average reported EPS is $0.52, which is close to the estimated EPS of $0.50. Most observations show only a few percent difference between these values (Surprise(%) < 2% in many cases). There are many outliers in the dataset with extremely high/low absolute values of EPS or Surprise(%);
  • There are some long-term trends for individual stocks (like GE and MSFT), where a gradual increase in EPS makes investors happy and lifts the r1 and r7 returns for several quarters in a row. These trends generalise to the "successful" years, when average r_future1 > 0 and average r_future7 > 0;
  • It is hard to find a consistent cluster of stocks (with some conditions on EPS, Surprise, year, volume, etc.) that delivers risk-less profit (returns after 7 days consistently higher than after 1 day: r7 > r1);
  • There are certain thresholds around ±10% for Surprise(%) that trigger massive sell or buy behaviour from investors and move prices to the negative or positive side. If the Surprise(%) is greater than 17% (~20% of all data points over 20 years), the average difference between r7 and r1 is positive: investors continue to buy a rising stock not only during the one day after the announcement, but over a longer period of time (one week or more).


Ivan Brigida

Data and Product Analyst at Google. I run a website PythonInvest.com on Python programming, analytics, and stock investing.