Use Your Computer to Make Informed Decisions in Stock Trading: Practical Introduction — Part 13: Leveraging OpenAI’s API for Financial News Summarization
[HINT] You can read my previous articles on Medium or on the website PythonInvest (including ALL weekly releases of AI-generated news summaries from this algorithm). The complete Python code (Colab notebooks) can be found on Github.
In the world of stock markets and broad indexes like the S&P 500, casual investors face challenges due to limited time and biassed news coverage. To address this, we propose a data-driven approach that focuses on stocks with over 10 news stories in a week, helping investors identify crucial events and trends. By analysing current market sentiment and predicting future developments, readers can make more informed investment decisions. Additionally, we explore ‘star’ companies with high media attention, allowing investors to learn beyond their professional bubble and discover potential hidden gems for growth.
Summary of Results
01. Prioritising High-Media Attention Stocks for Quality News Coverage
Our data-driven approach targets stocks with over 10 weekly news stories, empowering investors to identify key market events and trends efficiently.
02. Generating daily and weekly market summaries from 300-2000 news
Producing bullet-point summaries using few-shots learning summarisation.
03.Using OpenAI's ChatGPT
Utilising OpenAI’s API documentation for optimal summary: selecting roles and prompts, measuring execution time, and total running cost.
04.Comparing GPT3.5-Turbo and GPT4
GPT3.5-Turbo-16k is 10x cheaper, allows 2x more input text, but probably provides worse results
05. Uncertainty remains
Getting an accurate summary is tough because we can’t be certain of its quality, coverage of important news, or how to compare it to other summaries generated by various models and prompts, whether they have length constraints or not. Nonetheless, our approach consistently produces promising results and usually includes all major news.
The HighLevel Approach
Let’s review the high-level plan:
- Create a Polygon.AI account for news access (paid with a FREE tier).
- Establish an OpenAI account (charges apply per API call, including advanced GPT-4 models).
- Secure API keys using environment variables.
Google Drive Integration (APIs secrets):
- Utilise your Google Drive in tandem with Colab.
- Register 2 API secret codes stored in Drive for Colab’s use.
- Address authentication warnings and access notifications.
News Retrieval and Analysis:
- Test Polygon.AI’s API for financial news (up to 5000 weekly items).
- Craft a function for 5 API calls, data parsing, correction, and feature extraction.
- Categorise news into market and individual segments.
OpenAI API Exploration:
- Develop an OpenAI API wrapper handling consecutive calls and errors.
- Interpret system messages for summarization and few-shot summaries.
Automation and Execution:
- Execute Colab daily for freshest news data (latest 5k).
- Implement special Monday execution for weekly summary.
- Manage cost by restricting weekly summary extraction to Mondays (up to $2 per full Colab run).
Polygon.io News API
Let’s start by checking out the API’s website (Ticker News endpoint) and trying out some test calls directly from there. This will help us learn how to put together the full web request, including all the details we need. Our goal is to get the most up-to-date financial news in one go without any overlap from different calls. You should end up with something like this (https://api.polygon.io/v2/reference/news?order=desc&limit=1000&sort=published_utc&published_utc.lte=2023-07-07T00:00:00-04:00&apiKey=<your key>)
Each piece of news from the API comes with a bunch of information. We’re interested in a few parts: the title, when it was published, the tickers (like symbols for stocks), and a short description. We’ll focus on these details because they give us a good overview of what the news is about, without having to read the whole article.
When we look at news, we see that some of it is about individual companies, and some is about the whole market or multiple companies. We can tell by looking at the tickers. If there’s only one ticker, it’s usually about a single company, which is usually very accurate news. But if there’s more than one ticker, it’s more about the general market. There are a few important tickers like SPY, QQQ, and DJI that might seem like individual companies, but they’re actually about the whole market.
The places where we get news from are quite limited. The main ones are Zacks Investment Research (1.9k news per week), GlobeNewswire Inc (1k news per week), The Motley Fool (0.8k news per week), and Benzinga (0.8k news per week).
Let’s take an example news piece :
Let’s prepare our first input (only 4 fields are used: time, tickers, title, description) for the summarisation call, which concerns one stock, ‘CRM,’ and just four news articles. Please read it carefully, as it will be used for the initial summarization later. Note that there may be a lot of meaningless words and grammar issues, and we never specify the structure of the input; it is simply unstructured text separated by ‘|’ between fields and by commas between news articles:
“2023–08–28T13:30:06Z| [‘CRM’]| Should You Invest in Salesforce.com (CRM) Based on Bullish Wall Street Views?| According to the average brokerage recommendation (ABR), one should invest in Salesforce.com (CRM). It is debatable whether this highly sought-after metric is effective because Wall Street analysts’ recommendations tend to be overly optimistic. Would it be worth investing in the stock?”,
(three more news — full text below with the Prompt)
The last thing we need to do is to split the text into several chunks if it’s too long. Let’s perform a simple calculation: GPT-3.5Turbo-16k can accept up to 16,000 tokens as an input, which is approximately equivalent to 12,000 English words (as stated on openai.com/pricing: ‘For English text, 1 token is approximately 4 characters or 0.75 words’).
Given that we need to generate summaries, which can sometimes be quite lengthy, we should aim for a maximum input of 9,000 English words. This translates to about 18 pages of text.
The GPT-4–8k model is half the size, so we would limit the input to only 4,500 words or 9 pages of text.
Typically, we have no more than 50 news articles per stock, and their summaries can be generated in a single API call. However, on a given day, there may be between 230 to 260 news articles, requiring multiple calls. For a weekly summary (comprising 2,200 to 2,400 news articles), it could necessitate 15 to 20 calls to the GPT-3.5Turbo-16k model.
To prepare these inputs, we divide the text into ‘chunks’ that are sequentially processed in the API calls. Through empirical testing, I’ve found that a maximum of 6,000 words (approximately 100 news articles) works well for the GPT-3.5Turbo-16k model, while 3,000 words is suitable for the GPT4–8k model. While this may not be the strict optimal threshold, it has proven effective in my experience, and the code does not encounter crashes during execution.
It’s important to note that when using the OpenAI API to generate summaries, the tokens used in those summaries also count towards your token usage quota. Tokens are the fundamental units of text processing that both the input text and the generated output consist of.
For instance, when you provide a news article to the API for summarization, the tokens in the original article count towards your token usage. However, the tokens generated in the summary by the API are also counted separately. This means that not only do you need to consider the length of the input text but also the length of the generated summary when calculating your token consumption.
Therefore, if you have a limited token budget or a specific token usage quota, it’s essential to factor in both the input text and the expected length of the summaries you wish to generate. This consideration ensures that you can effectively manage and optimise your token usage when working with the OpenAI API.
Here is an example of producing chunks of input data for a one-week run. The numbers in brackets indicate the places where we define the border between two adjacent chunks. One chunk is one OpenAI’s API call with close to maximum tokens used:
Chunks for market summary last day: [0, 96, 222]
Chunks for market summary last day for GPT-4: [0, 38, 96, 155, 222]
Chunks for market summary last week: [0, 96, 222, 367, 483, 608, 681, 809, 931, 1034, 1145, 1292, 1402, 1504, 1625, 1730, 1851, 1988, 2107]
OpenAI’s ChatGPT API
Let’s move on from the previous passage when we have a number of lines of semi-structured text, which we need to summarise. The important concept to consider is this:
Before you dive into coding, here are some thoughts:
- Read this introductory article about the ChatGPT API. It covers models like GPT-3.5-Turbo and GPT-4, the roles to interact with the model (system, user, assistant), sample code for calling the API, and a few key learning concepts.
- Check out the recent announcement (July 6, 2023) regarding the general availability of the GPT-4 API. However, please note that access to the GPT-4–32k API is not available until further notice. This means we’ll need to make at least twice as many calls to the GPT-4–8k model compared to the GPT-3.5-Turbo-16k model.
- An important factor to consider is pricing, which is still relatively high, especially for GPT-4. As of August 28, 2023:
- $0.003 per 1,000 tokens for the GPT-3.5-Turbo-16k model
- $0.03 per 1,000 tokens for GPT-4–8k. GPT-4–8k is 10 times more expensive to call. This is precisely why we will use GPT-4 only to produce a daily summary of news, which typically involves around 10 times fewer inputs (usually around 300 news) than a weekly analysis.
- Another limiting factor to consider is the maximum capacity of tokens used. It’s 16,000 tokens for GPT-3.5-Turbo-16k (equivalent to about 20 pages of text) and 8,000 tokens for GPT-4–8k. This means we need to make twice as many calls with GPT-4 to summarise the same lengthy text. Given that GPT-4 is still 10 times more expensive, the overall cost of using GPT-4 for everything is 20 times higher than using GPT-3.5-Turbo-16k. This is why we primarily use GPT-3.5-Turbo-16k, even though the quality of summarization appears to be higher with GPT-4–8k.
- The total cost of running a weekly newsletter (approximately 25 calls with near-maximum inputs on GPT-3.5-Turbo-16k) is about $1.2 per run. If I add three calls of GPT-4–8k, I will end up around $2 per run of analysis.
- After experiencing several timeouts and other API error messages, I’ve copied an API wrapper function [COLAB LINK] (from here) with OpenAI’s error handling and repeated call attempts when the service is not available. As you might have seen earlier for Polygon.AI’s API, we need to deal with API errors, timeouts, and repetitive calls to make the code robust and error-free.”
Here is what I obtained by requesting a text summary of four news articles on the stock CRM:
“Summarize the news:
2023–08–28T13:30:06Z| [‘CRM’]| Should You Invest in Salesforce.com (CRM) Based on Bullish Wall Street Views?| According to the average brokerage recommendation (ABR), one should invest in Salesforce.com (CRM). It is debatable whether this highly sought-after metric is effective because Wall Street analysts’ recommendations tend to be overly optimistic. Would it be worth investing in the stock?;
“2023–08–24T18:03:58Z| [‘CRM’]| What To Expect From Salesforce Q2? Analyst Sees Mixed Business Trends| Oppenheimer\xa0analyst Brian Schwartz reiterated an Outperform rating on\xa0Salesforce, Inc.\xa0(NYSE: CRM) with a\xa0price target of $235.\nThe earnings risk for CRM weighs slightly positive ahead of 2Q results despite his recent field checks pointing to mixed business trends for Salesforce and with investor expectations at higher levels for this earnings report.\xa0\nPositively, the analyst detected improvement in the demand environment and possibly stabilizing trends in the sales operations for Salesforce in 2Q.\xa0\nAlso, the estimated bar is low for Salesforce this quarter. Schwartz thinks there are enough positives in the business trends and bookings performance detected in his field checks, Marketing Cloud customer survey, the set-ups, and read-throughs from company …Full story available on Benzinga.com”,
“2023–08–22T21:45:13Z| [‘CRM’]| Salesforce.com (CRM) Dips More Than Broader Markets: What You Should Know| In the latest trading session, Salesforce.com (CRM) closed at $206.76, marking a -1.09% move from the previous day.”,
“2023–08–21T13:00:11Z| [‘CRM’]| Investors Heavily Search Salesforce Inc. (CRM): Here is What You Need to Know| Salesforce.com (CRM) has received quite a bit of attention from Zacks.com users lately. Therefore, it is wise to be aware of the facts that can impact the stock’s prospects.”
Salesforce.com (CRM) is receiving bullish views from Wall Street analysts, with the average brokerage recommendation suggesting that investors should invest in the stock. However, there is debate over the effectiveness of this metric as analysts’ recommendations tend to be overly optimistic. Despite mixed business trends, Oppenheimer analyst Brian Schwartz maintains an Outperform rating on CRM with a price target of $235. He notes improvement in the demand environment and possibly stabilising trends in sales operations for the company. Salesforce.com’s stock closed at $206.76, down 1.09% from the previous day. The stock has been receiving attention from investors, and it is important to be aware of factors that can impact its prospects.
Individual Stocks Summary
Now, let’s explore how the algorithm operates on a larger scale. In this section, we adhere to the following guiding principles:
- We select articles that mention only one ticker in the description. This ensures that our content is focused on a single company, which has consistently proven to produce the highest quality content generated by the code.
- We prioritise stocks with more than 10 news articles in the description. Our weekly article consumption capacity is limited, so we avoid overloading it with smaller coverage news.
- There are exceptions for tickers like ‘SPY,’ ‘DJI,’ and ‘QQQ’ (also found in ETFs with the same names). We report on them even if there are fewer than 10 news articles per week, as these tickers are significant indicators of the broader market.
- Stocks receiving more than 10 news articles per week are typically large ‘blue-chip’ companies, often within the Technology sector. With the volume of news they generate, summarization becomes an ideal task for distilling the information efficiently.
- Occasionally, smaller, less ‘hyped’ stocks gain significant attention. It’s valuable to quickly learn about these companies and understand why they’ve attracted so much media coverage, especially when they are typically underreported.
This section is a personal favourite for me, as I enjoy keeping up with the tech industry and discovering new companies to explore.
In this post, I’ll showcase coverage of one particular stock, summarising 52 news articles into a few sentences (the first line is generated by human):
NEWS SUMMARY for (‘NVDA’, 52), which changed on 1.78% last trading day:
Nvidia (NVDA) has been receiving positive attention from investors and analysts. The company’s strong financial performance and above-average growth make it an appealing investment option. Despite failing to acquire Arm for $40 billion, Nvidia’s success and hype around AI have been cited as reasons to be optimistic about the stock. Wall Street analysts have given overly optimistic recommendations for NVDA, raising questions about the effectiveness of these recommendations. Overall, Nvidia is well-positioned for future growth and is viewed favorably by both investors and analysts.
Daily Market Summary (Approximately 230 News Articles)
Many news articles contain multiple tickers, whether it’s related to market-wide announcements, sector-wide updates, or shared attributes among a few stocks (e.g., competitors or ‘penny’ stocks).
The approach remains consistent: we compile these articles into a single corpus of text input and then divide it into manageable chunks. For a typical one-day summarization, we usually require 2–3 API calls for the GPT-3.5-Turbo-16k model and 4–6 calls for the GPT-4–8k model.
Here’s an example of chunk intervals for the daily summary of August 28, 2023:
Chunks of news :[0, 38, 96, 155, 222]
Analyzing MARKET news for chunk 0 (news range: (0, 38)): total count of words for a current chunk 2943
Analyzing MARKET news for chunk 1 (news range: (38, 96)): total count of words for a current chunk 2886
Analyzing MARKET news for chunk 2 (news range: (96, 155)): total count of words for a current chunk 2906
Analyzing MARKET news for chunk 3 (news range: (155, 222)): total count of words for a current chunk 2925
We utilise both the GPT-3.5-Turbo-16k and GPT-4–8k models to generate two summaries, as they often differ in terms of highlights and the quality of the summary.
However, we refrain from using the GPT-4–8k model, which offers superior quality, due to its higher cost. Running the GPT-4–8k model can result in expenses of up to $20 per run (for the summary of 5k news), and the time required is significantly longer, taking 20 seconds to make 2 calls for GPT-3.5-Turbo-16k versus 3 minutes and 30 seconds to make 4 calls to GPT-4–8k. This translates to approximately 10 seconds per call for GPT-3.5-Turbo compared to 45 seconds per call for GPT-4.
Comparing the quality of the summary can be challenging, as it requires reading 250+ news articles, understanding their significance, and manually summarising them for comparison with the automatic summary. To facilitate this evaluation, we provide both summaries, generated by GPT-3.5-Turbo and GPT-4, for you to make a comparison:
[GPT 3.5 Summary] (full article link):
MARKET NEWS SUMMARY (‘multiple_tickers’, 231) for the last 24 hours from 28/08/2023 20:59 UTC time:
- The stock market is consolidating slightly above support levels, with investors cautious after Jerome Powell’s speech at Jackson Hole and focusing on the narrative of no recession.
- Economic data has been strong, postponing the possibility of a recession for now.
- Lower income households are maintaining spending by borrowing while higher income households continue to spend as stock prices and house prices remain high.
- Excessive government borrowing and spending is providing more stimulus to the economy than expected.
- Recession probability is estimated at 45%, soft landing at 35%, and no landing at 20%.
- Nikola stock and Plug Power stock are growing rapidly in terms of revenue but are also seeing increasing losses.
- Federal Reserve Chair Jerome Powell warned of potential interest rate hikes until inflation is within the target range.
- U.S. stocks opened higher on Monday, with the Dow Jones up 0.81% and the NASDAQ up 0.61%.
- Kimco Realty announced a merger agreement to acquire RPT Realty in an all-stock transaction worth $2 billion.
- Hersha Hospitality Trust announced a merger with KSL Capital Partners.
- Mobile Infrastructure Corporation’s stock shot up 76% after the launch of its live-streaming e-commerce business on TikTok.
- Sherwin-Williams shares have seen a 21% increase in the past six months.
- Analysts recommend buying three penny stocks with price targets of up to 1,494%.
- Uber Eats is preparing to roll out an AI-powered chatbot to help users decide what to eat.
- American Airlines was fined $4.1 million for leaving passengers stranded on the tarmac.
- Five Below is expected to report quarterly revenue of $758.3 million and Dollar General is expected to report revenue of $9.9 billion.
- MACOM Technology Solutions Holdings was upgraded to Buy by Benchmark after its acquisition of Wolfspeed’s radio frequency business.
- NovoCure shares plunged while Mobile Infrastructure and GD Culture Group saw their shares spike higher.
[GPT 4 Summary] (full article link):
MARKET NEWS SUMMARY (‘multiple_tickers’, 231) for the last 24 hours from 28/08/2023 20:59 UTC time:
- Consumer-centric stocks LYV, RCL, DKNG, MAR, PEP continue to show strong potential for 2023.
- Oncology treatment market is experiencing significant growth, with companies REGN, MRK, NVS and AZN taking the lead.
- CHK and LNG are fundamentally sound investments amid uncertainty within the natural gas market.
- Three biotech stocks AGEN, BCRX and NNOX highlighted as attractive buys.
- High demand trends have allegedly boosted Ciena’s (CIEN) fiscal Q3 performance.
- Lennar and Quanta Services are part of the Zacks Earnings Preview article.
- Novo Nordisk stock’s latest financial report suggests it’s only getting started.
- Wall Street delivered mixed performances last week due to rising rates.
- Caterpillar (CAT), Applied Materials (AMAT), Toll Brothers (TOL), Walmart (WMT) and Dr. Reddy’s (RDY) are solid choices for investors amid market weakness.
- XPeng (XPEV) plans to accelerate adoption of Smart EVs and technologies through an alliance with DiDi.
- A “bear market”, where stock prices fall at least 20% from their peak, is expected according to some analysts.
- Healthcare technology, building technology, and security solutions are three industries recommended for investment.
- Warren Buffett has bought three homebuilder stocks, one of which would be a winner for patient, long-term investors.
- An uptick in insider trading has been reported amongst various stocks.
- Zacks has highlighted Palo Alto Networks and Ciena as bull and bear of the day, respectively.
- A rise in high-yielding stocks in the healthcare sector was reported. — Wall Street analysts targeting high-yield dividend stocks in the health care sector revealed Pfizer Inc. as a primary pick.
- A look at the most overbought stocks in the materials sector shows potential for investment in overvalued companies.
- Nvidia stocks are poised to rally around 20% according to an analyst. — For investors in penny stocks, 2023 is offering fresh opportunities. — Amidst inflationary issues, real estate industry presents an attractive proposition for patient, long-term followers of Warren Buffet investing model.
- Fidelity MSCI Utilities Index ETF (FUTY) and S&P SmallCap 600 Revenue ETF (RWJ) are being eyed for potential investment as cash-like ETFs hit a 52-week high.
- PetVivo Holdings announced it has achieved national distribution for its product Spryng to over 500 veterinary clinics across 49 states.
- Novo Nordisk’s strong stock performance continues, with predictions of more growth to come.
- Nvidia executives remain confident about the company’s market position in spite of ongoing chip supply issues.
- In the tech sector, Alphabet and Adobe are identified as potential growth stocks to invest in 2023.
- The automotive industry is set for a shake-up with the increased adoption of EV technologies, as evidenced by XPEng’s recent partnership with DiDi.
- The global brand FMC has made notable strides in the global herbicide product market.
- The biggest stock on Wall Street is reportedly not Amazon or Nvidia, but remains undisclosed.
- Shell is reportedly in talks with China’s Sinopec for the sale of Shell’s Bukom oil refinery in Singapore.
- Warren Buffett’s interest in the homebuilding sector indicates a likely upward trend for the industry.
- Investment experts warn against an emerging trend of overbought stocks in the materials sector.
- With interest rates poised to rise further this year, cash-like ETFs have hit a 52-week high.
- According to a recent note from Citigroup Global Markets, investors should not invest in “those inferior foreign stocks”.
Weekly Market summary (Approximately 2200 News Articles)
Now, let’s delve into the most complex aspect of the summarization process as we analyse a substantial corpus of text through 15–20 API calls. Our goal is to progressively refine the previous summary with new input text. This approach is known as ‘few-shot learning,’ and the concept behind it is as follows:
- Prompt_1: ‘Provide a summary for this text: <input text chunk 1>’
- Prompt_2: ‘Enhance the previous summary <summary from chunk 1> with this new article: <input text chunk 2>’
- Prompt_n: ‘Enhance the summary of chunks 1..n-1 with this new input: <input text chunk n>’
The summary generated after the last Prompt is the resulting text that we obtain and publish on our website.
Here are a few bullet points from it:
- Amphastar Pharma (AMPH) shares are up 89.1% year to date due to the acquisition of Baqsimi from Lilly and FDA approval for Naloxone.
- Capital One confirms another big sale of its office loans as fallout in the sector intensifies in the face of higher interest rates and tumbling property values.
- SpartanNash (SPTN) posts higher sales for the second quarter of 2023 on increased sales across both segments and solid comparable sales.
- Dividend payout and strong liquidity boost Canadian National (CNI).
- The Children’s Place (PLCE) second-quarter fiscal 2023 results reflect a year-over-year decline in both top and bottom lines due to a tough macroeconomic environment.
( You can access the complete report for August 22–28, 2023, by clicking on this link )
Some potential caveats to consider:
- We are uncertain about how the weights assigned to the news are calculated, and whether the previous summary rapidly loses relevance and is replaced by the latest news chunk summary.
- It remains unclear whether more recent news items carry more weight, or if the quantity of news plays a significant role in determining their weight.
- The output appears to be somewhat disorderly, occasionally covering small stocks without a discernible logical order in the output
I wouldn’t confidently assert that this is the most accurate summary. I would greatly appreciate hearing your thoughts on potential improvements in the summarisation process and any insights gained from your experience using OpenAI’s ChatGPT.
In the realm of stock markets and financial news, we offer an innovative approach for time-pressed investors. Our data-driven strategy spotlights stocks with significant news coverage, over 10 stories weekly, aiding investors in identifying crucial events and trends. Real-time market sentiment analysis and predictive insights empower readers to make informed investment decisions. We also explore ‘star’ companies with high media attention, uncovering hidden growth opportunities.
Our high-level plan involves account setup, Google Drive integration, news retrieval and analysis, and the OpenAI ChatGPT API. We’ve factored in pricing and token limitations to ensure reliability.
Our algorithm prioritises stocks with ample news coverage while considering market-wide indicators. We strike a balance between comprehensiveness and efficiency to deliver concise yet informative summaries.
We’ve refined our approach using GPT-3.5-Turbo-16k and GPT-4–8k models, using the latter selectively due to cost and processing time.
While we don’t claim perfection, we welcome feedback to enhance our summarisation process and empower investors in the dynamic world of finance. For details on August 22–28, 2023, follow the provided link.