How to scrape Yahoo Finance data with Python and Beautiful Soup

May 11, 2020
Day 6 of 101 Days of Python
We are beginning to get on a roll with web scraping. Today there is a question on Reddit about scraping data from Yahoo Finance. I feel like 10% of the questions on programming subreddits are about pulling data from Yahoo Finance, so I think this one will be useful.
Here is the original question from Reddit:
Help with BS4 web scraping on Yahoo Finance
I’m a very beginner in python and programming and i’m trying to make a telegram bot for some stocks notification. I’m able to get the name and price from yahoo finance but i’m unable get more information from tables such as Top 10 holdings from https://finance.yahoo.com/quote/SMH/holdings?p=SMH. I am only able to get the first td and nothing else. Could someone please help me with some guidance. I would like to get the names and % from the table.
Let's use the URL listed in the question, but look at two different tables on the page, not just the Top 10 Holdings table. We will first look at the Top 10 Holdings table, then we will look at the Equity Holdings table.
Scraping the Top 10 Holdings Table
Let's first look at the HTML for the table. You will see that the <div> containing the <table> has a data-reactid of 212. So, if we can pull this <table> with Beautiful Soup, then we can iterate through all of the rows and pull the data that we need.
Good news! It turns out this is really, really easy, and we can do it in just a few lines of code.
So, let's first get the soup and grab all of the <div>s.
```python
import requests, bs4

def get_soup(url):
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.content, 'lxml')
    return soup

soup = get_soup('https://finance.yahoo.com/quote/SMH/holdings?p=SMH')
divs = soup.find_all('div')
```
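One practical note: Yahoo Finance sometimes rejects requests that use the default python-requests User-Agent. If get_soup starts returning error pages, sending a browser-like header usually helps. A sketch (the exact header string is illustrative):

```python
import requests, bs4

# a browser-like User-Agent; the exact string is illustrative
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}

def get_soup_with_headers(url):
    page = requests.get(url, headers=HEADERS)
    page.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error page
    return bs4.BeautifulSoup(page.content, 'lxml')
```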
Next, we need a function that will do the following:
(1) grab <div> 212
(2) get all of the <tr> tags in <div> 212
(3) grab the <td>'s in each <tr>, and pull the text from each <td>
(4) return all of the pulled text
```python
# helper to match a tag by its data-reactid attribute
# (shown again in the Equity Holdings section below)
def find_div(soup, react_id):
    return ('data-reactid' in soup.attrs.keys()
            and soup.attrs['data-reactid'] == react_id)

def get_top_10_holdings(divs):
    top_10_holdings = []
    top_10_div = list(filter(lambda x: find_div(x, '212'), divs))
    top_10_trs = top_10_div[0].find_next('table').tbody.find_all('tr')
    for tr in top_10_trs:
        top_10_tds = tr.find_all('td')
        top_10_holdings.append(list(map(lambda x: x.get_text(), top_10_tds)))
    return top_10_holdings
```
Run this code, and you will see the results from the Top 10 Holdings table.
```
>>> import pprint
>>> top_10 = get_top_10_holdings(divs)
>>> pprint.pprint(top_10)
[['Taiwan Semiconductor Manufacturing Co Ltd ADR', 'TSM.TW', '12.06%'],
 ['Intel Corp', 'INTC', '10.68%'],
 ['NVIDIA Corp', 'NVDA', '7.44%'],
 ['Broadcom Inc', 'AVGO', '5.03%'],
 ['ASML Holding NV ADR', 'ASML', '4.97%'],
 ['Texas Instruments Inc', 'TXN', '4.96%'],
 ['Qualcomm Inc', 'QCOM', '4.80%'],
 ['Analog Devices Inc', 'ADI', '4.76%'],
 ['Advanced Micro Devices Inc', 'AMD', '4.70%'],
 ['Micron Technology Inc', 'MU', '4.49%']]
```
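Since the question mentions building a Telegram bot, it may help to turn those raw rows into something easier to work with. A small sketch (the helper name is my own, not from the original code) that maps each ticker symbol to its weight as a float:

```python
def rows_to_weights(rows):
    # each row looks like [name, symbol, '12.06%']; strip the '%'
    # and convert the remainder to a float, keyed by ticker symbol
    return {symbol: float(pct.rstrip('%')) for name, symbol, pct in rows}

sample = [['Intel Corp', 'INTC', '10.68%'],
          ['NVIDIA Corp', 'NVDA', '7.44%']]
weights = rows_to_weights(sample)
print(weights)  # {'INTC': 10.68, 'NVDA': 7.44}
```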
The Equity Holdings Table
The Equity Holdings table is slightly different because it is not an actual HTML <table>; it is just a <div> with other <div>s inside of it. So, we will approach this a little differently.
Again, look at the Equity Holdings <div>, and the <div>s inside the Equity Holdings Div. You should notice two things:
(1) The Equity Holdings <div> has a data-reactid of 117
(2) The <div>s that make up the rows of the Equity Holdings table each also have their own data-reactid
So, we can first look up the Equity Holdings <div> by its data-reactid, then look up each of the rows by their own data-reactids and pull the relevant information.
First, let's set up a dictionary to hold the data-reactid of each row, and a dictionary that will hold the data after we scrape it.
```python
equity_holdings_react_ids = {
    'Price/Earnings':    '127',
    'Price/Book':        '137',
    'Price/Sales':       '142',
    'Price/Cashflow':    '147',
    'Median Market Cap': '152',
}
equity_holdings = {}
```
Next, let's get the Equity Holdings <div> with the help of a function that finds a <div> by its data-reactid.
```python
def find_div(soup, react_id):
    return ('data-reactid' in soup.attrs.keys()
            and soup.attrs['data-reactid'] == react_id)

soup = get_soup('https://finance.yahoo.com/quote/SMH/holdings?p=SMH')
divs = soup.find_all('div')
eh_div = list(filter(lambda x: find_div(x, '117'), divs))
```
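To see find_div in action without hitting Yahoo, you can run it against a tiny hand-written snippet. The HTML below is illustrative, not Yahoo's real markup, and the stdlib html.parser is used so the example runs without lxml:

```python
import bs4

def find_div(soup, react_id):
    # match a tag whose data-reactid attribute equals react_id
    return ('data-reactid' in soup.attrs.keys()
            and soup.attrs['data-reactid'] == react_id)

html = ('<div data-reactid="117"><span>Equity Holdings</span></div>'
        '<div>no reactid here</div>')
toy = bs4.BeautifulSoup(html, 'html.parser')
matches = list(filter(lambda x: find_div(x, '117'), toy.find_all('div')))
print(matches[0].get_text())  # Equity Holdings
```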
Now, all we need is a function to take each key/value pair from the equity_holdings_react_ids dictionary, pull the corresponding <div>, extract the needed information, and write it to the equity_holdings dictionary.
```python
def add_kvp(soup, key):
    # note: this filters the module-level divs list; the soup argument
    # is unused here but mirrors the call signature used below
    div = list(filter(lambda x: find_div(x, equity_holdings_react_ids[key]), divs))
    # walk three levels into the nested <div>s to reach the row's title
    title = div[0].children.__next__().children.__next__().children.__next__().contents[0]
    value = list(div[0].children)[1].get_text()
    if value != 'N/A':
        value = float(value)
    equity_holdings[title] = value

for k in equity_holdings_react_ids.keys():
    add_kvp(eh_div, k)
```
If you run this code you will see that we successfully captured the data from the Equity Holdings table.
```
>>> pprint.pprint(equity_holdings)
{'3 Year Earnings Growth': 'N/A',
 'Median Market Cap': 'N/A',
 'Price/Cashflow': 14.0,
 'Price/Earnings': 23.77,
 'Price/Sales': 4.72}
```
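One caveat: data-reactid values are generated by the page build and can change whenever Yahoo redeploys, so hard-coded ids like '212' and '117' may break without warning. A more durable sketch anchors on the visible label text instead (toy HTML again; Yahoo's real markup is more deeply nested):

```python
import bs4

# find a row by its visible label, then read the next <div> for its value
html = ('<div><span data-reactid="126">Price/Earnings</span>'
        '<div data-reactid="128">23.77</div></div>')
toy = bs4.BeautifulSoup(html, 'html.parser')
label = toy.find('span', string='Price/Earnings')
value = label.find_next('div').get_text()
print(value)  # 23.77
```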
I hope you found this helpful!