How to scrape website content with Python

May 8, 2020
Let's keep it going with BeautifulSoup. Here we have a question on how to web scrape text, links, and images from a website using Python.
I want to use python to pull all of the copy/images/links from all the pages from a website into a document, but I have no clue where to start. Currently, I have to copy/paste everything manually and it seems like such a tedious task that should just be automated. Any tips are appreciated!!
This person is right, this process can be automated with Python, and we are going to do it with Beautiful Soup! However, we are not going to store everything in a single file, because that can be hard, and the question does not specify what kind of file (also, the desired file type is probably Excel or Word, and I'm on Linux). So instead, we are going to store the text and the links in two separate files, and store the images in a directory where they can be accessed.
As an example website, I'm going to be scraping the New York Times homepage since I know that it has all three of the elements that we are looking to get. So let's begin the Python script by defining the URL and building the Beautiful Soup object from the HTML we get back from the New York Times.
```python
import requests, bs4

# The NYT is a good example b/c it has a mixture of text, images and links
url = 'https://www.nytimes.com'
page = requests.get(url)
soup = bs4.BeautifulSoup(page.content, 'lxml')
```
Let's start with the text. There is a lot of text on the New York Times homepage, so we are going to pull the string content of all of the <p> tags. You could also pull string content from the <h1>, <h2>, etc... tags if you wanted to get the headlines as well.
```python
# You can do this with other text on the page in tags like h1, h2, etc...
text = soup.find_all('p')
text = list(map(lambda x: x.get_text(), text))
```
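As a side note, if you do want the headlines too, `find_all` accepts a list of tag names, so you can grab several kinds of tags in one pass. The snippet below uses a small inline HTML string rather than the live NYT page, just to keep it self-contained:

```python
import bs4

# A tiny stand-in document (not the live NYT page) to show the idea
html = "<h1>Top story</h1><p>Body text</p><h2>Second story</h2>"
soup = bs4.BeautifulSoup(html, 'html.parser')

# find_all accepts a list of tag names; matches come back in document order
content = [tag.get_text() for tag in soup.find_all(['h1', 'h2', 'p'])]
# content is ['Top story', 'Body text', 'Second story']
```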
We can do almost the exact same thing with links, but we will get the string content of the href attributes in the <a> tags, and we need to add an extra step. Run the code below and look at some of the web addresses in links: you will notice that some of the links on the New York Times website are absolute links (those beginning with "http") and some are relative links (those beginning with "/"). We need to turn all of them into absolute links so that when they are stored in a file, they will all point to a valid web address.
```python
links = soup.find_all('a')
# Not every <a> tag has an href attribute, so use .get() to avoid a KeyError
links = list(map(lambda x: x.attrs.get('href', ''), links))
```
To make all of the links absolute, we will use a function called pad_link.
```python
# If a link is relative, then let's prepend the site's base URL to it
def pad_link(link):
    if len(link) > 3 and link[0:4] == 'http':
        return link
    elif len(link) == 0 or link[0] == '/':
        return 'https://www.nytimes.com' + link
    else:
        return 'https://www.nytimes.com/' + link

links = soup.find_all('a')
links = list(map(lambda x: x.attrs.get('href', ''), links))

# Now all of the links will be absolute links
links = list(map(pad_link, links))
```
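As an aside, the standard library can do this padding for you: urllib.parse.urljoin resolves a relative link against a base URL and also handles edge cases like protocol-relative links ("//example.com/..."). A minimal sketch, using the NYT homepage as the assumed base:

```python
from urllib.parse import urljoin

base = 'https://www.nytimes.com'  # assumed base URL for this sketch
hrefs = ['/section/world', 'https://example.com/story', '']
absolute = [urljoin(base, h) for h in hrefs]
# absolute is ['https://www.nytimes.com/section/world',
#              'https://example.com/story',
#              'https://www.nytimes.com']
```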
Next we will pull the src attribute of every <img> tag in the webpage.
```python
images = soup.find_all('img')
# As with the links, use .get() in case an <img> tag is missing its src
images = list(map(lambda x: x.attrs.get('src', ''), images))
```
At this point, images is a list of strings that are the web addresses of images, but we want to get the actual images themselves.
How do you pull an image from a link?
We will do another get request for each image address. This will give us back the bytes for each image. We can then write these bytes to a file so that when we open the file we will see the photo. There are two other things we will need to do to make sure that this works properly, but we will cover that in a minute.
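The idea can be sketched with two small helpers. Note that download_image and the URL it would be called with are hypothetical names for this sketch, not from the article:

```python
import requests

def save_bytes(data, file_name):
    # "wb" opens the file in byte mode, so the raw image bytes go in unchanged
    with open(file_name, 'wb') as f:
        f.write(data)

def download_image(image_url, file_name):
    # .content is the response body as bytes (not decoded text)
    save_bytes(requests.get(image_url).content, file_name)

# Usage (hypothetical URL):
# download_image('https://example.com/photo.png', 'photo.png')
```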
Writing the text, links, and images to files
Let's set up a function to write each one of text, links, and images, and then call each function individually.
The text and links are easy; we are just writing strings to a regular text file.
```python
def write_file(contents, file_name):
    f = open(file_name, "w")
    for x in contents:
        f.write(x + '\n')  # one entry per line
    f.close()

write_file(text, 'nyt_text.txt')
write_file(links, 'nyt_links.txt')
```
Moving on to writing the images - the two special things we need to do when writing an image to a file:
(1) We need to make sure to use byte mode when writing (see the "wb" in the write_image function)
(2) The New York Times website seems to have two types of image files (.jpg and .png), so each file must contain the correct file extension
So to write images to files, we will need one function to write an image, one function to determine the file type of each image, and one function to pull the bytes of each image link.
```python
import os

# Make sure the output directory exists before writing into it
os.makedirs("nyt_images", exist_ok=True)

def write_image(image, image_name):
    # Notice the "wb" mode there, this means we will be writing bytes to a file
    f = open(image_name, "wb")
    f.write(image)
    f.close()

def file_type(image_url):
    if 'jpg' in image_url:
        return '.jpg'
    else:
        return '.png'

def write_images(image_urls):
    for i in range(len(image_urls)):
        image_bytes = requests.get(image_urls[i]).content
        write_image(image_bytes, "nyt_images/image_" + str(i) + file_type(image_urls[i]))

write_images(images)
```
I hope that this explanation helps with writing parts of a website to a file. The only thing I did not cover was how to get all of these into a single file (Word, Excel, etc...). I'll leave that for a possible future write-up.