Scraping all images of a web page with Beautiful Soup
We all know that the internet is an incredible source of data. If we need only part of this data, it is more reasonable to scrape just that part instead of collecting the whole lot.
Beautiful Soup? What does that even mean?
Well, here is the definition from the Beautiful Soup documentation:
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
So it is a Python library that helps us get data from HTML and XML files. HTML and XML are similar in some ways. Both end with "ML", and when we expand these abbreviations, we see that "ML" stands for "markup language". With the help of this markup, Beautiful Soup lets us quickly get things done that would otherwise take a long time, like getting data out of an HTML file! :))
Let's Get Started
We can start with a quick installation. The package name is "beautifulsoup4".
$ pip install beautifulsoup4
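Once the package is installed, a quick way to see what "pulling data out of HTML" means is to parse a small HTML string instead of a real file. This is only a minimal sketch: the HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for this illustration
html = "<html><head><title>Demo page</title></head><body><p>Hello, <b>soup</b>!</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string      # text inside <title>
paragraph_text = soup.p.get_text()  # text of the first <p>, including nested tags

print(page_title)      # Demo page
print(paragraph_text)  # Hello, soup!
```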
Getting All HTML From a Website
We can use the requests module (installable with pip install requests) to get all the data from a live website. Let's try to get some data from Linus Torvalds' website.
import requests
from bs4 import BeautifulSoup
response = requests.get("http://www.linuxtorvalds.com")
htmlpage = response.text
By doing this, we get the whole HTML of the page as text.
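A live request can fail or hang, so it may be worth guarding it. The sketch below is one possible way to do that, not part of the original example: the function name fetch_html, its session parameter, and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML text of `url`, raising an exception on HTTP errors."""
    # `session` is optional: any object with a requests-style get() method works
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

It could then be called as htmlpage = fetch_html("http://www.linuxtorvalds.com"). Without a timeout, requests waits indefinitely for a server that never answers.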
Scraping the same type of HTML elements (img in our case)
Then we create a BeautifulSoup object from this HTML text and pass 'html.parser' as the second argument, since it is an HTML page. Then, with the help of the find_all method of the BeautifulSoup object, we get all the "img" elements of this web page. Passing src=True keeps only the img elements that have a src attribute.
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.linuxtorvalds.com")
htmlpage = response.text

data = BeautifulSoup(htmlpage, 'html.parser')
images = data.find_all('img', src=True)

for image in images:
    print(image)
Let's see the output of printing each image.
<img alt="My name is Linus Torvalds, and not Linux Torvalds" border="1" src="Linux-Torvalds.png"/>
<img alt="Tux the penguin: The mascot of Linux is a cartoon penguin" src="tux-the-penguin.png"/>
Okay, so now we can see the alt and the src attributes of these image elements.
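To see what the src=True filter in find_all does, here is a small sketch on a made-up HTML snippet (the tags below are invented for illustration and are not taken from the website). Only the img elements that actually carry a src attribute are returned.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two images with a src attribute and one without
html = """
<div>
  <img alt="first" src="a.png"/>
  <img alt="no source here"/>
  <img src="b.png"/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

all_images = soup.find_all("img")
images_with_src = soup.find_all("img", src=True)

print(len(all_images))       # 3
print(len(images_with_src))  # 2
```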
Getting the src of images
for image in images:
    print(image["src"])
I can reach these values as shown above, because a tag's attributes behave like key-value pairs.
Since I only need the src values, I can collect them into a list like this:
img_sources = [image['src'] for image in images]
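The key-value behaviour can be explored a little further. The sketch below uses a shortened, made-up img tag modelled on the output above. It shows square-bracket access, the get method for attributes that may be missing, and the attrs dictionary.

```python
from bs4 import BeautifulSoup

# A shortened, made-up tag modelled on the output above
html = '<img alt="Tux the penguin" src="tux-the-penguin.png"/>'
image = BeautifulSoup(html, "html.parser").find("img")

print(image["src"])         # square brackets raise KeyError if the attribute is missing
print(image.get("border"))  # get() returns None instead of raising
print(image.attrs)          # all attributes of the tag as a dictionary
```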
Sending requests to the img sources
Now I have the sources of all the images. A source path can be relative or absolute. If it is relative, I send the request to the website address (linuxtorvalds.com in my case) followed by the path. If it is already a full URL, I send the request to it directly, without adding the website address in front of it.
The complete code for my example looks like this.
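One way to handle both cases without writing the check by hand is urljoin from Python's standard urllib.parse module. This is a sketch of that option rather than a required approach, and the example.com address is made up for illustration.

```python
from urllib.parse import urljoin

page_url = "http://www.linuxtorvalds.com"

# A relative path is resolved against the page URL
relative = urljoin(page_url, "tux-the-penguin.png")
# A src that is already a full URL is returned unchanged
absolute = urljoin(page_url, "https://example.com/images/logo.png")

print(relative)  # http://www.linuxtorvalds.com/tux-the-penguin.png
print(absolute)  # https://example.com/images/logo.png
```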
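import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://www.linuxtorvalds.com"
response = requests.get(base_url)
htmlpage = response.text

data = BeautifulSoup(htmlpage, 'html.parser')
images = data.find_all('img', src=True)

for image in images:
    print(type(image))
    print(image["src"])

image_src = [x['src'] for x in images]

for i, src in enumerate(image_src):
    # urljoin resolves a relative path against the site address
    # and leaves a src that is already a full URL unchanged
    res = requests.get(urljoin(base_url, src))
    # Stop with an exception on a 4xx/5xx answer instead of saving an error page
    res.raise_for_status()
    print(res.url, res.status_code)
    # Open the file only after the download succeeded
    with open('image_' + str(i) + '.jpg', 'wb') as f:
        f.write(res.content)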
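Note that the complete code names every file image_N.jpg, although the sources on this page end in .png. A possible refinement is to take the extension from the source itself. The helper below is hypothetical (the name filename_for and the .jpg fallback are arbitrary choices for this sketch), and the example.com address is made up.

```python
import os
from urllib.parse import urlparse


def filename_for(src, index):
    """Build a local file name such as image_0.png from an image source."""
    # Use only the path part, so a query string like ?v=2 is ignored
    path = urlparse(src).path
    extension = os.path.splitext(path)[1]
    # Fall back to .jpg when the source has no extension (an arbitrary choice)
    return f"image_{index}{extension or '.jpg'}"


print(filename_for("Linux-Torvalds.png", 0))                      # image_0.png
print(filename_for("http://example.com/pics/photo.jpeg?v=2", 1))  # image_1.jpeg
print(filename_for("/avatar", 2))                                 # image_2.jpg
```

In the loop, open(filename_for(src, i), 'wb') could then replace the hard-coded '.jpg' name.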