Web Scraper Python Beautifulsoup

Posted By admin On 02.05.21

Web crawling is a powerful technique to collect data from the web by finding all the URLs for one or multiple domains. Python has several popular web crawling libraries and frameworks.

Web crawling is a component of web scraping, the crawler logic finds URLs to be processed by the scraper code. A web crawler starts with a list of URLs to visit, called the seed. For each URL, the crawler finds links in the HTML, filters those links based on some criteria and adds the new links to a queue.
Using BeautifulSoup. BeautifulSoup is widely used due to its simple API and its powerful extraction capabilities. It has many different parser options that allow it to understand even the most poorly written HTML pages – and the default one works great. Compared to libraries that offer similar functionality, it’s a pleasure to use.

Dec 11, 2020 Web crawling is a component of web scraping, the crawler logic finds URLs to be processed by the scraper code. A web crawler starts with a list of URLs to visit, called the seed. For each URL, the crawler finds links in the HTML, filters those links based on some criteria and adds the new links to a queue.

In this article, we will first introduce different crawling strategies and use cases. Then we will build a simple web crawler from scratch in Python using two libraries: requests and Beautiful Soup. Next, we will see why it’s better to use a web crawling framework like Scrapy. Finally, we will build an example crawler with Scrapy to collect film metadata from IMDb and see how Scrapy scales to websites with several million pages.

What is a web crawler?

Web crawling and web scraping are two different but related concepts. Web crawling is a component of web scraping, the crawler logic finds URLs to be processed by the scraper code.

A web crawler starts with a list of URLs to visit, called the seed. For each URL, the crawler finds links in the HTML, filters those links based on some criteria and adds the new links to a queue. All the HTML or some specific information is extracted to be processed by a different pipeline.

Web crawling strategies

In practice, web crawlers only visit a subset of pages depending on the crawler budget, which can be a maximum number of pages per domain, depth or execution time.

Most popular websites provide a robots.txt file to indicate which areas of the website are disallowed to crawl by each user agent. The opposite of the robots file is the sitemap.xml file, that lists the pages that can be crawled.

Popular web crawler use cases include:

Search engines (Googlebot, Bingbot, Yandex Bot…) collect all the HTML for a significant part of the Web. This data is indexed to make it searchable.
SEO analytics tools on top of collecting the HTML also collect metadata like the response time, response status to detect broken pages and the links between different domains to collect backlinks.
Price monitoring tools crawl e-commerce websites to find product pages and extract metadata, notably the price. Product pages are then periodically revisited.
Common Crawl maintains an open repository of web crawl data. For example, the archive from October 2020 contains 2.71 billion web pages.

Next, we will compare three different strategies for building a web crawler in Python. First, using only standard libraries, then third party libraries for making HTTP requests and parsing HTML and finally, a web crawling framework.

Building a simple web crawler in Python from scratch

To build a simple web crawler in Python we need at least one library to download the HTML from a URL and an HTML parsing library to extract links. Python provides standard libraries urllib for making HTTP requests and html.parser for parsing HTML. An example Python crawler built only with standard libraries can be found on Github.

The standard Python libraries for requests and HTML parsing are not very developer-friendly. Other popular libraries like requests, branded as HTTP for humans, and Beautiful Soup provide a better developer experience. You can install the two libraries locally.

A basic crawler can be built following the previous architecture diagram.

The code above defines a Crawler class with helper methods to download_url using the requests library, get_linked_urls using the Beautiful Soup library and add_url_to_visit to filter URLs. The URLs to visit and the visited URLs are stored in two separate lists. You can run the crawler on your terminal.

The crawler logs one line for each visited URL.

The code is very simple but there are many performance and usability issues to solve before successfully crawling a complete website.

The crawler is slow and supports no parallelism. As can be seen from the timestamps, it takes about one second to crawl each URL. Each time the crawler makes a request it waits for the request to be resolved and no work is done in between.
The download URL logic has no retry mechanism, the URL queue is not a real queue and not very efficient with a high number of URLs.
The link extraction logic doesn’t support standardizing URLs by removing URL query string parameters, doesn’t handle URLs starting with #, doesn’t support filtering URLs by domain or filtering out requests to static files.
The crawler doesn’t identify itself and ignores the robots.txt file.

Next, we will see how Scrapy provides all these functionalities and makes it easy to extend for your custom crawls.

Web crawling with Scrapy

Scrapy is the most popular web scraping and crawling Python framework with 40k stars on Github. One of the advantages of Scrapy is that requests are scheduled and handled asynchronously. This means that Scrapy can send another request before the previous one is completed or do some other work in between. Scrapy can handle many concurrent requests but can also be configured to respect the websites with custom settings, as we’ll see later.

Scrapy has a multi-component architecture. Normally, you will implement at least two different classes: Spider and Pipeline. Web scraping can be thought of as an ETL where you extract data from the web and load it to your own storage. Spiders extract the data and pipelines load it into the storage. Transformation can happen both in spiders and pipelines, but I recommend that you set a custom Scrapy pipeline to transform each item independently of each other. This way, failing to process an item has no effect on other items.

On top of all that, you can add spider and downloader middlewares in between components as it can be seen in the diagram below.

Scrapy Architecture Overview [source]

If you have used Scrapy before, you know that a web scraper is defined as a class that inherits from the base Spider class and implements a parse method to handle each response. If you are new to Scrapy, you can read this article for easy scraping with Scrapy.

Scrapy also provides several generic spider classes: CrawlSpider, XMLFeedSpider, CSVFeedSpider and SitemapSpider. The CrawlSpider class inherits from the base Spider class and provides an extra rules attribute to define how to crawl a website. Each rule uses a LinkExtractor to specify which links are extracted from each page. Next, we will see how to use each one of them by building a crawler for IMDb, the Internet Movie Database.

Building an example Scrapy crawler for IMDb

Before trying to crawl IMDb, I checked IMDb robots.txt file to see which URL paths are allowed. The robots file only disallows 26 paths for all user-agents. Scrapy reads the robots.txt file beforehand and respects it when the ROBOTSTXT_OBEY setting is set to true. This is the case for all projects generated with the Scrapy command startproject.

This command creates a new project with the default Scrapy project folder structure.

Then you can create a spider in scrapy_crawler/spiders/imdb.py with a rule to extract all links.

You can launch the crawler in the terminal.

You will get lots of logs, including one log for each request. Exploring the logs I noticed that even if we set allowed_domains to only crawl web pages under https://www.imdb.com, there were requests to external domains, such as amazon.com.

IMDb redirects from URLs paths under whitelist-offsite and whitelist to external domains. There is an open Scrapy Github issue that shows that external URLs don’t get filtered out when the OffsiteMiddleware is applied before the RedirectMiddleware. To fix this issue, we can configure the link extractor to deny URLs starting with two regular expressions.

Rule and LinkExtractor classes support several arguments to filter out URLs. For example, you can ignore specific URL extensions and reduce the number of duplicate URLs by sorting query strings. If you don’t find a specific argument for your use case you can pass a custom function to process_links in LinkExtractor or process_values in Rule.

For example, IMDb has two different URLs with the same content.

To limit the number of crawled URLs, we can remove all query strings from URLs with the url_query_cleaner function from the w3lib library and use it in process_links.

Now that we have limited the number of requests to process, we can add a parse_item method to extract data from each page and pass it to a pipeline to store it. For example, we can either extract the whole response.text to process it in a different pipeline or select the HTML metadata. To select the HTML metadata in the header tag we can code our own XPATHs but I find it better to use a library, extruct, that extracts all metadata from an HTML page. You can install it with pip install extract.

I set the follow attribute to True so that Scrapy still follows all links from each response even if we provided a custom parse method. I also configured extruct to extract only Open Graph metadata and JSON-LD, a popular method for encoding linked data using JSON in the Web, used by IMDb. You can run the crawler and store items in JSON lines format to a file.

The output file imdb.jl contains one line for each crawled item. For example, the extracted Open Graph metadata for a movie taken from the <meta> tags in the HTML looks like this.

The JSON-LD for a single item is too long to be included in the article, here is a sample of what Scrapy extracts from the <script type='application/ld+json'> tag.

Exploring the logs, I noticed another common issue with crawlers. By sequentially clicking on filters, the crawler generates URLs with the same content, only that the filters were applied in a different order.

Long filter and search URLs is a difficult problem that can be partially solved by limiting the length of URLs with a Scrapy setting, URLLENGTH_LIMIT.

I used IMDb as an example to show the basics of building a web crawler in Python. I didn’t let the crawler run for long as I didn’t have a specific use case for the data. In case you need specific data from IMDb, you can check the IMDb Datasets project that provides a daily export of IMDb data and IMDbPY, a Python package for retrieving and managing the data.

Web crawling at scale

If you attempt to crawl a big website like IMDb, with over 45M pages based on Google, it’s important to crawl responsibly by configuring the following settings. You can identify your crawler and provide contact details in the BOT_NAME setting. To limit the pressure you put on the website servers you can increase the DOWNLOAD_DELAY, limit the CONCURRENT_REQUESTS_PER_DOMAIN or set AUTOTHROTTLE_ENABLED that will adapt those settings dynamically based on the response times from the server.

Notice that Scrapy crawls are optimized for a single domain by default. If you are crawling multiple domains check these settings to optimize for broad crawls, including changing the default crawl order from depth-first to breath-first. To limit your crawl budget, you can limit the number of requests with the CLOSESPIDER_PAGECOUNT setting of the close spider extension.

With the default settings, Scrapy crawls about 600 pages per minute for a website like IMDb. To crawl 45M pages it will take more than 50 days for a single robot. If you need to crawl multiple websites it can be better to launch separate crawlers for each big website or group of websites. If you are interested in distributed web crawls, you can read how a developer crawled 250M pages with Python in 40 hours using 20 Amazon EC2 machine instances.

In some cases, you may run into websites that require you to execute JavaScript code to render all the HTML. Fail to do so, and you may not collect all links on the website. Because nowadays it’s very common for websites to render content dynamically in the browser I wrote a Scrapy middleware for rendering JavaScript pages using ScrapingBee’s API.

Conclusion

We compared the code of a Python crawler using third-party libraries for downloading URLs and parsing HTML with a crawler built using a popular web crawling framework. Scrapy is a very performant web crawling framework and it’s easy to extend with your custom code. But you need to know all the places where you can hook your own code and the settings for each component.

Configuring Scrapy properly becomes even more important when crawling websites with millions of pages. If you want to learn more about web crawling I suggest that you pick a popular website and try to crawl it. You will definitely run into new issues, which makes the topic fascinating!

Sources

Introduction

Web scraping is a technique employed to extract a large amount of data from websites and format it for use in a variety of applications. Web scraping allows us to automatically extract data and present it in a usable configuration, or process and store the data elsewhere. The data collected can also be part of a pipeline where it is treated as an input for other programs.

In the past, extracting information from a website meant copying the text available on a web page manually. This method is highly inefficient and not scalable. These days, there are some nifty packages in Python that will help us automate the process! In this post, I’ll walk through some use cases for web scraping, highlight the most popular open source packages, and walk through an example project to scrape publicly available data on Github.

Web Scraping Use Cases

Web scraping is a powerful data collection tool when used efficiently. Some examples of areas where web scraping is employed are:

Search: Search engines use web scraping to index websites for them to appear in search results. The better the scraping techniques, the more accurate the results.
Trends: In communication and media, web scraping can be used to track the latest trends and stories since there is not enough manpower to cover every new story or trend. With web scraping, you can achieve more in this field.
Branding: Web scraping also allows communications and marketing teams scrape information about their brand’s online presence. By scraping for reviews about your brand, you can be aware of what people think or feel about your company and tailor outreach and engagement strategies around that information.
Machine Learning: Web scraping is extremely useful in mining data for building and training machine learning models.
Finance: It can be useful to scrape data that might affect movements in the stock market. While some online aggregators exist, building your own collection pool allows you to manage latency and ensure data is being correctly categorized or prioritized.

Tools & Libraries

There are several popular online libraries that provide programmers with the tools to quickly ramp up their own scraper. Some of my favorites include:

Requests – a library to send HTTP requests, which is very popular and easier to use compared to the standard library’s urllib.
BeautifulSoup – a parsing library that uses different parsers to extract data from HTML and XML documents. It has the ability to navigate a parsed document and extract what is required.
Scrapy – a Python framework that was originally designed for web scraping but is increasingly employed to extract data using APIs or as a general purpose web crawler. It can also be used to handle output pipelines. With scrapy, you can create a project with multiple scrapers. It also has a shell mode where you can experiment on its capabilities.
lxml – provides python bindings to a fast html and xml processing library called libxml. Can be used discretely to parse sites but requires more code to work correctly compared to BeautifulSoup. Used internally by the BeautifulSoup parser.
Selenium – a browser automation framework. Useful when parsing data from dynamically changing web pages when the browser needs to be imitated.

Library	Learning curve	Can fetch	Can process	Can run JS	Performance
`requests`	easy	yes	no	no	fast
`BeautifulSoup4`	easy	no	yes	no	normal
`lxml`	medium	no	yes	no	fast
`Selenium`	medium	yes	yes	yes	slow
`Scrapy`	hard	yes	yes	no	normal

Using the `Beautifulsoup` HTML Parser on Github

We’re going to use the BeautifulSoup library to build a simple web scraper for Github. I chose BeautifulSoup because it is a simple library for extracting data from HTML and XML files with a gentle learning curve and relatively little effort required. It provides handy functionality to traverse the DOM tree in an HTML file with helper functions.

Requirements

In this guide, I will expect that you have a Unix or Windows-based machine. You might want to install Kite for smart autocompletions and in-editor documentation while you code. You are also going to need to have the following installed on your machine:

Python 3
BeautifulSoup4 Library

Profiling the Webpage

We first need to decide what information we want to gather. In this case, I’m hoping to fetch a list of a user’s repositories along with their titles, descriptions, and primary programming language. To do this, we will scrape Github to get the details of a user’s repositories. While this information is available through Github’s API, scraping the data ourselves will give us more control over the format and thoroughness of the end data.

Web Scraper Python Beautifulsoup Download

Once that’s done, we’ll profile the website to see where our target information is located and create a plan to retrieve it.

To profile the website, visit the webpage and inspect it to get the layout of the elements.

Let’s visit Guido van Rossum’s Github profile as an example and view his repositories:

The div containing the list of repos From the screenshot above, we can tell that a user’s list of repositories is located in a div called user-repositories-list, so this will be the focus of our scraping. This div contains list items that are the list of repositories.
List item that contains a single repo’s info / relevant info on DOM tree The next part shows us the location of a single list item that contains a single repository’s information. We can also see this section as it appears on the DOM tree.
Location of the repository’s name and link Inside a single list item, there is a href link that contains a repository’s name and link.
Location of repository’s description
Location of repository’s language

Web Scraper Python Beautiful Soup Free

For our simple scraper, we will extract the repo name, description, link, and the programming language.

Scraper Setup

We’ll first set up our virtual environment to isolate our work from the rest of the system, then activate the environment. Type the following commands in your shell or command prompt:
mkdir scraping-example cd scraping-example

If you’re using a Mac, you can use this command to active the virtual environment:
python -m venv venv-scraping

On Windows the virtual environment is activated by the following command:
venv-scrapingScriptsactivate.bat

Finally, install the required packages:
pip install bs4 requests

The first package, requests, will allow us to query websites and receive the websites HTML content as rendered on the browser. It is this HTML content that our scraper will go through and find the information we require.

The second package, BeautifulSoup4, will allow us to go through the HTML content, then locate and extract the information we require. It allows us to search for content by HTML tags, elements, and class names using Python’s inbuilt parser module.

Kite is a plugin for PyCharm, Atom, Vim, VSCode, Sublime Text, and IntelliJ that uses machine learning to provide you with code completions in real time sorted by relevance. Start coding faster today.

The Simple Scraper Function

Our function will query the website using requests and return its HTML content.

The next step is to use BeautifulSoup library to go through the HTML and extract the div that we identified contains the list items within a user’s repositories. We will then loop through the list items and extract as much information from them as possible for our use.

You may have noticed how we extracted the programming language. BeautifulSoup does not only allow us to search for information using HTML elements but also using attributes of the HTML elements. This is a simple trick to enhance accuracy when working with programming-related data sets.

That’s it! You have successfully built your Github Repository Scraper and can test it on a bunch of other users’ repositories. You can check out Kite’s Github repository to easily access the code from this post and others from their Python series.

Now that you’ve built this scraper, there are myriad possibilities to enhance and utilize it. For example, this scraper can be modified to send a notification when a user adds a new repository. This would enable you to be aware of a developer’s latest work. (Remember when I mentioned that scraping tools are useful in finance? Maintaining your own scraper and setting up notifications for new data would be very useful in that setting).

Another idea would be to build a browser extension that displays a user’s repositories on hover at any page on Github. The scraper would feed data into an API that serves the extension. This data will be then served and displayed on the extension. You can also build a comparison tool for Github users based on the data you scrape, creating a ranking based on how actively users update their repositories or using keyword detection to find repositories that are relevant to you.

What’s Next?

We covered the basics of web scraping in this post and only touched a few of the many use cases for it. requests and beautifulsoup are powerful and relatively simple tools for web scraping, but you can also check out some of the more advanced libraries I highlighted at the beginning of the post for even more functionality. The next steps would be to build more complex scrapers that could be made of multiple scraping functions from many different sources. There are endless ways these scrapers can be integrated into any project that would benefit from data that’s publicly available on the web. Eventually, you’ll have so many web scraping functions running that you’ll have to start thinking about moving your computation to a home server or the cloud!

This post is a part of Kite’s new series on Python. You can check out the code from this and other posts on our GitHub repository.

Company

Product

Resources

Stay in touch

Scrape Website With Beautifulsoup

Get Kite updates & coding tips

Made with in San Francisco