Python Web Scraping Tutorial: Build a Web Scraper with BeautifulSoup and Requests (Step-by-Step)
If you’ve ever copied data from a website “just this once,” you’ve already felt the itch. What if your script could do that for you—accurately, repeatably, and in minutes? In this guide, you’ll learn how to build a beginner-friendly, reliable web scraper in Python using BeautifulSoup and the requests library. We’ll walk through setup, parsing, pagination, saving data, and how to scrape responsibly so you don’t get blocked (or break rules).
By the end, you’ll have a working scraper you can adapt to real projects like tracking prices, aggregating job listings, or collecting research data. And yes—we’ll cover the ethics and security considerations too. Let’s get you scraping the right way.
What Is Web Scraping (and When Should You Use It)?
Web scraping is the automated extraction of data from web pages. Instead of manually copying text, a scraper requests a page, reads the HTML, and pulls out the parts you care about (like titles, prices, or links).
Before you start, keep this in mind:
- If the site offers an API, use it—it’s usually faster, cleaner, and safer.
- Always check the site’s robots.txt and Terms of Service.
- Scrape responsibly with rate limits and proper identification.
Here’s why that matters: scraping without consent or care can lead to blocked IPs, legal issues, or inaccurate data. We’ll do it right.
Helpful resources:
- Robots.txt basics: Google Search Central
- BeautifulSoup docs: bs4 documentation
- requests docs: Requests: HTTP for Humans
Tools We’ll Use (and Why)
- Python 3.10+ (any recent version works)
- requests for making HTTP requests
- BeautifulSoup (bs4) for parsing HTML
- A test website designed for practice: quotes.toscrape.com
Install dependencies:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install requests beautifulsoup4
If you’re new to virtual environments, this is a short read: Python venv docs.
Quick Win: Your First 15-Line Web Scraper
Let’s get a fast result to build confidence. We’ll scrape quotes from the first page of Quotes to Scrape—a site built for practice.
import requests
from bs4 import BeautifulSoup
url = "http://quotes.toscrape.com/"
res = requests.get(url, timeout=10)
res.raise_for_status() # fail if the request didn't succeed
soup = BeautifulSoup(res.text, "html.parser")
quotes = []
for q in soup.select(".quote"):
    text = q.select_one(".text").get_text(strip=True)
    author = q.select_one(".author").get_text(strip=True)
    tags = [t.get_text(strip=True) for t in q.select(".tags .tag")]
    quotes.append({"text": text, "author": author, "tags": tags})

for q in quotes[:3]:
    print(q)
You should see a few quotes printed with authors and tags. Small script, big smile.
Now, let’s level up from a simple proof-of-concept to a robust scraper.
Step-by-Step: Build a Reliable Python Web Scraper
1) Choose a Target Site Safely
Before scraping any site:
- Read the site’s Terms of Service.
- Check robots.txt to see what’s allowed: e.g., http://quotes.toscrape.com/robots.txt.
- Avoid logging in and scraping behind authentication without explicit permission.
- Respect privacy and avoid personal data.
Here’s why: you want to reduce risk, avoid getting blocked, and keep your project aligned with ethical and legal norms.
2) Inspect the HTML You’ll Parse
Open your browser’s DevTools (Right-click → Inspect). Hover over elements to find:
- CSS classes you can select, like .quote or .author
- Links to “Next” pages (useful for pagination)
- Any unusual patterns or empty placeholders (common in JS-heavy sites)
Tip: BeautifulSoup supports CSS selectors via soup.select("css") and soup.select_one("css"). More in the bs4 docs.
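To make the difference concrete, here’s a minimal sketch on a made-up HTML snippet (the markup below is illustrative, not taken from the target site): select_one returns the first match or None, while select returns a list of all matches.

from bs4 import BeautifulSoup

# Tiny illustrative snippet, not real markup from the site.
html = '<div class="quote"><span class="text">Hello</span><small class="author">Someone</small></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one(".quote .text")           # first match, or None if nothing matches
authors = soup.select(".quote .author")           # list of all matches (possibly empty)
print(first.get_text(strip=True))                 # Hello
print([a.get_text(strip=True) for a in authors])  # ['Someone']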
3) Send HTTP Requests the Right Way
We’ll use a session with retries, a clear User-Agent, and timeouts. This improves reliability and communicates who you are.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def make_session():
    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET", "OPTIONS"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({
        "User-Agent": "ExampleScraper/1.0 (+your-email@example.com)"
    })
    return session
session = make_session()
resp = session.get("http://quotes.toscrape.com/", timeout=10)
resp.raise_for_status()
Why this matters:
- A descriptive User-Agent shows good intent.
- Retries help recover from temporary server issues.
- Timeouts prevent your script from hanging.
More on Retry options: urllib3 Retry.
4) Parse HTML with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.text, "html.parser")
for box in soup.select(".quote"):
    quote = box.select_one(".text").get_text(strip=True)
    author = box.select_one(".author").get_text(strip=True)
    print(quote, "-", author)
Best practices:
- Use CSS selectors for readability.
- Call .get_text(strip=True) to clean whitespace.
- Check for missing elements using conditionals.
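On that last point, here’s a minimal sketch of the guard pattern, reusing soup from the snippet above: select_one returns None when nothing matches, so check before calling .get_text().

for box in soup.select(".quote"):
    text_el = box.select_one(".text")
    author_el = box.select_one(".author")
    if text_el is None or author_el is None:
        # Skip unexpected markup instead of crashing on None
        continue
    print(text_el.get_text(strip=True), "-", author_el.get_text(strip=True))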
5) Handle Pagination
Most useful datasets span multiple pages. Identify the “Next” link and loop until it’s gone.
import time
import random
from bs4 import BeautifulSoup
base_url = "http://quotes.toscrape.com"
url = base_url
all_quotes = []
while url:
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for q in soup.select(".quote"):
        all_quotes.append({
            "text": q.select_one(".text").get_text(strip=True),
            "author": q.select_one(".author").get_text(strip=True),
            "tags": [t.get_text(strip=True) for t in q.select(".tags .tag")],
        })

    next_link = soup.select_one("li.next > a")
    url = base_url + next_link["href"] if next_link else None

    # Polite delay (randomized)
    time.sleep(random.uniform(1.0, 2.5))
print(f"Scraped {len(all_quotes)} quotes.")
Pro tip: Don’t scrape too fast. Randomized delays reduce load and avoid tripping rate limits.
6) Save Data to CSV or JSON
Choose a format based on your next step. CSV is great for spreadsheets; JSON is ideal for nested data.
import csv
import json
# CSV
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
writer.writeheader()
for row in all_quotes:
row_copy = row.copy()
row_copy["tags"] = "|".join(row["tags"])
writer.writerow(row_copy)
# JSON
with open("quotes.json", "w", encoding="utf-8") as f:
json.dump(all_quotes, f, ensure_ascii=False, indent=2)
Docs for formats:
- CSV: csv module
- JSON: json module
Be a Good Web Citizen: Safety, Ethics, and Security
Scraping isn’t just code—it’s conduct. Here’s the checklist I recommend:
- Check robots.txt and follow it
  - Example: quotes.toscrape.com/robots.txt
  - Learn the rules: Google robots.txt guide
- Respect rate limits
  - Add time.sleep between requests.
  - Back off on HTTP 429 (Too Many Requests).
- Identify yourself
  - Use a descriptive User-Agent with contact info.
- Avoid personal or sensitive data
  - Don’t collect PII without consent.
  - If you must store sensitive data, encrypt it and restrict access (and confirm legal compliance like GDPR/CCPA if applicable).
- Don’t bypass technical protections
  - If a site actively blocks scraping, seek permission or use an official API.
- Keep your secrets secret
  - Never hardcode credentials or API keys in your script.
  - Use environment variables if needed.
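On that last point, here’s a minimal sketch of reading a secret from an environment variable instead of hardcoding it; the variable name SCRAPER_API_KEY is just a placeholder.

import os

# Read the credential from the environment; fail fast if it's missing.
api_key = os.environ.get("SCRAPER_API_KEY")
if api_key is None:
    raise RuntimeError("Set the SCRAPER_API_KEY environment variable first.")

# Example use: send it as a header rather than embedding it in code.
headers = {"Authorization": f"Bearer {api_key}"}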
Here’s a simple way to programmatically check robots.txt permissions using Python’s robotparser:
import urllib.robotparser as rp
robots_url = "http://quotes.toscrape.com/robots.txt"
r = rp.RobotFileParser()
r.set_url(robots_url)
r.read()
allowed = r.can_fetch("*", "/")
print("Allowed to scrape root path:", allowed)
Note: robots.txt is advisory, not law—but respecting it is a best practice.
Common Scraper Errors (and How to Fix Them)
- HTTP 403 Forbidden
  - Add a legitimate User-Agent.
  - Reduce request rate.
  - Check if the content is behind authentication or uses anti-bot protections.
- HTTP 404 Not Found
  - Double-check URLs.
  - Some sites change structure per session or locale.
- HTTP 429 Too Many Requests
  - Slow down. Increase delay and add exponential backoff.
- Empty or incomplete data
  - The site might be loading content with JavaScript.
  - Check whether there’s an API. If not, tools like Selenium can render JS—but only use them if permitted.
- Different HTML between browser and script
  - Some servers vary content by headers. Try sending Accept-Language and Accept headers along with User-Agent (see the sketch after this list).
  - Check whether cookies or a session are actually required.
- Connection errors and timeouts
  - Set reasonable timeouts.
  - Use retries for intermittent network issues.
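For the header mismatch case above, here’s a minimal sketch of sending browser-like Accept and Accept-Language headers alongside your User-Agent, reusing the session from Step 3; the header values are common examples, not required ones.

# Browser-like headers; copy what your own browser sends (DevTools → Network) if needed.
headers = {
    "User-Agent": "ExampleScraper/1.0 (+your-email@example.com)",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
resp = session.get("http://quotes.toscrape.com/", headers=headers, timeout=10)
resp.raise_for_status()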
Full Example: A Minimal, Robust Scraper
This script scrapes every quote and author from the site, handles pagination, uses retries, sets headers, and saves to JSON.
import time
import random
import json
from typing import List, Dict
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
BASE_URL = "http://quotes.toscrape.com"
def make_session() -> requests.Session:
    s = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=0.8,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    s.mount("http://", adapter)
    s.mount("https://", adapter)
    s.headers.update({
        "User-Agent": "QuotesScraper/1.0 (+you@example.com)"
    })
    return s

def parse_quotes(html: str) -> List[Dict]:
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for q in soup.select(".quote"):
        text_el = q.select_one(".text")
        author_el = q.select_one(".author")
        if not text_el or not author_el:
            continue
        tags = [t.get_text(strip=True) for t in q.select(".tags .tag")]
        results.append({
            "text": text_el.get_text(strip=True),
            "author": author_el.get_text(strip=True),
            "tags": tags,
        })
    return results

def find_next_url(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("li.next > a")
    if not next_link:
        return None
    href = next_link.get("href")
    if not href:
        return None
    return BASE_URL + href

def crawl_all_quotes() -> List[Dict]:
    session = make_session()
    url = BASE_URL
    data: List[Dict] = []
    while url:
        resp = session.get(url, timeout=10)
        resp.raise_for_status()
        page_quotes = parse_quotes(resp.text)
        data.extend(page_quotes)
        url = find_next_url(resp.text)
        # polite randomized pause
        time.sleep(random.uniform(0.8, 2.2))
    return data

if __name__ == "__main__":
    quotes = crawl_all_quotes()
    print(f"Scraped {len(quotes)} quotes.")
    with open("quotes.json", "w", encoding="utf-8") as f:
        json.dump(quotes, f, ensure_ascii=False, indent=2)
Adapting this to your own project:
- Update BASE_URL.
- Adjust selectors in parse_quotes.
- Expand fields you collect.
- Add CSV export if needed.
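If you want that CSV export, here’s a minimal sketch that mirrors the earlier CSV section; save_csv is a new helper name, and it expects the list returned by crawl_all_quotes().

import csv
from typing import Dict, List

def save_csv(quotes: List[Dict], path: str = "quotes.csv") -> None:
    # Flatten the tags list so each row fits a flat CSV layout.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
        writer.writeheader()
        for row in quotes:
            writer.writerow({**row, "tags": "|".join(row["tags"])})

Call save_csv(quotes) inside the __main__ block, right next to the JSON dump.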
What About JavaScript-Heavy Sites?
Some sites render key content via JavaScript. Requests + BeautifulSoup parse the HTML returned by the server; they don’t execute JS. You have options:
- Look for an API call in the network tab. Many sites fetch JSON behind the scenes (see the sketch below).
- If scraping is permitted but requires rendering, consider:
  - Playwright or Selenium for controlled browser automation.
  - A framework like Scrapy for large-scale crawling: Scrapy docs
Always verify you’re allowed to do this. If a site works hard to block automation, stop and seek permission or use official endpoints.
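If you do spot a JSON endpoint and you’re permitted to use it, you can often call it directly with the same session and skip HTML parsing entirely. A minimal sketch; the URL and the "quotes" key below are hypothetical placeholders, so substitute whatever you actually see in DevTools.

# Hypothetical endpoint; replace with the real URL from the Network tab.
api_url = "https://example.com/api/quotes"

resp = session.get(api_url, params={"page": 1}, timeout=10)
resp.raise_for_status()
payload = resp.json()  # parsed JSON, no BeautifulSoup needed

for item in payload.get("quotes", []):  # key name is an assumption
    print(item.get("text"), "-", item.get("author"))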
Performance Tips (When You’re Ready)
- Use caching to avoid re-downloading pages during development: requests-cache (see the sketch after this list)
- Parallelization can help—but use with care. Respect the site’s capacity and your own rate limits.
- Normalize and clean data as you parse. The less cleanup later, the better.
- Log your progress and errors for long runs.
- Store computed fingerprints of pages to detect changes over time (useful for monitoring).
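For the caching tip above, here’s a minimal sketch using requests-cache (pip install requests-cache). CachedSession is a drop-in replacement for requests.Session; the cache name and one-hour expiry below are just example values.

import requests_cache

# Responses are stored locally (SQLite by default), so repeated runs during
# development read from disk instead of re-downloading pages.
session = requests_cache.CachedSession("dev_cache", expire_after=3600)

resp = session.get("http://quotes.toscrape.com/", timeout=10)
print("Served from cache:", resp.from_cache)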
Security Considerations for Scrapers
Scrapers can be attacked too. Protect your environment and data:
- Validate URLs and inputs to avoid SSRF-like patterns in internal tools (see the sketch below).
- Set timeouts on every request to prevent hangs.
- Leave TLS verification on (it’s on by default in requests).
- Sanitize filenames and paths when saving data.
- Guard credentials and tokens in environment variables or a secret manager—not in your code.
- Minimize storing personal data; if you must, secure it and comply with local laws.
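As one concrete guardrail from that list, here’s a minimal sketch of validating URLs against an allowlist of hosts before fetching; the allowed set is an assumption you’d tailor to your project, and session is the one from earlier.

from urllib.parse import urlparse

ALLOWED_HOSTS = {"quotes.toscrape.com"}  # hosts this scraper is meant to touch

def is_allowed(url: str) -> bool:
    parsed = urlparse(url)
    # Require http(s) and an allowlisted host; rejects file://, internal addresses, etc.
    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS

url = "http://quotes.toscrape.com/page/2/"
if is_allowed(url):
    resp = session.get(url, timeout=10)
else:
    raise ValueError(f"Refusing to fetch unexpected URL: {url}")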
This sounds heavy, but here’s the takeaway: small guardrails now prevent big headaches later.
Where to Go From Here
You’ve built a real scraper with retries, polite delays, pagination, and structured output. To grow your skills:
- Read more on parsing and selectors in BeautifulSoup’s docs.
- Explore Scrapy for bigger projects: Scrapy docs
- Learn advanced HTTP with Requests
And if a site offers an API, use it—it’s often the best long-term approach.
FAQ: Web Scraping with Python
Q: Is web scraping legal?
A: It depends. Public data isn’t automatically free to scrape. Check the site’s Terms of Service, comply with robots.txt, avoid personal data, and follow local laws. When in doubt, ask for permission.
Q: How do I avoid getting blocked?
A: Slow down, use a descriptive User-Agent, add retries with backoff, and follow robots.txt. Don’t hammer endpoints. If you see HTTP 429 or 403, pause and reassess.
Q: BeautifulSoup vs. Scrapy—what should I use?
A: BeautifulSoup + requests is perfect for small, one-off scrapers and learning. Scrapy is a full framework with built-in throttling, pipelines, and scheduling—great for bigger, production-grade crawlers.
Q: What if the site uses JavaScript to load content?
A: Look for the underlying JSON API in the network tab. If none exists and scraping is allowed, use a headless browser with tools like Playwright or Selenium.
Q: How can I respect robots.txt in code?
A: Use urllib.robotparser to check if a path is allowed. Still, treat robots.txt as guidance and adhere to the site’s stated rules and ToS.
Q: How do I store scraped data?
A: Use CSV for tabular data and JSON for nested structures. For larger workloads, consider a database (SQLite, PostgreSQL). Always secure and minimize sensitive data.
Q: Can I rotate user agents and proxies?
A: You can, but don’t use them to evade restrictions. The ethical route is to identify yourself clearly, limit rate, and request permission when needed.
Q: How long should my delay be between requests?
A: There’s no universal number. Start with 1–3 seconds, and adjust based on robots.txt Crawl-delay (if specified) and your impact. Err on the side of being polite.
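If you want to read Crawl-delay programmatically, urllib.robotparser exposes it via crawl_delay(); it returns None when the directive isn’t set, which is true for many sites.

import urllib.robotparser as rp

r = rp.RobotFileParser()
r.set_url("http://quotes.toscrape.com/robots.txt")
r.read()

delay = r.crawl_delay("*")  # None if no Crawl-delay directive exists
print("Suggested delay:", delay if delay is not None else "not specified")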
Final Takeaway
You don’t need fancy tools to scrape well—you need respectful habits and a clear, reliable process. With requests and BeautifulSoup, you can extract real data, the right way. Try adapting the example to your own target site next, and keep exploring best practices.
If you found this helpful, consider bookmarking or subscribing for more hands-on Python and data tutorials.