Python Web Scraping Tutorial: Build a Web Scraper with BeautifulSoup and Requests (Step-by-Step)
If you’ve ever copied data from a website “just this once,” you’ve already felt the itch. What if your script could do that for you—accurately, repeatably, and in minutes? In this guide, you’ll learn how to build a beginner-friendly, reliable web scraper in Python using BeautifulSoup and the requests library. We’ll walk through setup, parsing, pagination, saving data, and how to scrape responsibly so you don’t get blocked (or break rules).
By the end, you’ll have a working scraper you can adapt to real projects like tracking prices, aggregating job listings, or collecting research data. And yes—we’ll cover the ethics and security considerations too. Let’s get you scraping the right way.
What Is Web Scraping (and When Should You Use It)?
Web scraping is the automated extraction of data from web pages. Instead of manually copying text, a scraper requests a page, reads the HTML, and pulls out the parts you care about (like titles, prices, or links).
Before you start, keep this in mind:
- If the site offers an API, use it—it’s usually faster, cleaner, and safer.
- Always check the site’s robots.txt and Terms of Service.
- Scrape responsibly with rate limits and proper identification.
Here’s why that matters: scraping without consent or care can lead to blocked IPs, legal issues, or inaccurate data. We’ll do it right.
Helpful resources:
- Robots.txt basics: Google Search Central
- BeautifulSoup docs: bs4 documentation
- requests docs: Requests: HTTP for Humans
Tools We’ll Use (and Why)
- Python 3.10+ (any recent version works)
- requests for making HTTP requests
- BeautifulSoup (bs4) for parsing HTML
- A test website designed for practice: quotes.toscrape.com
Install dependencies:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install requests beautifulsoup4
If you’re new to virtual environments, this is a short read: Python venv docs.
Quick Win: Your First 15-Line Web Scraper
Let’s get a fast result to build confidence. We’ll scrape quotes from the first page of Quotes to Scrape—a site built for practice.
import requests
from bs4 import BeautifulSoup
url = "http://quotes.toscrape.com/"
res = requests.get(url, timeout=10)
res.raise_for_status() # fail if the request didn't succeed
soup = BeautifulSoup(res.text, "html.parser")
quotes = []
for q in soup.select(".quote"):
    text = q.select_one(".text").get_text(strip=True)
    author = q.select_one(".author").get_text(strip=True)
    tags = [t.get_text(strip=True) for t in q.select(".tags .tag")]
    quotes.append({"text": text, "author": author, "tags": tags})

for q in quotes[:3]:
    print(q)
You should see a few quotes printed with authors and tags. Small script, big smile.
Now, let’s level up from a simple proof-of-concept to a robust scraper.
Step-by-Step: Build a Reliable Python Web Scraper
1) Choose a Target Site Safely
Before scraping any site:
- Read the site’s Terms of Service.
- Check robots.txt to see what’s allowed: e.g., http://quotes.toscrape.com/robots.txt.
- Avoid logging in and scraping behind authentication without explicit permission.
- Respect privacy and avoid personal data.
Here’s why: you want to reduce risk, avoid getting blocked, and keep your project aligned with ethical and legal norms.
2) Inspect the HTML You’ll Parse
Open your browser’s DevTools (Right-click → Inspect). Hover over elements to find:
- CSS classes you can select, like .quote or .author
- Links to “Next” pages (useful for pagination)
- Any unusual patterns or empty placeholders (common in JS-heavy sites)
Tip: BeautifulSoup supports CSS selectors via soup.select("css") and soup.select_one("css"). More in the bs4 docs.
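To make the difference concrete, here’s a minimal sketch on a made-up HTML snippet (the markup below is illustrative, not taken from the target site): select_one returns the first match or None, while select returns a list of all matches.

from bs4 import BeautifulSoup

# Tiny illustrative snippet, not real markup from the site.
html = '<div class="quote"><span class="text">Hello</span><small class="author">Someone</small></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one(".quote .text")           # first match, or None if nothing matches
authors = soup.select(".quote .author")           # list of all matches (possibly empty)
print(first.get_text(strip=True))                 # Hello
print([a.get_text(strip=True) for a in authors])  # ['Someone']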
3) Send HTTP Requests the Right Way
We’ll use a session with retries, a clear User-Agent, and timeouts. This improves reliability and communicates who you are.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def make_session():
    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET", "OPTIONS"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({
        "User-Agent": "ExampleScraper/1.0 (+your-email@example.com)"
    })
    return session
session = make_session()
resp = session.get("http://quotes.toscrape.com/", timeout=10)
resp.raise_for_status()
Why this matters:
- A descriptive User-Agent shows good intent.
- Retries help recover from temporary server issues.
- Timeouts prevent your script from hanging.
More on Retry options: urllib3 Retry.
4) Parse HTML with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.text, "html.parser")
for box in soup.select(".quote"):
    quote = box.select_one(".text").get_text(strip=True)
    author = box.select_one(".author").get_text(strip=True)
    print(quote, "-", author)
Best practices:
- Use CSS selectors for readability.
- Call .get_text(strip=True) to clean whitespace.
- Check for missing elements using conditionals.
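On that last point, here’s a minimal sketch of the guard pattern, reusing soup from the snippet above: select_one returns None when nothing matches, so check before calling .get_text().

for box in soup.select(".quote"):
    text_el = box.select_one(".text")
    author_el = box.select_one(".author")
    if text_el is None or author_el is None:
        # Skip unexpected markup instead of crashing on None
        continue
    print(text_el.get_text(strip=True), "-", author_el.get_text(strip=True))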
5) Handle Pagination
Most useful datasets span multiple pages. Identify the “Next” link and loop until it’s gone.
import time
import random
from bs4 import BeautifulSoup
base_url = "http://quotes.toscrape.com"
url = base_url
all_quotes = []
while url:
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for q in soup.select(".quote"):
        all_quotes.append({
            "text": q.select_one(".text").get_text(strip=True),
            "author": q.select_one(".author").get_text(strip=True),
            "tags": [t.get_text(strip=True) for t in q.select(".tags .tag")],
        })

    next_link = soup.select_one("li.next > a")
    url = base_url + next_link["href"] if next_link else None

    # Polite delay (randomized)
    time.sleep(random.uniform(1.0, 2.5))
print(f"Scraped {len(all_quotes)} quotes.")
Pro tip: Don’t scrape too fast. Randomized delays reduce load and avoid tripping rate limits.
6) Save Data to CSV or JSON
Choose a format based on your next step. CSV is great for spreadsheets; JSON is ideal for nested data.
import csv
import json
# CSV
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
writer.writeheader()
for row in all_quotes:
row_copy = row.copy()
row_copy["tags"] = "|".join(row["tags"])
writer.writerow(row_copy)
# JSON
with open("quotes.json", "w", encoding="utf-8") as f:
json.dump(all_quotes, f, ensure_ascii=False, indent=2)
Docs for formats:
- CSV: csv module
- JSON: json module
Be a Good Web Citizen: Safety, Ethics, and Security
Scraping isn’t just code—it’s conduct. Here’s the checklist I recommend:
- Check robots.txt and follow it
  - Example: quotes.toscrape.com/robots.txt
  - Learn the rules: Google robots.txt guide
- Respect rate limits
  - Add time.sleep between requests.
  - Back off on HTTP 429 (Too Many Requests).
- Identify yourself
  - Use a descriptive User-Agent with contact info.
- Avoid personal or sensitive data
  - Don’t collect PII without consent.
  - If you must store sensitive data, encrypt it and restrict access (and confirm legal compliance like GDPR/CCPA if applicable).
- Don’t bypass technical protections
  - If a site actively blocks scraping, seek permission or use an official API.
- Keep your secrets secret
  - Never hardcode credentials or API keys in your script.
  - Use environment variables if needed.
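On that last point, here’s a minimal sketch of reading a secret from an environment variable instead of hardcoding it; the variable name SCRAPER_API_KEY is just a placeholder.

import os

# Read the credential from the environment; fail fast if it's missing.
api_key = os.environ.get("SCRAPER_API_KEY")
if api_key is None:
    raise RuntimeError("Set the SCRAPER_API_KEY environment variable first.")

# Example use: send it as a header rather than embedding it in code.
headers = {"Authorization": f"Bearer {api_key}"}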
Here’s a simple way to programmatically check robots.txt permissions using Python’s robotparser:
import urllib.robotparser as rp
robots_url = "http://quotes.toscrape.com/robots.txt"
r = rp.RobotFileParser()
r.set_url(robots_url)
r.read()
allowed = r.can_fetch("*", "/")
print("Allowed to scrape root path:", allowed)
Note: robots.txt is advisory, not law—but respecting it is a best practice.
Common Scraper Errors (and How to Fix Them)
- HTTP 403 Forbidden
  - Add a legitimate User-Agent.
  - Reduce request rate.
  - Check if the content is behind authentication or uses anti-bot protections.
- HTTP 404 Not Found
  - Double-check URLs.
  - Some sites change structure per session or locale.
- HTTP 429 Too Many Requests
  - Slow down. Increase delay and add exponential backoff.
- Empty or incomplete data
  - The site might be loading content with JavaScript.
  - Check whether there’s an API. If not, tools like Selenium can render JS—but only use them if permitted.
- Different HTML between browser and script
  - Some servers vary content by headers. Try sending Accept-Language and Accept headers along with User-Agent (see the sketch after this list).
  - Check whether cookies or a session are actually required.
- Connection errors and timeouts
  - Set reasonable timeouts.
  - Use retries for intermittent network issues.
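For the header mismatch case above, here’s a minimal sketch of sending browser-like Accept and Accept-Language headers alongside your User-Agent, reusing the session from Step 3; the header values are common examples, not required ones.

# Browser-like headers; copy what your own browser sends (DevTools → Network) if needed.
headers = {
    "User-Agent": "ExampleScraper/1.0 (+your-email@example.com)",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
resp = session.get("http://quotes.toscrape.com/", headers=headers, timeout=10)
resp.raise_for_status()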
Full Example: A Minimal, Robust Scraper
This script scrapes every quote and author from the site, handles pagination, uses retries, sets headers, and saves to JSON.
import time
import random
import json
from typing import List, Dict
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
BASE_URL = "http://quotes.toscrape.com"
def make_session() -> requests.Session:
    s = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=0.8,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    s.mount("http://", adapter)
    s.mount("https://", adapter)
    s.headers.update({
        "User-Agent": "QuotesScraper/1.0 (+you@example.com)"
    })
    return s

def parse_quotes(html: str) -> List[Dict]:
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for q in soup.select(".quote"):
        text_el = q.select_one(".text")
        author_el = q.select_one(".author")
        if not text_el or not author_el:
            continue
        tags = [t.get_text(strip=True) for t in q.select(".tags .tag")]
        results.append({
            "text": text_el.get_text(strip=True),
            "author": author_el.get_text(strip=True),
            "tags": tags,
        })
    return results

def find_next_url(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("li.next > a")
    if not next_link:
        return None
    href = next_link.get("href")
    if not href:
        return None
    return BASE_URL + href

def crawl_all_quotes() -> List[Dict]:
    session = make_session()
    url = BASE_URL
    data: List[Dict] = []
    while url:
        resp = session.get(url, timeout=10)
        resp.raise_for_status()
        page_quotes = parse_quotes(resp.text)
        data.extend(page_quotes)
        url = find_next_url(resp.text)
        # polite randomized pause
        time.sleep(random.uniform(0.8, 2.2))
    return data

if __name__ == "__main__":
    quotes = crawl_all_quotes()
    print(f"Scraped {len(quotes)} quotes.")
    with open("quotes.json", "w", encoding="utf-8") as f:
        json.dump(quotes, f, ensure_ascii=False, indent=2)
Adapting this to your own project:
- Update BASE_URL.
- Adjust selectors in parse_quotes.
- Expand fields you collect.
- Add CSV export if needed.
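If you want that CSV export, here’s a minimal sketch that mirrors the earlier CSV section; save_csv is a new helper name, and it expects the list returned by crawl_all_quotes().

import csv
from typing import Dict, List

def save_csv(quotes: List[Dict], path: str = "quotes.csv") -> None:
    # Flatten the tags list so each row fits a flat CSV layout.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
        writer.writeheader()
        for row in quotes:
            writer.writerow({**row, "tags": "|".join(row["tags"])})

Call save_csv(quotes) inside the __main__ block, right next to the JSON dump.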
What About JavaScript-Heavy Sites?
Some sites render key content via JavaScript. Requests + BeautifulSoup parse the HTML returned by the server; they don’t execute JS. You have options:
- Look for an API call in the network tab. Many sites fetch JSON behind the scenes (see the sketch below).
- If scraping is permitted but requires rendering, consider:
  - Playwright or Selenium for controlled browser automation.
  - A framework like Scrapy for large-scale crawling: Scrapy docs
Always verify you’re allowed to do this. If a site works hard to block automation, stop and seek permission or use official endpoints.
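If you do spot a JSON endpoint and you’re permitted to use it, you can often call it directly with the same session and skip HTML parsing entirely. A minimal sketch; the URL and the "quotes" key below are hypothetical placeholders, so substitute whatever you actually see in DevTools.

# Hypothetical endpoint; replace with the real URL from the Network tab.
api_url = "https://example.com/api/quotes"

resp = session.get(api_url, params={"page": 1}, timeout=10)
resp.raise_for_status()
payload = resp.json()  # parsed JSON, no BeautifulSoup needed

for item in payload.get("quotes", []):  # key name is an assumption
    print(item.get("text"), "-", item.get("author"))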
Performance Tips (When You’re Ready)
- Use caching to avoid re-downloading pages during development: requests-cache (see the sketch after this list)
- Parallelization can help—but use with care. Respect the site’s capacity and your own rate limits.
- Normalize and clean data as you parse. The less cleanup later, the better.
- Log your progress and errors for long runs.
- Store computed fingerprints of pages to detect changes over time (useful for monitoring).
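For the caching tip above, here’s a minimal sketch using requests-cache (pip install requests-cache). CachedSession is a drop-in replacement for requests.Session; the cache name and one-hour expiry below are just example values.

import requests_cache

# Responses are stored locally (SQLite by default), so repeated runs during
# development read from disk instead of re-downloading pages.
session = requests_cache.CachedSession("dev_cache", expire_after=3600)

resp = session.get("http://quotes.toscrape.com/", timeout=10)
print("Served from cache:", resp.from_cache)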
Security Considerations for Scrapers
Scrapers can be attacked too. Protect your environment and data:
- Validate URLs and inputs to avoid SSRF-like patterns in internal tools (see the sketch below).
- Set timeouts on every request to prevent hangs.
- Leave TLS verification on (it’s on by default in requests).
- Sanitize filenames and paths when saving data.
- Guard credentials and tokens in environment variables or a secret manager—not in your code.
- Minimize storing personal data; if you must, secure it and comply with local laws.
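As one concrete guardrail from that list, here’s a minimal sketch of validating URLs against an allowlist of hosts before fetching; the allowed set is an assumption you’d tailor to your project, and session is the one from earlier.

from urllib.parse import urlparse

ALLOWED_HOSTS = {"quotes.toscrape.com"}  # hosts this scraper is meant to touch

def is_allowed(url: str) -> bool:
    parsed = urlparse(url)
    # Require http(s) and an allowlisted host; rejects file://, internal addresses, etc.
    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS

url = "http://quotes.toscrape.com/page/2/"
if is_allowed(url):
    resp = session.get(url, timeout=10)
else:
    raise ValueError(f"Refusing to fetch unexpected URL: {url}")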
This sounds heavy, but here’s the takeaway: small guardrails now prevent big headaches later.
Where to Go From Here
You’ve built a real scraper with retries, polite delays, pagination, and structured output. To grow your skills:
- Read more on parsing and selectors in BeautifulSoup’s docs.
- Explore Scrapy for bigger projects: Scrapy docs
- Learn advanced HTTP with Requests
And if a site offers an API, use it—it’s often the best long-term approach.
FAQ: Web Scraping with Python
Q: Is web scraping legal?
A: It depends. Public data isn’t automatically free to scrape. Check the site’s Terms of Service, comply with robots.txt, avoid personal data, and follow local laws. When in doubt, ask for permission.
Q: How do I avoid getting blocked?
A: Slow down, use a descriptive User-Agent, add retries with backoff, and follow robots.txt. Don’t hammer endpoints. If you see HTTP 429 or 403, pause and reassess.
Q: BeautifulSoup vs. Scrapy—what should I use?
A: BeautifulSoup + requests is perfect for small, one-off scrapers and learning. Scrapy is a full framework with built-in throttling, pipelines, and scheduling—great for bigger, production-grade crawlers.
Q: What if the site uses JavaScript to load content?
A: Look for the underlying JSON API in the network tab. If none exists and scraping is allowed, use a headless browser with tools like Playwright or Selenium.
Q: How can I respect robots.txt in code?
A: Use urllib.robotparser to check if a path is allowed. Still, treat robots.txt as guidance and adhere to the site’s stated rules and ToS.
Q: How do I store scraped data?
A: Use CSV for tabular data and JSON for nested structures. For larger workloads, consider a database (SQLite, PostgreSQL). Always secure and minimize sensitive data.
Q: Can I rotate user agents and proxies?
A: You can, but don’t use them to evade restrictions. The ethical route is to identify yourself clearly, limit rate, and request permission when needed.
Q: How long should my delay be between requests?
A: There’s no universal number. Start with 1–3 seconds, and adjust based on robots.txt Crawl-delay (if specified) and your impact. Err on the side of being polite.
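If you want to read Crawl-delay programmatically, urllib.robotparser exposes it via crawl_delay(); it returns None when the directive isn’t set, which is true for many sites.

import urllib.robotparser as rp

r = rp.RobotFileParser()
r.set_url("http://quotes.toscrape.com/robots.txt")
r.read()

delay = r.crawl_delay("*")  # None if no Crawl-delay directive exists
print("Suggested delay:", delay if delay is not None else "not specified")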
Final Takeaway
You don’t need fancy tools to scrape well—you need respectful habits and a clear, reliable process. With requests and BeautifulSoup, you can extract real data, the right way. Try adapting the example to your own target site next, and keep exploring best practices.
If you found this helpful, consider bookmarking or subscribing for more hands-on Python and data tutorials.