Learn to scrape news headlines and article content ethically using Python. Avoid blocks, handle JavaScript, and extract clean data – step‑by‑step with working code.
Scraping isn’t magic – it’s just automated browsing. But the web has rules, and breaking them can get your IP banned or worse.
The number one mistake is using a default Python requests call with no headers. The target site sees a bot and slams the door. I’ve seen this happen to dozens of new coders. The fix is simple: make your scraper look like a real browser.
I am not a lawyer, but I have consulted with one for a commercial scraping project. Here’s the practical summary:
Scraping public, non‑login content for personal or research use is generally fine.
Ignoring robots.txt or bypassing paywalls is not fine.
Republishing full articles without permission is copyright infringement.
GDPR and CCPA apply if you scrape personal data (like commenter names).
When in doubt, ask yourself: “Would I be okay if someone did this to my website?” If the answer is no, don’t do it.

You don’t need a powerful server. A basic laptop with Python installed is enough for thousands of articles. Here’s my personal toolkit:
Python 3.9+ (I use 3.11)
Requests – for fetching pages
BeautifulSoup4 – for parsing HTML
Pandas – for saving data to CSV/JSON
Playwright – for JavaScript‑heavy sites
Install everything with:
```bash
pip install requests beautifulsoup4 pandas playwright
playwright install
```

Pick a site that is:
Public and permissive – check site.com/robots.txt for Disallow: / (if present, don’t scrape).
Static HTML – easier for your first try. BBC’s technology section or Reuters’ public pages are good.
Not behind a login – avoid paywalled sites like The Wall Street Journal.
Open your target news page in Chrome. Right‑click a headline and choose “Inspect”. The Elements panel shows you the HTML structure.
Look for patterns:
Headlines inside h2 or h3 tags, often with a class like title, headline, or story-heading.
Links inside "a" tags right next to the headline.
Dates inside "time" elements.
Write down the CSS selectors you see. This is your blueprint.
Here’s a complete, working example that scrapes headlines from a public news listing page. I’ve used this pattern for dozens of projects:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.reuters.com/technology/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

response = requests.get(url, headers=headers)
if response.status_code != 200:
    print(f"Failed: {response.status_code}")
    exit()

soup = BeautifulSoup(response.text, "html.parser")
articles = []

# No delay needed in this loop – it parses HTML already in memory,
# making zero extra requests to the site
for article in soup.find_all("article"):
    title_tag = article.find("h3") or article.find("h2")
    if not title_tag:
        continue
    title = title_tag.get_text(strip=True)
    link_tag = article.find("a")
    link = link_tag.get("href") if link_tag else ""
    if link and not link.startswith("http"):
        link = "https://www.reuters.com" + link
    articles.append({"title": title, "url": link})

df = pd.DataFrame(articles)
df.to_csv("news_headlines.csv", index=False)
print(f"Scraped {len(articles)} articles")
```

Run this script. You’ll get a CSV file with headlines and URLs. That’s your first successful scrape.
Most news sites list articles across multiple pages. Look for a “Next” button or a URL pattern like ?page=2. Here’s a safe pagination loop:
```python
import time

base_url = "https://example-news.com/page/{}"
for page_num in range(1, 6):  # first 5 pages
    url = base_url.format(page_num)
    # ... scraping code from above ...
    time.sleep(2)  # critical delay between pages
```

Why 2 seconds? Because a human clicks every few seconds, not 50 times per second. I’ve kept my IP safe for years by simply being slow.
You request a page with requests, but the headline you see in your browser isn’t in the response text. Open “View Page Source” (Ctrl+U) and search for text you know is on screen. If it’s missing, JavaScript is loading the content.
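You can run the same check from a script. Here’s a minimal sketch – example-news.com is the same placeholder domain used elsewhere in this guide:

```python
import requests

# Fetch the raw HTML the server sends, before any JavaScript runs
html = requests.get("https://example-news.com/js-section").text

# Pick a headline you can see in your browser; if it's absent from the
# raw response, JavaScript is injecting the content client-side
print("A headline you saw on screen" in html)
```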
Many modern news sites fetch articles via a JSON API. Here’s how to find it:
Open Developer Tools (F12) → Network tab.
Reload the page.
Filter by Fetch/XHR.
Look for a request with articles, items, feed, or news in its name.
Click it and check the Preview tab – if you see structured JSON with headlines, you’ve struck gold.
Once you have the API endpoint, you can scrape it directly without parsing HTML:
```python
import requests

api_url = "https://example-news.com/api/v1/latest?limit=20"
data = requests.get(api_url).json()
for item in data["articles"]:
    print(item["title"], item["url"])
```

This method is faster, more reliable, and less likely to get you blocked.
If the hidden API is too complex or doesn’t exist, use a headless browser. I prefer Playwright over Selenium because it’s faster and handles modern JavaScript better.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example-news.com/js-section")
    page.wait_for_selector("article", timeout=10000)
    html = page.content()
    browser.close()

# Now parse `html` with BeautifulSoup as before
```

Playwright is slower than plain requests, but it can scrape anything a human can see – including infinite-scroll pages.
From my own trial and error, these three things prevent 90% of blocks:
Rotate User‑Agents – Keep a list of real browser strings and pick one randomly per request.
Use a session with cookies – requests.Session() makes you look like a returning visitor.
Add random delays – time.sleep(random.uniform(1, 3)) between requests.
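Here’s a minimal sketch combining all three habits. The User-Agent strings are just examples – swap in current ones from real browsers:

```python
import random
import time
import requests

# Example User-Agent strings – replace with current real-browser values
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

session = requests.Session()  # keeps cookies, like a returning visitor

def polite_get(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = session.get(url, headers=headers, timeout=10)
    time.sleep(random.uniform(1, 3))  # random human-like pause
    return response

# Usage: response = polite_get("https://example-news.com/page/2")
```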
If you’re scraping more than a few thousand pages per day, your home IP will eventually get rate‑limited. Here’s my rule:
Under 1,000 requests/day – no proxies needed, just be polite.
1,000–10,000/day – use a rotating proxy service (I like Bright Data or ScraperAPI).
Over 10,000/day – consider switching to a news API instead of scraping.
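If you do reach the proxy tier, requests accepts a proxies dict. A sketch with placeholder credentials – your provider supplies the real endpoint:

```python
import requests

# Placeholder endpoint and credentials – your proxy provider supplies these
proxies = {
    "http": "http://username:[email protected]:8000",
    "https": "http://username:[email protected]:8000",
}

# Combine with the browser-like headers and delays shown earlier
response = requests.get("https://example-news.com/page/1",
                        proxies=proxies, timeout=10)
```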
If you do get blocked, don’t panic, and don’t keep hammering. Here’s a step‑by‑step recovery:
Stop all requests immediately.
Wait at least one hour (or a full day for CAPTCHAs).
Increase delays to 5–10 seconds.
Change your IP (restart your home router or switch to a proxy).
If the site still blocks you, respect it – they don’t want to be scraped.
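If you want that patience baked into code, here’s a minimal backoff sketch – the delays are my own habit, not a standard:

```python
import time
import requests

def fetch_with_backoff(url, headers, max_retries=3):
    delay = 60  # start with a one-minute pause after the first block
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response
        print(f"Got {response.status_code}; waiting {delay}s before retrying")
        time.sleep(delay)
        delay *= 10  # escalate: 1 minute, then 10, then ~100
    return None  # still blocked – take the hint and stop
```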
Raw HTML from news sites is messy. Use BeautifulSoup to strip out unwanted elements before extracting text:
```python
# Remove script, style, and ad containers
for unwanted in soup.select("script, style, .advertisement, .comments-section"):
    unwanted.decompose()

# Get clean body text
clean_text = soup.get_text(separator=" ", strip=True)
```
News sites use wild date formats: "2 hours ago", "Mar 5, 2025", "2025-03-05T14:30:00Z". The dateutil parser handles almost any absolute format; relative phrases like "2 hours ago" are beyond it, so use the dateparser package for those:
```python
from dateutil import parser

# Absolute formats parse directly
parsed = parser.parse("Mar 5, 2025")               # -> datetime(2025, 3, 5, 0, 0)
parsed_iso = parser.parse("2025-03-05T14:30:00Z")  # -> timezone-aware datetime

# Relative phrases need dateparser (pip install dateparser)
import dateparser
relative = dateparser.parse("2 hours ago")  # -> datetime two hours before now
```
For small to medium projects, CSV is fine. For larger datasets (tens of thousands of articles), use SQLite:

```python
import sqlite3

conn = sqlite3.connect("news_articles.db")
df.to_sql("articles", conn, if_exists="replace", index=False)
conn.close()
```

Last year, I needed to track how three news outlets covered a major tech event over two weeks. I scraped 10,000 headlines and summaries. Here’s exactly what I did:
Targets – BBC, Reuters, and The Guardian (all allow scraping in their robots.txt).
Method – Used RSS feeds for BBC and Reuters (fast and legal).
For The Guardian – Used their official open API (free tier).
Avoided scraping completely – because APIs are always better when available.
Outcome – Clean dataset within hours, no blocks, happy client.
The lesson: always check for APIs and RSS first. Scraping is your backup, not your first choice.
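For reference, reading an RSS feed takes only a few lines with the feedparser package. The URL below is BBC’s public technology feed at the time of writing – verify it’s still current before relying on it:

```python
import feedparser  # pip install feedparser

# BBC technology RSS feed – check the site's RSS page for current URLs
feed = feedparser.parse("http://feeds.bbci.co.uk/news/technology/rss.xml")
for entry in feed.entries:
    print(entry.title, entry.link)
```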
Include a contact email in your User‑Agent string. Example: "ResearchBot/1.0 (contact: [email protected])". This shows you’re a human who can be reached if there’s a problem.
To check a site’s rules, fetch the file https://targetsite.com/robots.txt and look for lines like:
```text
User-agent: *
Disallow: /search/
Disallow: /comments/
```

Don’t scrape any path listed under Disallow. Some sites also specify a crawl delay – honour it.
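Python’s standard library can run this check for you. A small sketch using urllib.robotparser, with targetsite.com as the placeholder from above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://targetsite.com/robots.txt")
rp.read()

# False for any path listed under Disallow for your user agent
print(rp.can_fetch("*", "https://targetsite.com/search/"))

# Returns the Crawl-delay value if the site sets one, else None
print(rp.crawl_delay("*"))
```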
For transparency, log the date, target site, number of requests, and your IP range. If a site owner contacts you, you can respond professionally.
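A minimal sketch with the standard logging module – the field names and values here are illustrative:

```python
import logging

logging.basicConfig(
    filename="scrape_log.txt",
    level=logging.INFO,
    format="%(asctime)s %(message)s",
)

# Record enough to answer a site owner's questions later
logging.info("target=%s requests_sent=%d ip_range=%s",
             "example-news.com", 250, "203.0.113.0/24")
```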
If a site blocks you after you’ve tried polite delays and proxies, take the hint. Some publishers offer paid APIs or data feeds. Buying the data is often cheaper than fighting a legal battle or constantly rewriting broken scrapers.
If your scraper suddenly returns nothing, the most likely cause is that the site changed its HTML. This happens all the time. Update your CSS selectors by inspecting the page again.
If you’re blocked even with good headers, the server may be checking for JavaScript capabilities or specific headers. Use Playwright instead of requests.
Some sites load article text via lazy loading. You may need to scroll the page in Playwright before extracting content.
```python
# Scroll to the bottom so lazy-loaded content renders
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(2000)  # give the new content time to load
```

Scraping news articles is a legitimate technical skill when done responsibly. It powers everything from personal morning briefings to academic research to market intelligence. The difference between a good scraper and a bad one comes down to respect – respect for the website, respect for the data, and respect for the law.
Start small. Practice on a site that allows scraping. Write clean, polite code. And always ask yourself: “Would I want someone doing this to my site?”