Web Scraping with Python or JavaScript
- source: https://news.ycombinator.com/item?id=24898016 Oct 27, 2020
- source: https://news.ycombinator.com/item?id=26090243 Feb 10, 2021
- source: https://news.ycombinator.com/item?id=24420120 Sep 09, 2020
Reading through the comments, my takeaway is to use Requests/Scrapy when the data is mostly static or an API is exposed that makes fetching the data easier. Otherwise, for scraping dynamic content, use Selenium/Puppeteer etc. There are some other tools worth exploring too, such as Apify, Helium and puppeteer-extra
scrapy
the tl;dr for all web scraping is to just use scrapy (and scrapyd) - otherwise you end up just writing a poorer implementation of what has already been built
My only recent change is that we no longer use Items and ItemLoaders from scrapy - we’ve replaced it with a custom pipeline of Pydantic schemas and objects
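As a rough sketch of that idea (not their actual code), a Scrapy item pipeline can validate scraped dicts with Pydantic before they go any further; the ArticleSchema fields below are hypothetical, and the pipeline assumes the spider yields plain dicts:
from typing import Optional
from pydantic import BaseModel, ValidationError
from scrapy.exceptions import DropItem

class ArticleSchema(BaseModel):
    # hypothetical schema; adjust the fields to your data
    url: str
    title: str
    price: Optional[float] = None

class PydanticValidationPipeline:
    # enable via the ITEM_PIPELINES setting in settings.py
    def process_item(self, item, spider):
        try:
            return ArticleSchema(**item).dict()  # .model_dump() on Pydantic v2
        except ValidationError as e:
            raise DropItem(f"invalid item: {e}")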
I think I’d still choose Scrapy over JS in this case. While it can be a bit convoluted, for real production stuff I don’t know any better choices.
I have myself deployed a Scrapy web scraper as an AWS Lambda function and it has worked quite nicely. Every day for about the last year now, it has been scraping some websites to make my life a little easier.
approach
Using a headless browser for scraping is a lot slower and resource intensive than parsing HTML.
- I don’t see this as a concern: in all the scraping I’ve done, the only bottleneck was the intentional throttling/rate limiting, not the speed and resources spent by the headless browser; a small, cheap machine could easily process many, many times more requests than it would be reasonable to crawl.
- starting a scraping project with a headless browser might be excessively expensive if you don’t need the additional features.
I did web scraping professionally for two years, on the order of 10M pages per day. The performance with a browser is abysmal and requires tonnes of memory, so it’s not financially viable. We used browsers for some jobs, but rendered content isn’t a problem: you can also simulate the API calls (common) and read the JSON, or regex the inline script and try to do something with that.
I’d say 99% of the time you can get by without a browser.
As someone who has done a good bit of scraping, how a website is designed dictates how I scrape.
If it’s a static website that has consistently structured HTML and is easy to enumerate through all the webpages I’m looking for, then simple python requests code will work.
The less clear case is when to use a headless browser vs. reverse engineering JS/server-side APIs. Typically, I will do a 10-minute dive into the client-side JS and monitor AJAX requests to see if it would be super easy to hit some API that returns JSON with my data. If reverse engineering seems too hairy, then I will just use a headless browser.
I have a really strong preference for hitting JSON APIs directly because, well, you get JSON! Also you usually get more data than you even knew existed.
Then again, if I was creating a spider to recursively crawl a non-static website, then I think Headless is the path of least resistance. But usually, I’m trying to get data in the HTML, and not the whole document.
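For the hit-the-JSON-API-directly route, the pattern is usually just replaying the XHR you saw in the browser’s network tab; a minimal requests sketch, where the endpoint, parameters and response shape are all made up:
import requests

url = "https://example.com/api/v1/listings"   # hypothetical endpoint spotted in devtools
params = {"page": 1, "per_page": 100}
headers = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}
resp = requests.get(url, params=params, headers=headers, timeout=30)
resp.raise_for_status()
for item in resp.json()["results"]:           # assumed response shape
    print(item["id"], item["title"])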
One tip I would pass on when trying to scrape data from a website: start by using wget in mirror mode to download the useful pages. It’s much faster to iterate on the scraping once you have the data locally. It’s also less likely to accidentally kill the site or attract the attention of the host.
- Or just use scrapy’s caching functionality. Super convenient.
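For reference, Scrapy’s HTTP cache is switched on in settings.py; these are real Scrapy settings, with values chosen here purely as an example:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"              # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0            # 0 = cached responses never expire
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503]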
A great web-scraping architecture is the pipeline model, similar to 3D rendering pipelines. Stage 1: render the HTML. Stage 2: save the HTML to disk. Stage 3: parse and translate the HTML into whatever output you need (JSON, CSV, etc.).
It’s great if each of these processes can be invoked separately, so that after the HTML is saved, you don’t need to redownload it, unless the source has changed.
By dividing scraping into rendering, caching and parsing, you save yourself a lot of web requests. This also helps keep the website from triggering IP blocking, DDoS protection and rate limiting.
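A bare-bones sketch of that three-stage split, assuming static pages so plain requests can stand in for the render step (swap in a headless browser where needed); the cache directory and parse logic are placeholders:
import hashlib
import pathlib
import requests
from bs4 import BeautifulSoup

CACHE = pathlib.Path("cache")
CACHE.mkdir(exist_ok=True)

def render(url):
    # stage 1: download (or render) the HTML
    return requests.get(url, timeout=30).text

def save(url, html):
    # stage 2: cache the HTML to disk, keyed by a hash of the URL
    path = CACHE / (hashlib.sha1(url.encode()).hexdigest() + ".html")
    path.write_text(html, encoding="utf-8")
    return path

def parse(path):
    # stage 3: parse the cached copy into whatever output you need
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    return {"title": soup.title.string if soup.title else None}

url = "https://example.com/"
print(parse(save(url, render(url))))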
scaling
“I send a command at a random time between 11pm and 4am to wake up an ec2 instance.”
- if I had to do it I’d try something like the following:
- Set an autoscaling group with your instance template, max instances 1, min instances 0, desired instances 0 (nothing is running).
- Set up a Lambda function that sets the autoscaling group’s desired instances to 1 (see the boto3 sketch below).
- Link that function to an API Gateway call, give it an auth key, etc.
- From any machine you have, set up your cron with a random sleep and a curl call to the API.
- You might as well just call the ASG API directly.
- Why use autoscaling and not just launch the instance directly from lambda? The run time is short so there’s no danger of two instances running in parallel
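A minimal boto3 sketch of the Lambda step from the list above, taking the autoscaling-group route; the group name is hypothetical:
import boto3

def lambda_handler(event, context):
    asg = boto3.client("autoscaling")
    # wake the scraper by bumping the autoscaling group from 0 to 1 instance
    asg.set_desired_capacity(
        AutoScalingGroupName="scraper-asg",   # hypothetical group name
        DesiredCapacity=1,
        HonorCooldown=False,
    )
    return {"status": "scaling to 1"}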
I’d say that really depends on your scale and what you’re doing with the content you scrape.
In my experience with large-scale scraping you’re much better off using something like Java, where you can more easily have a thread pool with thousands of threads (or better yet, Kotlin coroutines) handling the crawling itself and a thread pool sized to the number of cores handling CPU-bound tasks like parsing.
scheduling in AWS Lambdas
AWS Lambdas are an easy way to get scheduled scraping jobs running.
I use their Python-based chalice framework (https://github.com/aws/chalice) which allows you to add a decorator to a method for a schedule,
@app.schedule(Rate(30, unit=Rate.MINUTES))
It’s also a breeze to deploy.
chalice deploy
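Putting that decorator in context, a minimal Chalice app might look like the following; scrape() is a placeholder for your own code:
from chalice import Chalice, Rate

app = Chalice(app_name="scraper")

@app.schedule(Rate(30, unit=Rate.MINUTES))
def run(event):
    # event is the CloudWatch schedule event; kick off the actual scraping here
    scrape()

def scrape():
    ...  # placeholder for the real scraping logic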
pandas
Before jumping into frameworks, if your data is lucky enough to be stored in an html table:
import pandas as pd
dfs = pd.read_html(url)
Where ‘dfs’ is an array of dataframes - one item for each html table on the page.
- It reads HTML and returns the tables contained in the HTML as pandas dataframes. It’s a simple way to scrape tabular data from websites.
- Sometimes it’s also helpful to use beautiful soup to isolate the elements you want, feed the text of the elements into StringIO and give that to read_html.
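That combination looks roughly like this; the URL and CSS selector are made up for the example:
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/stats", timeout=30).text
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table#results")         # hypothetical selector for the table you want
df = pd.read_html(StringIO(str(table)))[0]       # read_html still returns a list of dataframes
print(df.head())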
residential VPN
Recaptcha comes to mind.
That said, there are quite a few services which battle these systems for you nowadays (such as scraperapi - not affiliated, not a user). They are not always successful, but they have an advantage of maaany residential proxies (no doubt totally ethically obtained /s, but that’s another story).
browserless.io
The biggest struggle I had while building web scrapers was scaling Selenium. If you need to launch Selenium hundreds of thousands of times per month, you need a lot of compute, which is really expensive on EC2.
A couple of years ago, I discovered browserless.io, which does this job for you, and it’s amazing. I really don’t know how they built it, but it just scales without any limit.
- For browserless.io, the developer behind it talks about the tech stack in this podcast: https://runninginproduction.com/podcast/62-browserless-gives-you-fast-scalable-and-reliable-browser-automation
For more generic web indexing you need to use a browser. You no longer index pages served by a server; you index pages rendered by JavaScript apps in the browser. So as part of the “fetch” stage I usually leave parsing of the title and other page metadata to a JavaScript script running inside the browser (using https://www.browserless.io/), and then as part of the “parse” phase I use cheerio to extract links and such. It is very tempting to do everything in the browser, but architecturally it does not belong there. So you need to find the balance that works best for you.
- I’m the founder of browserless.io, and agree with pretty much everything you’re saying. Our infrastructure actually uses this procedure for some of our own scraping needs: we scrape puppeteer’s GH documentation page to build out our debugger’s autocomplete tool. To do this, we “goto” the page, extract the page’s content, and then hand it off to nodejs libraries for parsing. This has two benefits: it cuts down the time you have the browser open and running, and lets you “offload” some of that work to your back-end with more sophisticated libraries. You get the best of both worlds with this approach, and it’s one we generally recommend to folks everywhere. Also a great way that we “dogfood” our own product as well :)
- What is the reason you are not just getting page content directly with HTTP request? Is headless browser providing some benefits in your case?
- Yes: often the case is that JS does some kind of data-fetching, API calls, or whatever else to render a full page (single-page apps for instance). With Github being mostly just HTML markup and not needing a JS runtime we could have definitely gone that route. The rationale was that we had a desire to.
OpenFaaS
OpenFaaS has been pretty easy to work with for anything we’ve thrown at it, ranging from web scraping to ML deployment. If anyone here has been on the fence, definitely give it a shot.
Back when we first set it up, a few of the defaults were utter garbage, like the autoscaling, which ramped up or down (thankfully slowly) between MIN and MAX instances based purely on whether you were above or below a QPS threshold. But there aren’t all that many features, so reading the whole manual and configuring it the way you want is a cinch.
- OpenFaaS is great … But it’s quite slow compared to something like Nuclio: https://github.com/nuclio/nuclio
- High-Performance Serverless event and data processing platform
Could you give a TL;DR version of how we can use OpenFaaS with Google Cloud Functions? Or is it meant to be deployed to GKE?
OpenFaaS is more-or-less a competitor to Google Cloud Functions or AWS Lambda. None is really quite a subset of the other in terms of features, so you might gain some benefit by using multiple FaaS offerings, but they all occupy the same niche.
You can deploy OpenFaas on any Kubernetes offering, Google Cloud Run, Docker Swarm, etc… It runs on your favorite Docker substrate without much hassle.
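For a feel of what a function looks like, the python3 template gives you a handler module you fill in and then deploy with faas-cli up (pointed at the generated YAML); a minimal sketch, assuming requests has been added to the function’s requirements.txt:
# handler.py  (scaffolded by: faas-cli new scrape-fn --lang python3)
import requests

def handle(req):
    # req is the raw request body; here we treat it as a URL to fetch
    html = requests.get(req.strip(), timeout=30).text
    return str(len(html))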
javascript
Text nodes are DOM nodes. If I were making a business of this I would automate the shit out of it by:
- Gathering all text nodes directly
- Eliminating all text nodes that only contain whitespace
- Adding context. Since text nodes are DOM nodes, you can get information about the containing element directly from the node itself.
Hands down walking the DOM will be programmatically faster to write and execute than anything else you can write in any language.
If you’re using JavaScript for scraping, you should go straight to the logical conclusion and run your scraper inside a real browser (potentially headless) - using Puppeteer or Selenium or Playwright.
My current favourite stack for this is Selenium + Python - it lets me write most of my scraper in JavaScript that I run inside of the browser, but having Python to control it means I can really easily write the results to a SQLite database while the scraper is running.
Selenium lets you add random delays between your actions which could help avoid triggering a firewall to block you.
Good practice anyway so you don’t overload the site and find your logs empty or full of gaps.
- Good approach, but advanced Selenium detection goes beyond heuristics. Selenium injects JavaScript into the page to function, and the presence of this is how Selenium is detected.
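A small sketch of that Selenium + Python + SQLite pattern, with a random delay between page loads and a bit of JavaScript evaluated inside the real browser; the URLs and the extraction script are placeholders:
import random
import sqlite3
import time
from selenium import webdriver

driver = webdriver.Chrome()
db = sqlite3.connect("scrape.db")
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")

for url in ["https://example.com/a", "https://example.com/b"]:    # placeholder URLs
    driver.get(url)
    time.sleep(random.uniform(2, 6))    # random delay so you don't hammer the site
    # run the extraction logic as JavaScript inside the page
    title = driver.execute_script("return document.title")
    db.execute("INSERT INTO pages VALUES (?, ?)", (url, title))
    db.commit()

driver.quit()
db.close()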
+1 to using cheerio.js. When I need to write a web scraper, I’ve used Node’s request library to get the HTML text and cheerio to extract links and resources for the next stage. I’ve also used cheerio when I want to save a functioning local cache of a webpage, since I can have it transform all the various multi-server references for <img>, <a>, <script>, etc. on the page to locally valid URLs and then fetch those URLs.
- the reason to upgrade from cheerio to jsdom is if you want to run scripts. E.g., for client-rendered apps, or apps that pull their data from XHR. Since jsdom implements the script element, and the XHR API, and a bunch of other APIs that pages might use, it can get a lot further in the page lifecycle than just “parse the bytes from the server into an initial DOM tree”.
- Running the [arbitrary] scripts not written by me is what I usually try to avoid and fear when scraping.
Another +1 for cheerio.js
If I recall correctly, what was really helpful about it was that I could write whatever code I needed to query and parse the DOM in the browser console and then copy and paste it into a script with almost no changes.
It made it really simple to go from a proof of concept to a pipeline for scraping material and feeding it into a database.
Maintainer of jsdom here. jsdom will run the JavaScript on a page, so it can get you pretty far in this regard without a proper browser. It has some definite limitations, most notably that it doesn’t do any layout or handling of client-side redirects, but it allows scraping of most single-page client-side-rendered apps.
Hey everyone, maintainer of the Apify SDK here. As far as we know, it is the most comprehensive open-source scraping library for JavaScript (Node.js).
It gives you tools to work with both HTTP requests and headless browsers, storages to save data without having to fiddle with databases, and automatic scaling based on available system resources. We use it every day in our web scraping business, but 90% of the features are available for free in the library itself.
Try it out and tell us what you think: https://github.com/apify/apify-js
And with Puppeteer (also Playwright) it’s never been easier. Recaptcha solving, Ad blocking etc. in just a few lines of code[1]. https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra
Helium extends Selenium’s API and makes it easier to use. The name comes from helium also being a chemical element, but lighter than selenium. https://github.com/mherrmann/selenium-python-helium
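A tiny illustrative Helium snippet (the URL is a placeholder; S() takes a CSS selector and each match exposes the underlying Selenium element):
from helium import start_chrome, find_all, S, kill_browser

start_chrome("https://example.com")                # launches a real Chrome
for link in find_all(S("a")):
    print(link.web_element.get_attribute("href"))  # drop down to Selenium when needed
kill_browser()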
bot detection
I failed a web scraping project due to strong anti-bot detection.
They checked for bots through user agent/screen size, maybe mouse movements, trends in searches (same area code), etc. (Can they really detect me through my internet connection headers, despite proxies?)
It was impossible for me to scrape; they won.
Same here.
There are two approaches they use that make developing bots very difficult.
1. They detect device input. If there is no mouse movement while the website is being loaded, they will consider it a bot.
2. They detect the order of page visits. A human visitor will not enumerate all paths; instead, they follow certain patterns. This is detectable with their machine learning models.
I really don’t have a solution for #2
- I think the solution is “hybrid” scraping, with a human driving the clicks and the scraper passively collecting the data. If you record those sessions, you can probably teach an AI to emulate them.
Public Data Set
Tips
If you use Selenium & Chrome WebDriver, you can disable loading images with:
AddUserProfilePreference("profile.default_content_setting_values.images", 2)
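That call is from the C# bindings; a Python equivalent sets the same preference key through ChromeOptions:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option(
    "prefs", {"profile.default_content_setting_values.images": 2}   # 2 = block images
)
driver = webdriver.Chrome(options=options)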