Learn how to scrape JavaScript webpages with a smart proxy
In this post, we will show how to use the Zyte smart proxy together with Splash, integrated into the Scrapy web scraping framework. We will learn how to set up the Zyte smart proxy, how to use it with Splash, and how to run a headless proxy tool that makes more efficient use of the smart proxy. Hopefully, this post helps you solve some issues in your own system.
Create a scraping project and a spider
First, we need to create a virtual environment and install the libraries needed for this post:
conda create -n scrapy python=3.11
conda activate scrapy
pip install Scrapy==2.9.0
pip install scrapy-splash==0.9.0
pip install scrapy-zyte-smartproxy==2.2.0
Then we can create a scraping project and also a spider in it:
scrapy startproject scraping_proj
cd scraping_proj
scrapy genspider httpbin httpbin.org
We will create a super simple spider that just scrapes https://httpbin.org/ip and returns the IP. This simple spider shows whether the proxy is being used successfully or not.
import scrapy


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["httpbin.org"]
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        print(response.text)
If we run the spider directly, we can see our local IP is being used:
$ scrapy crawl httpbin -L WARNING
{
"origin": "94.xxx.xxx.20"
}
Use Zyte smart proxy directly
If we are not scraping JavaScript-rendered websites, we can use the Zyte smart proxy directly, which is also the most common case.
We need to modify settings.py and add the following configurations to it.
# Zyte Smart Proxy Manager configuration
DOWNLOADER_MIDDLEWARES = {'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = '319xxxxxxxxxxxxxxxxxxxxxxxxxx3b3'
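For intuition, the proxy authenticates requests by taking the API key as the HTTP Basic-auth username with an empty password (the same scheme the Lua script later in this post passes to set_proxy); the middleware builds this header for you, so the sketch below is only an illustration of what gets sent:

```python
import base64

def proxy_auth_header(api_key: str) -> str:
    # Zyte Smart Proxy Manager authenticates with the API key as the
    # Basic-auth username and an empty password.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return f"Basic {token}"
```

You never need to call this yourself; it is just what `ZYTE_SMARTPROXY_APIKEY` turns into on the wire.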
The Zyte API key can be obtained from the Zyte dashboard.
Now if we run the spider, we can see that the IP of the proxy is returned:
$ scrapy crawl httpbin -L WARNING
{
"origin": "173.xxx.xxx.154"
}
Use Zyte with Scrapy-Splash
Splash can be integrated with Scrapy with the scrapy-splash plugin which makes it easier to use Splash in Scrapy.
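Conceptually, the plugin rewrites each request so that it targets the Splash HTTP API instead of the site itself. A minimal sketch of that idea (simplified; the real plugin uses POST requests to whichever endpoint each SplashRequest specifies):

```python
from urllib.parse import urlencode

SPLASH_URL = "http://127.0.0.1:8050"  # the local Splash server we start below

def splash_render_url(target_url: str, wait: float = 0.5) -> str:
    # Wrap the target URL into a call to Splash's render.html endpoint,
    # which returns the page HTML after JavaScript has executed.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"
```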
Let’s first start a Splash server locally using Docker:
docker run -it -p 8050:8050 scrapinghub/splash:3.5
Then we need to add Splash-specific configurations in settings.py:
# Splash settings.
DOWNLOADER_MIDDLEWARES.update({
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
})
SPLASH_URL = 'http://127.0.0.1:8050'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Since the Lua script will be fairly long, it's better to put it in a separate file with the .lua extension to get proper syntax highlighting; otherwise, it's very hard to read the code inside a big string.
Let's create a new folder scripts in the project folder and create a Lua script named zyte_splash_manager.lua in it, with the following content:
function use_crawlera(splash)
    local user = splash.args.zyte_apikey
    local password = ''
    local host = 'proxy.zyte.com'
    -- Note 8010 is used here. 8010 is more stable than 8011 for Splash!
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        request:set_proxy(host, port, user, password)
        request:set_header('X-Crawlera-Profile', 'desktop')
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
    end)

    splash:on_response_headers(function (response)
        -- type() always returns a string, so compare the value itself.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)
    return splash:html()
end
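The session handling is the subtle part of this script: the first request sends the literal value create, which asks the proxy to allocate a session, and every later request reuses whatever session ID the proxy echoed back. The same logic in plain Python, for clarity (a standalone sketch, not part of the project):

```python
SESSION_HEADER = "X-Crawlera-Session"

def next_session_id(current_id, response_headers):
    # Mirror of the Lua on_response_headers callback: adopt the session
    # ID echoed by the proxy, or keep the current one if none was sent.
    return response_headers.get(SESSION_HEADER, current_id)

session_id = "create"  # the first request asks the proxy to create a session
session_id = next_session_id(session_id, {SESSION_HEADER: "1234567890"})
# From here on, every request carries the proxy-assigned session ID.
```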
Finally, we can update our spider to read the Lua script and use it to make the request:
import scrapy
from pkgutil import get_data
from scrapy_splash import SplashRequest


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["httpbin.org"]
    start_urls = ["https://httpbin.org/ip"]
    # Note we need to disable the smart proxy for the spider.
    zyte_smartproxy_enabled = False

    def __init__(self, *args, **kwargs):
        self.lua_source = get_data(
            "scraping_proj", "scripts/zyte_splash_manager.lua"
        ).decode("utf-8")
        super().__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={
                    "lua_source": self.lua_source,
                    "zyte_apikey": self.settings["ZYTE_SMARTPROXY_APIKEY"],
                },
                cache_args=["lua_source"],
                meta={
                    "max_retry_times": 10,
                },
            )

    def parse(self, response):
        print(response.text)
Note that we need to disable the smart proxy middleware for this spider: the proxy is now handled by Splash, not by Scrapy anymore.
Now when we run the spider, the IP of the proxy is returned, meaning the proxy is used successfully by the Zyte-Splash-Scrapy integration.
$ scrapy crawl httpbin -L ERROR
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
"origin": "104.140.4.254"
}
</pre></body></html>
Use Zyte Headless Proxy
The Zyte headless proxy is a local proxy that helps users of headless browsers use the Zyte smart proxy more efficiently.
We can install the headless proxy tool directly on our computer, or use a Docker image instead. We will choose the latter in this post.
It’s better to use Docker Compose to start both Splash and the headless proxy tool together so they can communicate with each other in the same network:
version: '3.8'
services:
  splash:
    image: scrapinghub/splash:3.5
    ports:
      - target: 8050
        published: 8050
    networks:
      - scraping
    restart: always
  zyte-proxy:
    image: zytedata/zyte-smartproxy-headless-proxy:1.4.0
    ports:
      - target: 3128
        published: 3128
    networks:
      - scraping
    command: ["--api-key=319xxxxxxxxxxxxxxxxxxxxxxxxxx3b3"]
    restart: always
networks:
  scraping:
    driver: bridge
Note that you should stop the existing Docker container for Splash before starting the services in this docker-compose.yaml file.
We need to update the Lua script because the proxy is handled in a different way now. We will pass a proxy string with all the authentication information:
function use_crawlera(splash)
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- Note that a proxy string is used now.
        request:set_proxy(splash.args.proxy)
        request:set_header('X-Crawlera-Profile', 'desktop')
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
    end)

    splash:on_response_headers(function (response)
        -- type() always returns a string, so compare the value itself.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)
    return splash:html()
end
Also in the spider, we need to change the args of SplashRequest to pass the proxy string rather than the Zyte API Key:
import scrapy
from pkgutil import get_data
from scrapy_splash import SplashRequest


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["httpbin.org"]
    start_urls = ["https://httpbin.org/ip"]
    zyte_smartproxy_enabled = False

    def __init__(self, *args, **kwargs):
        self.lua_source = get_data(
            "scraping_proj", "scripts/zyte_splash_manager.lua"
        ).decode("utf-8")
        super().__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={
                    "lua_source": self.lua_source,
                    # A full proxy string is passed now.
                    "proxy": "http://zyte-proxy:3128",
                },
                cache_args=["lua_source"],
                meta={
                    "max_retry_times": 10,
                },
            )

    def parse(self, response):
        print(response.text)
The full proxy string is http://zyte-proxy:3128. zyte-proxy is the service name for the headless proxy tool and 3128 is the default port. Since Splash and the headless proxy tool are in the same network, they can communicate with each other by service names.
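Since set_proxy accepts a single string with all the authentication information, the earlier direct setup could presumably be expressed the same way, with the API key embedded as the username. Parsing both forms with the standard library shows what each string carries (the API key below is a placeholder):

```python
from urllib.parse import urlparse

headless = urlparse("http://zyte-proxy:3128")
direct = urlparse("http://my-api-key:@proxy.zyte.com:8010")

# Host and port of the headless proxy service inside the Compose network.
assert (headless.hostname, headless.port) == ("zyte-proxy", 3128)
# For the direct setup, the API key rides along as the Basic-auth username
# with an empty password.
assert (direct.username, direct.password) == ("my-api-key", "")
```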
Now the spider is using the proxy provided by the headless proxy tool, which is generally more efficient than using the smart proxy directly:
$ scrapy crawl httpbin -L ERROR
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
"origin": "181.215.115.251"
}
</pre></body></html>
In this post, we introduced the technical details of integrating the Zyte smart proxy, Scrapy, and Splash to scrape JavaScript-heavy webpages. These three tools are developed by the same company and work efficiently as an ecosystem. However, you don't always need a smart proxy if your scraping load is light. You can check this post for how to use free proxies with Splash, which is mostly helpful for geo-targeted web scraping.