How to Use Zyte Smart Proxy with Scrapy and Splash

Learn how to scrape JavaScript webpages with a smart proxy


In this post, we will introduce how to use the Zyte smart proxy with Splash, integrated into the Scrapy web scraping framework. We will learn how to set up the Zyte smart proxy, how to use it with Splash, and how to set up a headless proxy tool to use the smart proxy more efficiently. Hopefully, this post can help you solve similar issues in your own system.


Create a scraping project and a spider

First, we need to create a virtual environment and install the libraries needed for this post:

conda create -n scrapy python=3.11
conda activate scrapy

pip install Scrapy==2.9.0
pip install scrapy-splash==0.9.0
pip install scrapy-zyte-smartproxy==2.2.0

Then we can create a scraping project and also a spider in it:

scrapy startproject scraping_proj

cd scraping_proj

scrapy genspider httpbin httpbin.org

We will create a super simple spider that just scrapes https://httpbin.org/ip and returns the IP. This makes it easy to see whether the proxy is being used successfully.

import scrapy

class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["httpbin.org"]
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        print(response.text)

If we run the spider directly, we can see our local IP is being used:

$ scrapy crawl httpbin -L WARNING

{
  "origin": "94.xxx.xxx.20"
}

Use Zyte smart proxy directly

If we are not scraping JavaScript-rendered websites, we can use the Zyte smart proxy directly, which is also the most common case.

We need to modify settings.py and add the following configurations to it.

# Zyte Smart Proxy Manager configuration
DOWNLOADER_MIDDLEWARES = {'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610}

ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = '319xxxxxxxxxxxxxxxxxxxxxxxxxx3b3'

The Zyte API key can be obtained from the Zyte dashboard.

Now if we run the spider, we can see that the IP of the proxy is returned:

$ scrapy crawl httpbin -L WARNING

{
  "origin": "173.xxx.xxx.154"
}
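
If you want to double-check the credentials outside of Scrapy, you can send a plain request through the proxy with the requests library. This is a minimal sketch, assuming the standard proxy.zyte.com:8011 endpoint; the middleware above does the equivalent for every Scrapy request:

import requests

# Sanity check (sketch): the API key acts as the proxy username with an
# empty password. verify=False skips certificate checks, since the smart
# proxy re-signs HTTPS traffic with its own CA certificate.
APIKEY = "319xxxxxxxxxxxxxxxxxxxxxxxxxx3b3"
proxies = {
    "http": f"http://{APIKEY}:@proxy.zyte.com:8011",
    "https": f"http://{APIKEY}:@proxy.zyte.com:8011",
}
response = requests.get("https://httpbin.org/ip", proxies=proxies, verify=False)
print(response.text)  # should show a proxy IP, not your local one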

Use Zyte with Scrapy-Splash

Splash can be integrated with Scrapy through the scrapy-splash plugin, which makes it much easier to use Splash from Scrapy.

Let’s first start a Splash server locally using Docker:

docker run -it -p 8050:8050 scrapinghub/splash:3.5
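
To confirm that Splash is up, you can open http://localhost:8050 in a browser, or hit its render.html HTTP endpoint, as in this quick sketch:

import requests

# Ask the local Splash instance to render a page through its render.html
# HTTP API endpoint; a 200 status means Splash is up and rendering.
r = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://httpbin.org/ip", "wait": 0.5},
)
print(r.status_code)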

Then we need to add Splash-specific configurations in settings.py:

# Splash settings.
DOWNLOADER_MIDDLEWARES.update({
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
})

SPLASH_URL = 'http://127.0.0.1:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Since the Lua script used here is fairly long, it’s better to put it in a separate file with the .lua extension to get proper syntax highlighting; otherwise, the code is very difficult to read inside a big string.
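
Since we will later load the script with pkgutil.get_data, the scripts folder needs to live inside the scraping_proj Python package (the inner folder containing settings.py), so the project layout looks roughly like this:

scraping_proj/
├── scrapy.cfg
└── scraping_proj/
    ├── __init__.py
    ├── settings.py
    ├── scripts/
    │   └── zyte_splash_manager.lua
    └── spiders/
        └── httpbin.py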

Let’s create this scripts folder in the package and a Lua script named zyte_splash_manager.lua in it, with the following content:

function use_crawlera(splash)
    local user = splash.args.zyte_apikey
    local password = ''
    
    local host = 'proxy.zyte.com'
    -- Note 8010 is used here. 8010 is more stable than 8011 for Splash!
    local port = 8010

    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        request:set_proxy(host, port, user, password)
        request:set_header('X-Crawlera-Profile', 'desktop')
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
    end)

    splash:on_response_headers(function (response)
        -- Reuse the session ID assigned by the proxy on later requests.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)
    return splash:html()
end

Finally, we can update our spider to read the Lua script and use it to make the request:

import scrapy
from pkgutil import get_data
from scrapy_splash import SplashRequest

class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["httpbin.org"]
    start_urls = ["https://httpbin.org/ip"]

    # Note we need to disable the smart proxy for the spider.
    zyte_smartproxy_enabled = False

    def __init__(self, *args, **kwargs):
        self.lua_source = get_data(
            "scraping_proj", "scripts/zyte_splash_manager.lua"
        ).decode("utf-8")
        super().__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={
                    "lua_source": self.lua_source,
                    "zyte_apikey": self.settings["ZYTE_SMARTPROXY_APIKEY"],
                },
                cache_args=['lua_source'],
                meta={
                    'max_retry_times': 10,
                }
            )

    def parse(self, response):
        print(response.text)

Note that we set zyte_smartproxy_enabled = False to disable the smart proxy middleware for this spider: the proxy is now handled by Splash, not by Scrapy.

Now when we run the spider, the IP of the proxy is returned, meaning the proxy is used successfully by the Zyte-Splash-Scrapy integration.

$ scrapy crawl httpbin -L ERROR

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "origin": "104.140.4.254"
}
</pre></body></html>
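
Note that the JSON is now wrapped in the HTML page that Splash rendered. If we only want the JSON body, we could extract the text of the <pre> element in parse, as in this small sketch of the method:

def parse(self, response):
    # Splash returns the rendered HTML, so the JSON body sits inside a
    # <pre> element; extract its text instead of printing the whole page.
    print(response.css("pre::text").get())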

Use Zyte Headless Proxy

Zyte headless proxy is a proxy tool that helps users of headless browsers (like Splash) use the Zyte smart proxy more efficiently.

We can install the headless proxy tool directly on our computer, or use a Docker image instead. We will choose the latter in this post.

It’s better to use Docker Compose to start both Splash and the headless proxy tool together so they can communicate with each other in the same network:

version: '3.8'

services:
  splash:
    image: scrapinghub/splash:3.5
    ports:
      - target: 8050
        published: 8050
    networks:
      - scraping
    restart: always
 
  zyte-proxy:
    image: zytedata/zyte-smartproxy-headless-proxy:1.4.0
    ports:
      - target: 3128
        published: 3128
    networks:
      - scraping
    command: ["--api-key=319xxxxxxxxxxxxxxxxxxxxxxxxxx3b3"]
    restart: always

networks:
  scraping:
    driver: bridge

Note that you should stop the existing Docker container for Splash before starting the services in this docker-compose.yaml file.
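
Assuming the file above is saved as docker-compose.yaml in the project root, both services can then be started with:

docker compose up -d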

We need to update the Lua script because the proxy is handled differently now: instead of passing the host, port, and API key separately, we pass a single proxy string, and the headless proxy takes care of the authentication:

function use_crawlera(splash)
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- Note that a proxy string is used now.
        request:set_proxy(splash.args.proxy)
        request:set_header('X-Crawlera-Profile', 'desktop')
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
    end)

    splash:on_response_headers(function (response)
        -- Reuse the session ID assigned by the proxy on later requests.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)
    return splash:html()
end

Also in the spider, we need to change the args of SplashRequest to pass the proxy string rather than the Zyte API Key:

import scrapy
from pkgutil import get_data
from scrapy_splash import SplashRequest

class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["httpbin.org"]
    start_urls = ["https://httpbin.org/ip"]
    zyte_smartproxy_enabled = False

    def __init__(self, *args, **kwargs):
        self.lua_source = get_data(
            "scraping_proj", "scripts/zyte_splash_manager.lua"
        ).decode("utf-8")
        super().__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={
                    "lua_source": self.lua_source,
                    # A full proxy string is passed now.
                    "proxy": "http://zyte-proxy:3128",
                },
                cache_args=['lua_source'],
                meta={
                    'max_retry_times': 10,
                }
            )

    def parse(self, response):
        print(response.text)

The full proxy string is http://zyte-proxy:3128, where zyte-proxy is the service name of the headless proxy tool and 3128 is its default port. Since Splash and the headless proxy are on the same Docker network, they can reach each other by service name.
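
From the host machine, the headless proxy is instead reachable on the published port 3128 on localhost. We can use that to check it works before running the spider, as in this sketch (verify=False is needed because the headless proxy re-signs TLS traffic with its own certificate):

import requests

# Sanity check (sketch): route a request through the headless proxy container
# from the host. No credentials are needed here because the container already
# holds the API key.
proxies = {"http": "http://localhost:3128", "https": "http://localhost:3128"}
r = requests.get("https://httpbin.org/ip", proxies=proxies, verify=False)
print(r.text)  # should show a proxy IP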

Now the spider is using the proxy provided by the headless proxy tool, which is generally more efficient than using the smart proxy directly:

$ scrapy crawl httpbin -L ERROR

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "origin": "181.215.115.251"
}
</pre></body></html>

In this post, we introduced the technical details of integrating Zyte smart proxy, Scrapy, and Splash to scrape JavaScript-heavy webpages. These three tools are developed by the same company and work efficiently as an ecosystem. However, you don’t always need a smart proxy if your scraping workload is not heavy. You can check this post for how to use free proxies with Splash, which is mostly helpful for geotargeted web scraping.

