How to Use Splash with Proxies for Scraping JavaScript Webpages

Learn different ways to use proxies for web scraping with Splash


In web scraping, we often need to use proxies, either to avoid IP blocking or to access geolocation-sensitive data, i.e., data that changes depending on the geolocation of the IP making the request. Specifying proxies for regular web scraping is straightforward. However, when scraping JavaScript-rendered webpages with Splash, it can be more complex because the requests need to be made through the Splash headless browser behind the scenes.

In this post, we will introduce how to use Splash with proxies for scraping JavaScript-rendered webpages with simple examples.


Preparation

To get started, we need to run a Splash server locally. The easiest way is to use Docker:

docker run -p 8050:8050 --rm scrapinghub/splash:3.5

Then we need to install the requests library for making HTTP requests in Python. If you need to do more complex web scraping in Python, you can install the lxml library as well; we will use it in a later example.
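
Both can be installed with pip:

pip install requests lxml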


Get your IP in web scraping

The easiest way to test whether a proxy is used successfully is to check which IP your requests appear to come from.

We can make a request to https://httpbin.org/ip which will return the IP of the request:

import requests

response = requests.get("https://httpbin.org/ip")
print(response.json())
# {'origin': '31.XXX.XXX.16'}

When a proxy is used in web scraping, the request is made with the proxy's IP instead of your own, which helps avoid IP blocking and also lets you access geolocation-sensitive data.

There are many commercial proxy providers out there. If you just want to run some tests, you can use WebShare, which provides free proxies for testing.

Let’s use a proxy in our request above and see what IP is returned:

import requests

proxies = {
    "http": "http://USERNAME:PASSWORD@2.XXX.XXX.93:5074",
    "https": "http://USERNAME:PASSWORD@2.XXX.XXX.93:5074",
}

response = requests.get("https://httpbin.org/ip", proxies=proxies)
print(response.json())
# {'origin': '2.XXX.XXX.93'}

It shows that the IP of the proxy is returned, rather than the local IP shown above.


Use a proxy with Splash

Now let’s use a proxy with Splash. We can simply specify a proxy query parameter in the Splash API URL. The value of proxy is either a proxy URL or a proxy profile name, as we will see soon:

from urllib.parse import quote

import requests

httpbin_url = "https://httpbin.org/ip"
encoded_url = quote(httpbin_url)
proxy = "http://USERNAME:PASSWORD@2.XXX.XXX.93:5074"

# Pass the target URL and the proxy as query parameters of the Splash API.
splash_url = "http://localhost:8050"
api_url = f"{splash_url}/render.html?url={encoded_url}&proxy={proxy}&timeout=5"

response = requests.get(api_url)
print(response.text)

And this is what we get:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "origin": "2.XXX.XXX.93"
}
</pre></body></html>

As we see, the IP of the proxy is returned, meaning the proxy is used successfully by Splash for web scraping. However, two things should be noted here:

  • Only an HTTP proxy needs to be specified here, even for HTTPS URLs.
  • The render.html endpoint returns the rendered page as HTML, so the JSON data comes back wrapped in markup rather than as plain JSON (a workaround is sketched below).
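
If you want to keep using the render.html endpoint, one workaround is to parse the JSON back out of the rendered HTML, for example with the lxml library installed in the preparation step. A minimal sketch, assuming the same placeholder proxy as above:

from urllib.parse import quote
import json
import requests
from lxml import html

splash_url = "http://localhost:8050"
proxy = "http://USERNAME:PASSWORD@2.XXX.XXX.93:5074"
encoded_url = quote("https://httpbin.org/ip")
api_url = f"{splash_url}/render.html?url={encoded_url}&proxy={proxy}&timeout=5"

# httpbin's JSON is rendered inside a <pre> tag; extract the text and parse it.
tree = html.fromstring(requests.get(api_url).text)
data = json.loads(tree.xpath("//pre/text()")[0])
print(data)  # {'origin': '2.XXX.XXX.93'}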

Alternatively, to get the JSON data directly, we can make requests to the execute endpoint and run a custom Lua script in Splash. Let’s wrap the code in a function so it can be reused easily later:

from urllib.parse import quote

import requests

splash_url = "http://localhost:8050"
httpbin_url = "https://httpbin.org/ip"
proxy = "http://USERNAME:PASSWORD@2.XXX.XXX.93:5074"


def get_ip_json(proxy, timeout=5):
    encoded_url = quote(httpbin_url)
    api_url = f"{splash_url}/execute?url={encoded_url}&proxy={proxy}&wait={timeout}"

    # A simple Lua script that fetches the URL and returns the raw response
    # body, which is the JSON data from httpbin.
    payload = {
        "lua_source": """
        function main(splash, args)
            local response = splash:http_get(args.url)
            splash:wait(args.wait)
            return response.body
        end
        """
    }

    response = requests.post(api_url, json=payload)
    print(response.json())


get_ip_json(proxy)
# {'origin': '2.XXX.XXX.93'}

A simple Lua script is used to make the HTTP request and return the response body containing the JSON data.

If we call the get_ip_json() function with the proxy, the JSON data is returned, as in our first example.


Use Proxy Profiles

When you need to use multiple proxies in your scraping jobs, for example different proxies for different countries, it is more convenient to use “proxy profiles” rather than specifying the proxies explicitly in the code.

A proxy profile is a configuration file in the INI format where you specify the HTTP proxy configuration, including host, port, username, password, etc. Each profile is identified by its file name without the .ini extension.

Let’s create a folder called proxy-profiles in the current folder and then create two files proxy-es.ini and proxy-us.ini in it:

proxy-profiles/
├── proxy-es.ini
└── proxy-us.ini

The content for proxy-es.ini is:

[proxy]
; required
host=185.XXX.XXX.156
port=7492

; optional, default is no auth
username=USERNAME
password=PASSWORD

; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP

And the content for proxy-us.ini is:

[proxy]
; required
host=2.XXX.XXX.93
port=5074

; optional, default is no auth
username=USERNAME
password=PASSWORD

; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP

Remember to replace these configurations with your own for the examples to work.

Then we can bind-mount the proxy-profiles folder into the Docker container. Note that the -v option needs an absolute host path, so we use $(pwd) here:

docker run -it -p 8050:8050 -v "$(pwd)/proxy-profiles:/etc/splash/proxy-profiles" --rm scrapinghub/splash:3.5

Note that you need to stop the existing Splash container before starting a new one; otherwise, you will get a port conflict on port 8050.
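
If you started the container with the command above, one way to find and stop it (assuming it is the only container running the scrapinghub/splash:3.5 image) is:

docker stop $(docker ps -q --filter ancestor=scrapinghub/splash:3.5)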

The corresponding docker-compose.yaml file for Docker Compose is:

version: '3.8'

services:
  splash:
    image: scrapinghub/splash:3.5
    ports:
      - target: 8050
        published: 8050
    volumes:
      - type: bind
        source: ./proxy-profiles
        target: /etc/splash/proxy-profiles
    restart: always
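
With this file in place, the container can be started (and kept running via the restart policy) with:

docker compose up -d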

When the container is run, you will see in the log that proxy profiles support is enabled:

2023-05-26 06:12:42.115017 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles

Now we can start to use the proxy profiles for scraping. Note that in the code we reference a profile by its file name without the .ini extension:

get_ip_json(proxy="proxy-us")
# {'origin': '2.XXX.XXX.93'}

get_ip_json(proxy="proxy-es")
# {'origin': '185.XXX.XXX.156'}

Cheers! The proxy profiles for different countries are used successfully in scraping.
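
As a final sketch, proxy profiles also make it easy to rotate proxies across requests. This reuses the get_ip_json() helper defined above with the two profiles we just created:

import random

proxy_profiles = ["proxy-us", "proxy-es"]

# Pick a profile per request to spread the traffic across the proxies.
for _ in range(4):
    get_ip_json(proxy=random.choice(proxy_profiles))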

