Browser automation with Selenium

In this guide, you'll learn how to use Selenium for browser automation and web scraping in your Apify Actors.

Introduction

Selenium is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would.

Some of the key features of Selenium for web scraping include:

Broad ecosystem - Selenium has a large community and extensive documentation, with support for multiple programming languages beyond Python.
WebDriver protocol - Selenium uses the W3C WebDriver protocol, providing standardized browser automation that works with Chrome, Firefox, Edge, and Safari.
Headless and headful modes - Selenium can run with or without a visible browser window, making it suitable for both local development and containerized environments.
Flexible element selection - Selenium provides CSS selectors, XPath, ID, class name, and other strategies for locating elements on a page.
User interaction emulation - Selenium allows you to emulate user actions like clicking, scrolling, filling out forms, and typing, which is useful for scraping dynamic websites.

To create Actors that use Selenium, start from the Selenium & Python Actor template.

On the Apify platform, the Actor will already have Selenium and the necessary browsers preinstalled in its Docker image, including the tools and setup necessary to run browsers in headful mode.

When running the Actor locally, you'll need to install the Selenium browser drivers yourself. Refer to the Selenium documentation for installation instructions.

Example Actor

This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.

It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.

import asyncio
from typing import Any
from urllib.parse import urljoin, urlsplit

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.common.by import By

from apify import Actor, Request
from apify.storages import RequestQueue

# To run locally, install the Selenium ChromeDriver:
# https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/
# On the Apify platform, it's already in the Actor's Docker image.


def build_chrome_driver() -> webdriver.Chrome:
    """Create a headless Chrome WebDriver suitable for a container."""
    chrome_options = ChromeOptions()

    if Actor.configuration.headless:
        chrome_options.add_argument('--headless=new')

    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-gpu')

    return webdriver.Chrome(options=chrome_options)


def scrape_page(driver: webdriver.Chrome, url: str) -> tuple[dict[str, Any], list[str]]:
    """Navigate to the URL with Selenium and return its data and same-site links."""
    driver.get(url)

    data = {
        'url': url,
        'title': driver.title,
        'h1s': [el.text for el in driver.find_elements(By.TAG_NAME, 'h1')],
        'h2s': [el.text for el in driver.find_elements(By.TAG_NAME, 'h2')],
        'h3s': [el.text for el in driver.find_elements(By.TAG_NAME, 'h3')],
    }

    # Resolve each href against the page URL and keep only http(s) links on the same host.
    links: list[str] = []
    host = urlsplit(url).netloc
    for link in driver.find_elements(By.TAG_NAME, 'a'):
        href = link.get_attribute('href')
        if not href:
            # Anchors without an href would otherwise resolve to the page itself.
            continue
        link_url = urljoin(url, href)
        if not link_url.startswith(('http://', 'https://')):
            continue
        if urlsplit(link_url).netloc == host:
            links.append(link_url)

    return data, links


async def enqueue_links(
    request_queue: RequestQueue,
    links: list[str],
    *,
    depth: int,
    max_depth: int,
) -> None:
    """Enqueue the links one level deeper, unless max_depth was reached."""
    if depth >= max_depth:
        return

    for link_url in links:
        Actor.log.info(f'Enqueuing {link_url} ...')
        request = Request.from_url(link_url)
        request.crawl_depth = depth + 1
        await request_queue.add_request(request)


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        max_depth = actor_input.get('maxDepth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Open the request queue and enqueue the start URLs (crawl depth 0).
        request_queue = await Actor.open_request_queue()
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing start URL: {url}')
            await request_queue.add_request(Request.from_url(url))

        # Cap the crawl. Raise or remove the limit to follow more pages.
        max_requests = 10
        handled_requests = 0

        Actor.log.info('Launching Chrome WebDriver...')
        driver = build_chrome_driver()

        while handled_requests < max_requests and (
            request := await request_queue.fetch_next_request()
        ):
            handled_requests += 1
            url = request.url
            depth = request.crawl_depth
            Actor.log.info(f'Scraping {url} (depth={depth}) ...')

            try:
                # Blocking WebDriver calls run in a worker thread.
                data, links = await asyncio.to_thread(scrape_page, driver, url)
                await Actor.push_data(data)
                Actor.log.info(
                    f'Stored data from {url} '
                    f'(title={data["title"]!r}, {len(links)} links found).'
                )
                await enqueue_links(
                    request_queue, links, depth=depth, max_depth=max_depth
                )

            except Exception:
                Actor.log.exception(f'Cannot extract data from {url}.')

            finally:
                await request_queue.mark_request_as_handled(request)

        driver.quit()


if __name__ == '__main__':
    asyncio.run(main())
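
The link filtering inside scrape_page does not need a browser, so it can be pulled out and checked on its own. The sketch below is one way to do that using only the standard library. The helper name same_site_links and the sample URLs are made up for illustration and are not part of Selenium or the Apify SDK.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def same_site_links(page_url: str, hrefs: list[Optional[str]]) -> list[str]:
    """Resolve hrefs against page_url and keep http(s) links on the same host."""
    host = urlsplit(page_url).netloc
    links: list[str] = []
    for href in hrefs:
        if not href:
            # An anchor without an href attribute yields None; skip it.
            continue
        link_url = urljoin(page_url, href)
        if not link_url.startswith(('http://', 'https://')):
            # Drops mailto:, javascript:, tel: and similar schemes.
            continue
        if urlsplit(link_url).netloc == host:
            links.append(link_url)
    return links


if __name__ == '__main__':
    sample_hrefs = ['guide', '/about', 'https://other.example.org/', 'mailto:hi@example.com', None]
    print(same_site_links('https://example.com/docs/', sample_hrefs))
```

Relative links are resolved before the host comparison, so 'guide' and '/about' both survive, while the external link and the mailto: address are dropped.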

Using Apify Proxy

Running on the Apify platform gives your scraper access to Apify Proxy, which rotates IP addresses to avoid rate limiting and blocking. The runnable example Actor skips the proxy to stay simple. This section extends it to route the browser through Apify Proxy. The snippet below isn't a complete, runnable Actor on its own. It shows only the proxy-specific parts you add to the example Actor.

Chrome ignores the credentials passed in the --proxy-server flag. To use an authenticated proxy such as Apify Proxy, configure it from inside a small extension. The proxy_auth_extension helper builds one at runtime. Its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. The proxy-aware build_chrome_driver below replaces the simple one from the example Actor and loads that extension. The new headless mode (--headless=new) is required for Chrome to load it.

import json
from pathlib import Path
from tempfile import mkdtemp
from urllib.parse import urlsplit
from zipfile import ZipFile

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions

from apify import Actor


def proxy_auth_extension(proxy_url: str) -> str:
    """Build a Chrome extension that routes Chrome through an authenticated proxy."""
    parts = urlsplit(proxy_url)

    manifest = {
        'name': 'Apify Proxy',
        'version': '1.0.0',
        'manifest_version': 3,
        'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'],
        'host_permissions': ['<all_urls>'],
        'background': {'service_worker': 'background.js'},
        'minimum_chrome_version': '108',
    }

    # The service worker sets the proxy and answers the auth challenge.
    proxy_config = json.dumps(
        {
            'mode': 'fixed_servers',
            'rules': {
                'singleProxy': {
                    'scheme': parts.scheme,
                    'host': parts.hostname,
                    'port': parts.port,
                },
            },
        }
    )
    credentials = json.dumps(
        {'username': parts.username or '', 'password': parts.password or ''}
    )
    background = (
        'chrome.proxy.settings.set('
        '{value: ' + proxy_config + ', scope: "regular"});\n'
        'chrome.webRequest.onAuthRequired.addListener(\n'
        '    () => ({authCredentials: ' + credentials + '}),\n'
        '    {urls: ["<all_urls>"]},\n'
        '    ["blocking"],\n'
        ');\n'
    )

    extension_path = Path(mkdtemp()) / 'apify_proxy.zip'
    with ZipFile(extension_path, 'w') as archive:
        archive.writestr('manifest.json', json.dumps(manifest))
        archive.writestr('background.js', background)
    return str(extension_path)


def build_chrome_driver(proxy_url: str) -> webdriver.Chrome:
    """Create a headless Chrome WebDriver routed through an authenticated proxy."""
    chrome_options = ChromeOptions()

    if Actor.configuration.headless:
        # The new headless mode is required to load the proxy extension.
        chrome_options.add_argument('--headless=new')

    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-gpu')

    # Load the proxy extension and keep it enabled in headless mode.
    chrome_options.add_extension(proxy_auth_extension(proxy_url))
    chrome_options.add_argument(
        '--disable-features=DisableLoadExtensionCommandLineSwitch'
    )

    return webdriver.Chrome(options=chrome_options)
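
proxy_auth_extension relies on urlsplit to pull the scheme, host, port, and credentials out of the proxy URL. As a quick illustration, the sketch below splits a made-up URL the same way. The helper name split_proxy_url is hypothetical, and the host, port, and credentials are placeholders rather than real Apify Proxy values.

```python
from typing import Any
from urllib.parse import urlsplit


def split_proxy_url(proxy_url: str) -> tuple[dict[str, Any], dict[str, str]]:
    """Return (server, credentials) dictionaries extracted from a proxy URL."""
    parts = urlsplit(proxy_url)
    server = {
        'scheme': parts.scheme,
        'host': parts.hostname,
        'port': parts.port,  # None when the URL has no explicit port.
    }
    credentials = {
        'username': parts.username or '',
        'password': parts.password or '',
    }
    return server, credentials


if __name__ == '__main__':
    # Placeholder URL for illustration only.
    server, credentials = split_proxy_url('http://user123:s3cret@proxy.example.com:8000')
    print(server)
    print(credentials)
```

Note that urlsplit does not percent-decode the username or password, so credentials containing escaped characters may need urllib.parse.unquote before being handed to the browser.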

To wire the proxy into the example Actor, create the proxy configuration in main with Actor.create_proxy_configuration, get a URL with await proxy_configuration.new_url(), and pass it to build_chrome_driver. To select specific proxy groups or a country, pass the relevant arguments to Actor.create_proxy_configuration. For details, see Proxy management.
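
The wiring described above might look like the following sketch. It assumes that both Actor.create_proxy_configuration() and new_url() can return None when no proxy is available, and it takes the Actor object as a parameter so the logic can be exercised without the Apify SDK installed. The helper name get_proxy_url is hypothetical.

```python
import asyncio
from typing import Any, Optional


async def get_proxy_url(actor: Any) -> Optional[str]:
    """Return a proxy URL from the Actor's proxy configuration, or None."""
    # `actor` is expected to be `apify.Actor`; it is passed in as an argument
    # so this sketch stays importable without the SDK (an assumption made
    # for illustration, not a pattern required by the SDK).
    proxy_configuration = await actor.create_proxy_configuration()
    if proxy_configuration is None:
        return None
    return await proxy_configuration.new_url()
```

Inside main, you would call proxy_url = await get_proxy_url(Actor) and, if it is not None, pass the result to the proxy-aware build_chrome_driver(proxy_url).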

Conclusion

In this guide, you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just as a human would. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!
