# Playwright Web Scraper

A robust, object-oriented web scraping framework built with Python and Playwright, designed for reliable data extraction from websites with pagination. The scraper generates structured CSV output and includes comprehensive logging.
## Features
- Powerful Browser Automation: Uses Playwright for full browser rendering support
- Pagination Handling: Automatically detects and navigates through multiple pages
- Anti-Detection Measures: Custom user agents and randomized delays
- Robust Error Handling: Comprehensive logging and error recovery
- Data Preservation: Incremental saving to prevent data loss
- Flexible Configuration: Customizable scraping parameters
- Concurrent or Sequential Scraping: Choose based on target site requirements
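Two of the features above, randomized delays and incremental saving, can be sketched as small helpers. This is a minimal illustration, not the project's actual internals; the function names `polite_delay` and `append_rows` are assumptions for the example.

```python
import csv
import random
import time
from pathlib import Path

def polite_delay(delay_range=(2, 5)):
    """Sleep for a random duration inside delay_range to avoid a robotic request rhythm."""
    delay = random.uniform(*delay_range)
    time.sleep(delay)
    return delay

def append_rows(csv_path, rows, fieldnames):
    """Append scraped rows to csv_path, writing the header only when the file is first created.

    Appending after every page (rather than writing once at the end) means a crash
    mid-run loses at most the current page, not the whole session.
    """
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```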
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/playwright-web-scraper.git
  cd playwright-web-scraper
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Playwright browsers:

  ```bash
  playwright install
  ```
## Usage

### Basic Usage
```python
from scraper import ScraperManager

# Create a scraper manager
manager = ScraperManager()

# Add a website to scrape
manager.add_scraper(
    "https://example.com/directory",
    "output_data.csv",
    delay_range=(2, 5)
)

# Run the scraper (sequential mode)
manager.run(concurrent=False)
```
### Advanced Configuration
```python
# Create multiple scrapers
manager = ScraperManager()

# Add multiple targets with different configurations
manager.add_scraper(
    "https://example.com/page1",
    "output1.csv",
    delay_range=(1, 3)
)
manager.add_scraper(
    "https://example.com/page2",
    "output2.csv",
    delay_range=(2, 4)
)

# Run scrapers sequentially (more polite to servers)
manager.run(concurrent=False)

# Or run concurrently if appropriate
# manager.run(concurrent=True)
```
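The sequencing behind `run(concurrent=False)` versus `run(concurrent=True)` can be sketched as follows. This is not the project's actual `ScraperManager` (which drives Playwright internally); here the scraping work is an injected callable so the dispatch logic stands alone, and the class name `MiniScraperManager` is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

class MiniScraperManager:
    """Sketch of the manager's dispatch logic with pluggable scraping work."""

    def __init__(self, scrape_fn):
        self.scrape_fn = scrape_fn  # callable(url, output_csv, delay_range)
        self.jobs = []

    def add_scraper(self, url, output_csv, delay_range=(2, 5)):
        # Register a target; nothing runs until run() is called.
        self.jobs.append((url, output_csv, delay_range))

    def run(self, concurrent=False):
        if concurrent:
            # One worker thread per registered target.
            with ThreadPoolExecutor(max_workers=max(1, len(self.jobs))) as pool:
                list(pool.map(lambda job: self.scrape_fn(*job), self.jobs))
        else:
            # Sequential mode: one site at a time, gentler on servers.
            for job in self.jobs:
                self.scrape_fn(*job)
```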
### Customizing the Scraper

To customize the data extraction logic:

- Modify the `extract_data()` method in the `WebScraper` class to match your target website's structure.
- Update the selectors (e.g., `.company-item`, `.name`, `.location`) to match the HTML elements on your target website.
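An `extract_data()` override might look like the sketch below. The field names and the exact method shape are assumptions; `page` is expected to expose Playwright's sync element API (`query_selector_all`, `query_selector`, `inner_text`), and the selectors are the placeholders mentioned above.

```python
def extract_data(page):
    """Collect one dict per listing on the current page.

    The selectors below (.company-item, .name, .location) are placeholders;
    adapt them to your target site's actual HTML structure.
    """
    results = []
    for item in page.query_selector_all(".company-item"):
        name = item.query_selector(".name")
        location = item.query_selector(".location")
        results.append({
            # Guard against missing sub-elements so one malformed listing
            # doesn't abort the whole page.
            "name": name.inner_text().strip() if name else "",
            "location": location.inner_text().strip() if location else "",
        })
    return results
```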
## Project Structure

```
playwright-web-scraper/
├── scraper.py         # Main scraper code
├── requirements.txt   # Project dependencies
└── README.md          # This file
```
## Logging

The scraper creates detailed logs in the format `scraping_YYYYMMDD_HHMMSS.log`.
These logs contain information about:
- Navigation success/failure
- Pages discovered and scraped
- Items extracted per page
- Error details
- Data saving operations
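The timestamped log name can be produced with the standard `logging` and `datetime` modules; a minimal sketch, assuming this naming scheme and a log format the project may configure differently:

```python
import logging
from datetime import datetime

def log_filename(now=None):
    """Build the scraping_YYYYMMDD_HHMMSS.log name from a timestamp."""
    now = now or datetime.now()
    return f"scraping_{now:%Y%m%d_%H%M%S}.log"

def configure_logging(filename):
    """Route log records to the timestamped file (format is an assumed example)."""
    logging.basicConfig(
        filename=filename,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return logging.getLogger("scraper")
```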
## Requirements

- Python 3.7+
- Playwright
- See `requirements.txt` for complete dependencies
## Best Practices

- Be respectful of the websites you scrape:
  - Use reasonable delays between requests
  - Run in sequential mode when possible
  - Limit concurrent connections
- Check Terms of Service of target websites before scraping
- Use for legitimate purposes only
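Alongside reading the Terms of Service, a common courtesy check (not built into this project) is consulting the site's `robots.txt` with the standard-library `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots.txt (given as text) permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In practice you would fetch `https://<site>/robots.txt` once and skip any URL the parser disallows.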
## Disclaimer
This tool is for educational purposes only. Users are responsible for ensuring their use of this scraper complies with the target website's terms of service and relevant laws and regulations.
## License

Distributed under the GNU Affero General Public License v3.0. See `LICENSE` for more information.