# Playwright Web Scraper

A robust, object-oriented web scraping framework built with Python and Playwright, designed for reliable data extraction from websites with pagination. The scraper generates structured CSV output and includes comprehensive logging.
## Features
- Powerful Browser Automation: Uses Playwright for full browser rendering support
- Pagination Handling: Automatically detects and navigates through multiple pages
- Anti-Detection Measures: Custom user agents and randomized delays
- Robust Error Handling: Comprehensive logging and error recovery
- Data Preservation: Incremental saving to prevent data loss
- Flexible Configuration: Customizable scraping parameters
- Concurrent or Sequential Scraping: Choose based on target site requirements
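Two of the features above, randomized delays and incremental saving, can be sketched as small helpers. This is a minimal illustration, not the project's actual internals; the function names `polite_delay` and `append_rows` are assumptions for the example.

```python
import csv
import random
import time
from pathlib import Path

def polite_delay(delay_range=(2, 5)):
    """Sleep for a random duration inside delay_range to avoid a robotic request rhythm."""
    delay = random.uniform(*delay_range)
    time.sleep(delay)
    return delay

def append_rows(csv_path, rows, fieldnames):
    """Append scraped rows to csv_path, writing the header only when the file is first created.

    Appending after every page (rather than writing once at the end) means a crash
    mid-run loses at most the current page, not the whole session.
    """
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```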
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/playwright-web-scraper.git
  cd playwright-web-scraper
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Playwright browsers:

  ```bash
  playwright install
  ```
## Usage

### Basic Usage
```python
from scraper import ScraperManager

# Create a scraper manager
manager = ScraperManager()

# Add a website to scrape
manager.add_scraper(
    "https://example.com/directory",
    "output_data.csv",
    delay_range=(2, 5)
)

# Run the scraper (sequential mode)
manager.run(concurrent=False)
```
### Advanced Configuration
```python
# Create multiple scrapers
manager = ScraperManager()

# Add multiple targets with different configurations
manager.add_scraper(
    "https://example.com/page1",
    "output1.csv",
    delay_range=(1, 3)
)
manager.add_scraper(
    "https://example.com/page2",
    "output2.csv",
    delay_range=(2, 4)
)

# Run scrapers sequentially (more polite to servers)
manager.run(concurrent=False)

# Or run concurrently if appropriate
# manager.run(concurrent=True)
```
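The sequencing behind `run(concurrent=False)` versus `run(concurrent=True)` can be sketched as follows. This is not the project's actual `ScraperManager` (which drives Playwright internally); here the scraping work is an injected callable so the dispatch logic stands alone, and the class name `MiniScraperManager` is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

class MiniScraperManager:
    """Sketch of the manager's dispatch logic with pluggable scraping work."""

    def __init__(self, scrape_fn):
        self.scrape_fn = scrape_fn  # callable(url, output_csv, delay_range)
        self.jobs = []

    def add_scraper(self, url, output_csv, delay_range=(2, 5)):
        # Register a target; nothing runs until run() is called.
        self.jobs.append((url, output_csv, delay_range))

    def run(self, concurrent=False):
        if concurrent:
            # One worker thread per registered target.
            with ThreadPoolExecutor(max_workers=max(1, len(self.jobs))) as pool:
                list(pool.map(lambda job: self.scrape_fn(*job), self.jobs))
        else:
            # Sequential mode: one site at a time, gentler on servers.
            for job in self.jobs:
                self.scrape_fn(*job)
```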
### Customizing the Scraper

To customize the data extraction logic:

- Modify the `extract_data()` method in the `WebScraper` class to match your target website's structure.
- Update the selectors (e.g., `.company-item`, `.name`, `.location`) to match the HTML elements on your target website.
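An `extract_data()` override might look like the sketch below. The field names and the exact method shape are assumptions; `page` is expected to expose Playwright's sync element API (`query_selector_all`, `query_selector`, `inner_text`), and the selectors are the placeholders mentioned above.

```python
def extract_data(page):
    """Collect one dict per listing on the current page.

    The selectors below (.company-item, .name, .location) are placeholders;
    adapt them to your target site's actual HTML structure.
    """
    results = []
    for item in page.query_selector_all(".company-item"):
        name = item.query_selector(".name")
        location = item.query_selector(".location")
        results.append({
            # Guard against missing sub-elements so one malformed listing
            # doesn't abort the whole page.
            "name": name.inner_text().strip() if name else "",
            "location": location.inner_text().strip() if location else "",
        })
    return results
```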
## Project Structure

```
playwright-web-scraper/
├── scraper.py         # Main scraper code
├── requirements.txt   # Project dependencies
└── README.md          # This file
```
## Logging

The scraper creates detailed logs in the format `scraping_YYYYMMDD_HHMMSS.log`.
These logs contain information about:
- Navigation success/failure
- Pages discovered and scraped
- Items extracted per page
- Error details
- Data saving operations
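The timestamped log name can be produced with the standard `logging` and `datetime` modules; a minimal sketch, assuming this naming scheme and a log format the project may configure differently:

```python
import logging
from datetime import datetime

def log_filename(now=None):
    """Build the scraping_YYYYMMDD_HHMMSS.log name from a timestamp."""
    now = now or datetime.now()
    return f"scraping_{now:%Y%m%d_%H%M%S}.log"

def configure_logging(filename):
    """Route log records to the timestamped file (format is an assumed example)."""
    logging.basicConfig(
        filename=filename,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return logging.getLogger("scraper")
```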
## Requirements

- Python 3.7+
- Playwright
- See `requirements.txt` for complete dependencies
## Best Practices

- Be respectful of the websites you scrape:
  - Use reasonable delays between requests
  - Run in sequential mode when possible
  - Limit concurrent connections
- Check Terms of Service of target websites before scraping
- Use for legitimate purposes only
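Alongside reading the Terms of Service, a common courtesy check (not built into this project) is consulting the site's `robots.txt` with the standard-library `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots.txt (given as text) permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In practice you would fetch `https://<site>/robots.txt` once and skip any URL the parser disallows.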
## Disclaimer
This tool is for educational purposes only. Users are responsible for ensuring their use of this scraper complies with the target website's terms of service and relevant laws and regulations.
## License

Distributed under the GNU Affero General Public License v3.0. See `LICENSE` for more information.