DuckDB Analyzer

A powerful tool for analyzing large CSV datasets using DuckDB - a high-performance analytical database system.

Example - Sample 10 random rows from a CSV file:

🚀 Overview

DuckDB Analyzer simplifies working with large CSV datasets by leveraging the speed and efficiency of DuckDB. It provides a user-friendly CLI and Python API for common data analysis tasks without requiring complex database setup.

Key Features:

Lightning-fast CSV import and querying
Memory-efficient processing of large datasets
Simple command-line interface for common operations
Python API for integration with existing workflows
No database server or setup required

📋 Requirements

Python 3.7+
Dependencies:
- duckdb
- pandas

🔧 Installation

# Clone the repository
git clone <https://github.com/yourusername/duckdb-analyzer.git>
cd duckdb-analyzer

# Install dependencies
pip install -r requirements.txt

Alternatively, create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\\Scripts\\activate
pip install -r requirements.txt

📊 Usage

Command Line Interface

python duckdb_analyzer.py [options]

Get Sample Data

Some great sample data CSV files are available here for free.

Examples:

Count rows in a CSV file:

python duckdb_analyzer.py --file data.csv --action count

Sample 10 random rows from a CSV file:

python duckdb_analyzer.py --file data.csv --action sample --limit 10 --random

Import a CSV file into a DuckDB table:

python duckdb_analyzer.py --file data.csv --action import --table my_data

Get statistics for a specific column:

python duckdb_analyzer.py --file data.csv --action stats --column age

Perform group-by analysis:

python duckdb_analyzer.py --file data.csv --action group --column category

Execute a custom SQL query:

python duckdb_analyzer.py --action query --sql "SELECT * FROM 'data.csv' WHERE id > 100 LIMIT 5"

Python API

from duckdb_analyzer import DuckDBAnalyzer

# Use as a context manager
with DuckDBAnalyzer() as analyzer:
    # Count rows in a CSV file
    count = analyzer.count_rows("data.csv")
    print(f"Found {count:,} rows")

    # Sample data
    df = analyzer.sample_data("data.csv", rows=5, random=True)
    print(df)

    # Import into a table
    analyzer.import_csv_to_table("data.csv", "my_table")

    # Run a custom query
    result = analyzer.execute_query("SELECT * FROM my_table WHERE age > 30")

🔍 Available Actions

Action	Description	Required Args	Optional Args
`count`	Count rows in a CSV file	`--file`	-
`sample`	Show sample rows from a file	`--file`	`--limit`, `--random`
`import`	Import CSV to a DuckDB table	`--file`, `--table`	`--overwrite`
`stats`	Get statistics for a column	`--file`, `--column`	-
`schema`	Show table schema	`--table`	-
`compression`	Show table compression info	`--table`	-
`group`	Perform group-by analysis	`--file`, `--column`	-
`query`	Run a custom SQL query	`--sql`	-

🧪 Performance

DuckDB Analyzer significantly outperforms traditional Python data processing methods for large datasets.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

🙏 Acknowledgements

DuckDB - The analytical database system that powers this tool
Pandas - For data manipulation and analysis
DataBlist - For free large sample CSV files for testing.

📜 License

Distributed under the GNU Affero General Public License v3.0 License. See LICENSE for more information.