# Azure-Medallion-API
A data pipeline using the Medallion architecture (Bronze, Silver, Gold layers) on Microsoft Azure, with a RESTful API built using FastAPI to serve the processed data.
It ingests customer purchase data from a CSV file, processes it through multiple stages, and exposes it via an API.
## Project Structure

- **Bronze Layer**: Raw data ingestion into Azure Data Lake Storage.
- **Silver Layer**: Data cleaning and transformation using Azure Databricks.
- **Gold Layer**: Final processed data for consumption, stored in Azure Data Lake and loaded into Azure SQL Database.
- **API**: FastAPI application to query purchase data and analytics.
## Files

- `upload_to_bronze.py`: Uploads `random_data.csv` to the Bronze layer in Azure Data Lake Storage.
- `silver_processing.py`: Databricks notebook that cleans and transforms raw data into the Silver layer.
- `gold_processing.py`: Databricks notebook that processes Silver data into the Gold layer.
- `load_to_sql.py`: Loads Gold data into Azure SQL Database and creates analytics views.
- `api.py`: FastAPI application that serves purchase data and analytics.
## Prerequisites

- Azure Services:
  - Azure Data Lake Storage (containers: `bronze`, `silver`, `gold`)
  - Azure Databricks
  - Azure SQL Database
  - Azure Key Vault (optional, for secrets)
- Local Environment:
  - Python 3.9+
  - ODBC Driver 17 for SQL Server
  - Install dependencies:

    ```bash
    pip install pandas pyspark pyarrow fastapi uvicorn sqlalchemy pyodbc azure-storage-blob azure-identity
    ```
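The optional VM deployment below installs from a `requirements.txt`; a minimal, unpinned version mirroring the list above would be:

```text
pandas
pyspark
pyarrow
fastapi
uvicorn
sqlalchemy
pyodbc
azure-storage-blob
azure-identity
```

Pin versions as appropriate for your environment.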
## Configuration

Set the following environment variables:

- `STORAGE_ACCOUNT_NAME`: Azure Data Lake Storage account name
- `KEY_VAULT_URL`: Azure Key Vault URL (optional)
- `DB_USERNAME`, `DB_PASSWORD`, `DB_SERVER`, `DB_DATABASE`: SQL Database credentials (if not using Key Vault)
- `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, `AZURE_TENANT_ID`: for `ClientSecretCredential` (optional)

Also create a Databricks secret scope `medallion-kv` with the key `storage-account-name`.
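As a rough illustration of how a script such as `load_to_sql.py` or `api.py` might resolve these settings, here is a minimal sketch that prefers Key Vault when `KEY_VAULT_URL` is set and falls back to the environment variables otherwise. It assumes the `azure-keyvault-secrets` package (not in the dependency list above), and every secret name other than `db-username` is an illustrative guess.

```python
import os

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient


def get_db_config() -> dict:
    """Return SQL credentials from Key Vault if configured, else from env vars."""
    vault_url = os.getenv("KEY_VAULT_URL")
    if vault_url:
        # DefaultAzureCredential picks up AZURE_CLIENT_ID / AZURE_CLIENT_SECRET /
        # AZURE_TENANT_ID automatically (or a managed identity / az login session).
        client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
        # Secret names besides db-username are assumptions for illustration.
        return {
            "username": client.get_secret("db-username").value,
            "password": client.get_secret("db-password").value,
            "server": client.get_secret("db-server").value,
            "database": client.get_secret("db-database").value,
        }
    return {
        "username": os.environ["DB_USERNAME"],
        "password": os.environ["DB_PASSWORD"],
        "server": os.environ["DB_SERVER"],
        "database": os.environ["DB_DATABASE"],
    }
```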
## Setup Instructions

1. **Prepare the data**
   - Save your customer purchase data as `random_data.csv` locally.
   - Ensure it matches the expected schema (see `silver_processing.py`).
2. **Upload to Bronze**

   ```bash
   python upload_to_bronze.py
   ```

   Uploads `random_data.csv` to `bronze/random_data.csv`.
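For orientation, a minimal sketch of what this upload step looks like with `azure-storage-blob` and `DefaultAzureCredential`; the actual `upload_to_bronze.py` may differ in details.

```python
import os

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account = os.environ["STORAGE_ACCOUNT_NAME"]

# Authenticate via DefaultAzureCredential (service principal env vars, managed
# identity, or a local `az login` session).
service = BlobServiceClient(
    account_url=f"https://{account}.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Upload the local CSV into the bronze container.
blob = service.get_blob_client(container="bronze", blob="random_data.csv")
with open("random_data.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)

print("Uploaded random_data.csv to bronze/random_data.csv")
```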
3. **Process to Silver**
   - Run `silver_processing.py` in a Databricks notebook.
   - Outputs `cleaned_data.parquet` in `silver/`.
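A hedged sketch of the Silver step as it might appear in the Databricks notebook (`spark` and `dbutils` are provided by the notebook environment); the column names are placeholders, so check `silver_processing.py` for the real schema and cleaning rules.

```python
from pyspark.sql import functions as F

# Resolve the storage account name from the Databricks secret scope.
account = dbutils.secrets.get(scope="medallion-kv", key="storage-account-name")
bronze_path = f"abfss://bronze@{account}.dfs.core.windows.net/random_data.csv"
silver_path = f"abfss://silver@{account}.dfs.core.windows.net/cleaned_data.parquet"

# Read the raw CSV from the Bronze layer.
raw = spark.read.csv(bronze_path, header=True, inferSchema=True)

# Basic cleaning: deduplicate on the key column, drop rows without a key, and
# normalize a couple of text fields. UniqueID, LastName and City are assumptions.
cleaned = (
    raw.dropDuplicates(["UniqueID"])
       .dropna(subset=["UniqueID"])
       .withColumn("LastName", F.initcap(F.trim(F.col("LastName"))))
       .withColumn("City", F.trim(F.col("City")))
)

# Write the cleaned data to the Silver layer as Parquet.
cleaned.write.mode("overwrite").parquet(silver_path)
```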
4. **Process to Gold**
   - Run `gold_processing.py` in a Databricks notebook.
   - Outputs `purchases.parquet` in `gold/`.
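In the same spirit, a sketch of the Gold step: read the Silver Parquet, shape it for consumption, and write `purchases.parquet` to the gold container. The selected columns and the derived `TotalAmount` are illustrative, not necessarily what `gold_processing.py` computes.

```python
from pyspark.sql import functions as F

account = dbutils.secrets.get(scope="medallion-kv", key="storage-account-name")
silver_path = f"abfss://silver@{account}.dfs.core.windows.net/cleaned_data.parquet"
gold_path = f"abfss://gold@{account}.dfs.core.windows.net/purchases.parquet"

# Read the cleaned Silver data.
silver = spark.read.parquet(silver_path)

# Keep the consumption-ready purchase columns and add a derived total.
# Column names here are assumptions for illustration.
gold = (
    silver.select("UniqueID", "LastName", "City", "PurchaseType", "Price", "Quantity")
          .withColumn("TotalAmount", F.col("Price") * F.col("Quantity"))
)

# Write the Gold output consumed by load_to_sql.py.
gold.write.mode("overwrite").parquet(gold_path)
```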
5. **Load to SQL**

   ```bash
   python load_to_sql.py
   ```

   Loads data into the Azure SQL Database table `purchases` and creates the views `sales_by_type` and `sales_by_city`.
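A minimal sketch of this load step, assuming the Gold Parquet output is available locally as `purchases.parquet` (the real `load_to_sql.py` reads it from the gold container) and that credentials come from the environment variables described under Configuration. The view definitions and column names are assumptions.

```python
import os
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine, text

# Build a SQLAlchemy engine for Azure SQL via ODBC Driver 17.
params = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={os.environ['DB_SERVER']};"
    f"DATABASE={os.environ['DB_DATABASE']};"
    f"UID={os.environ['DB_USERNAME']};"
    f"PWD={os.environ['DB_PASSWORD']}"
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")

# Load the Gold output into the purchases table.
df = pd.read_parquet("purchases.parquet")
df.to_sql("purchases", engine, if_exists="replace", index=False)

# Create the analytics views served by the API.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE OR ALTER VIEW sales_by_type AS "
        "SELECT PurchaseType, SUM(TotalAmount) AS TotalSales "
        "FROM purchases GROUP BY PurchaseType"
    ))
    conn.execute(text(
        "CREATE OR ALTER VIEW sales_by_city AS "
        "SELECT City, SUM(TotalAmount) AS TotalSales "
        "FROM purchases GROUP BY City"
    ))
```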
6. **Run the API**

   ```bash
   python api.py
   ```

   Starts the FastAPI server at `http://localhost:8000`.
## API Endpoints

- `GET /purchases/?unique_id={unique_id}`
  - Returns a single purchase by `UniqueID`.
  - Example: `http://localhost:8000/purchases?unique_id=e7a8b3ee-72ee-4767-bfd0-66b44b789fa7`
- `GET /purchases/?lastname={lastname}`
  - Returns all purchases for a given `LastName` (case-insensitive).
  - Example: `http://localhost:8000/purchases?lastname=Bauer`
- `GET /analytics/sales_by_type`
  - Returns total sales by purchase type.
  - Example: `http://localhost:8000/analytics/sales_by_type`
- `GET /analytics/sales_by_city`
  - Returns total sales by city.
  - Example: `http://localhost:8000/analytics/sales_by_city`
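To make the endpoint behavior concrete, here is a trimmed sketch of what `api.py` could look like, reusing the connection style and the table/view and column names assumed in the earlier sketches; the real implementation may differ.

```python
import os
import urllib.parse
from typing import Optional

from fastapi import FastAPI, HTTPException
from sqlalchemy import create_engine, text

app = FastAPI(title="Azure-Medallion-API")

params = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={os.environ['DB_SERVER']};"
    f"DATABASE={os.environ['DB_DATABASE']};"
    f"UID={os.environ['DB_USERNAME']};"
    f"PWD={os.environ['DB_PASSWORD']}"
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")


def run_query(sql: str, **params_):
    """Run a parameterized query and return the rows as dicts."""
    with engine.connect() as conn:
        return [dict(row._mapping) for row in conn.execute(text(sql), params_)]


@app.get("/purchases/")
def get_purchases(unique_id: Optional[str] = None, lastname: Optional[str] = None):
    # Look up a single purchase by UniqueID, or all purchases for a LastName.
    if unique_id:
        rows = run_query("SELECT * FROM purchases WHERE UniqueID = :v", v=unique_id)
    elif lastname:
        rows = run_query("SELECT * FROM purchases WHERE LOWER(LastName) = LOWER(:v)", v=lastname)
    else:
        raise HTTPException(status_code=400, detail="Provide unique_id or lastname")
    if not rows:
        raise HTTPException(status_code=404, detail="No purchases found")
    return rows


@app.get("/analytics/sales_by_type")
def sales_by_type():
    return run_query("SELECT * FROM sales_by_type")


@app.get("/analytics/sales_by_city")
def sales_by_city():
    return run_query("SELECT * FROM sales_by_city")


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```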
## Deployment (Optional)

To deploy on an Azure VM:

1. Create an Ubuntu VM in Azure.
2. SSH into the VM:

   ```bash
   ssh <username>@<vm-public-ip>
   ```

3. Install dependencies (create `requirements.txt` with the dependencies listed under Prerequisites):

   ```bash
   sudo apt update
   sudo apt install python3-pip
   pip3 install -r requirements.txt
   ```

4. Copy the files to the VM (e.g., via SCP):

   ```bash
   scp api.py <username>@<vm-public-ip>:~/api.py
   ```

5. Run the API:

   ```bash
   nohup uvicorn api:app --host 0.0.0.0 --port 80 &
   ```

   Note that binding to port 80 typically requires root privileges; run with `sudo` or pick a higher port and adjust the inbound rule accordingly.
6. Open port 80 in the Azure portal (Networking > Inbound port rules).
7. Access the API at `http://<vm-public-ip>/`.
## Notes

- **Schema limitation**: `silver_processing.py` excludes some CSV fields (e.g., `Telephone`, `Email`). Update the schema if needed.
- **Error handling**: the scripts include logging and exception handling for debugging.
- **Security**: use Azure Key Vault for sensitive credentials in production.
## Troubleshooting

- **ODBC error**: install ODBC Driver 17 for SQL Server (`sudo apt install -y msodbcsql17` on Ubuntu).
- **Secret scope**: configure the `medallion-kv` scope in Databricks with the `storage-account-name` key.
- **Key Vault**: ensure the secrets (`db-username`, etc.) are set if using `KEY_VAULT_URL`.
## License

Distributed under the GNU Affero General Public License v3.0. See `LICENSE` for more information.