Tuesday, 16 December 2025

Spark Projects

 1. E-Commerce Product Catalog with SCD Type 2

Develop a product inventory system that tracks price changes over time using the Slowly Changing Dimension Type 2 approach.


-Ingest product data from a REST API (like Fake Store API)

-Parse JSON responses and transform data using Spark

-Handle missing values, null values, and duplicate records

-Calculate the percentage of good and bad records

-Process JSON data with different optimization techniques (caching, partitioning, bucketing, broadcast joins)

-Implement SCD2 in Delta tables to maintain price history

-Store results in both Delta tables and PostgreSQL

-Query both historical and current prices

-Structure the project into packages: create separate modules for extraction, transformation, and loading

-Generate different test-case scenarios to check the validity of each method

-Use Jenkins for CI/CD

-Use Linux commands to create a shell script that executes the Spark JAR file
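In production, the SCD2 step above would be a Delta `MERGE`, but the row-level logic is worth understanding first. Here is a minimal sketch in plain Scala; the `PriceRow` shape and its field names are illustrative assumptions, not the Fake Store API's schema:

```scala
import java.time.LocalDate

// Hypothetical row shape for the price dimension; field names are assumptions.
case class PriceRow(
  productId: Int,
  price: BigDecimal,
  effectiveFrom: LocalDate,
  effectiveTo: Option[LocalDate], // None while the row is current
  isCurrent: Boolean
)

object Scd2 {
  // SCD2: when the price changes, close the current row and append a new
  // current row, so the full price history is preserved.
  def applyUpdate(history: Seq[PriceRow], productId: Int, newPrice: BigDecimal,
                  asOf: LocalDate): Seq[PriceRow] = {
    val (current, rest) = history.partition(r => r.productId == productId && r.isCurrent)
    current.headOption match {
      case Some(row) if row.price == newPrice => history // no change, keep as-is
      case Some(row) =>
        rest ++ Seq(
          row.copy(effectiveTo = Some(asOf), isCurrent = false),
          PriceRow(productId, newPrice, asOf, None, isCurrent = true)
        )
      case None =>
        history :+ PriceRow(productId, newPrice, asOf, None, isCurrent = true)
    }
  }
}
```

In the Delta implementation the same decision becomes a `whenMatched` update (close the row) plus an insert of the new version in a single merge.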

-----------------------------------------------------------------------------------------------------------------------

2. Weather Data Pipeline with Performance Optimization

Create a batch pipeline that fetches weather data and optimizes Spark processing.


-Pull historical weather data from OpenWeather API

-Handle missing values, null values, and duplicate records

-Load the cleaned data

-Process JSON data with different optimization techniques (caching, partitioning, bucketing, broadcast joins)

-Store results in PostgreSQL for reporting

-Structure the project into packages: create separate modules for extraction, transformation, and loading

-Generate different test-case scenarios to check the validity of each method

-Prepare a JAR file for execution

-Use Jenkins for CI/CD

-Use Linux commands to create a shell script that executes the Spark JAR file
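The cleaning and good/bad accounting steps above map to `na.drop` and `dropDuplicates` on a DataFrame; the underlying logic can be sketched in plain Scala. The `WeatherRecord` shape and the validity rule are assumptions for illustration:

```scala
// Hypothetical flattened weather record; field names are assumptions.
case class WeatherRecord(city: String, date: String, tempC: Option[Double])

object Cleaning {
  // A record is "good" if it has a non-empty city and a temperature; duplicate
  // (city, date) pairs are also dropped. Returns the clean data and the
  // percentage of input records that were removed.
  def clean(records: Seq[WeatherRecord]): (Seq[WeatherRecord], Double) = {
    val good = records
      .filter(r => r.city.nonEmpty && r.tempC.isDefined)
      .distinctBy(r => (r.city, r.date))
    val badPct =
      if (records.isEmpty) 0.0
      else 100.0 * (records.size - good.size) / records.size
    (good, badPct)
  }
}
```

Keeping the rule in one function makes it easy to unit-test before wiring it into the Spark job.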

---------------------------------------------------------------------------------------------------------------------------

3. User Activity Tracker with SCD Type 1

Build a batch system tracking user profile updates where only the current state matters.


1) Generate mock user activity data or pull from a user API

2) Implement SCD1 logic to overwrite old records

3) Use Spark to process updates in daily batches

4) Store the final state in both Delta and PostgreSQL
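The SCD1 overwrite in step 2 is the simplest of the three patterns: the latest record wins and no history is kept. A minimal sketch, keyed by the natural key (email), with an assumed `UserProfile` shape:

```scala
// Hypothetical user profile row; field names are assumptions.
case class UserProfile(email: String, name: String, city: String)

object Scd1 {
  // SCD1: upsert by natural key; existing rows are overwritten, new keys inserted.
  def merge(target: Map[String, UserProfile],
            updates: Seq[UserProfile]): Map[String, UserProfile] =
    updates.foldLeft(target)((acc, u) => acc.updated(u.email, u))
}
```

In Delta this is a merge with `whenMatched.updateAll()` and `whenNotMatched.insertAll()` and no versioning columns.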


Data Modeling Concepts:


1) Slowly Changing Dimension Type 1: overwrite strategy, no history

2) Natural keys vs surrogate keys: use email as the natural key; generate a surrogate for joins

3) Denormalization: flatten nested JSON (address, location) for analytics

4) Data quality: check for invalid, null, or missing records

5) Data types: proper type selection (VARCHAR, TIMESTAMP, BOOLEAN)

6) Optimization techniques: repartitioning and coalesce

7) Constraints: primary keys, unique constraints, and not-null constraints. Indexing strategy: index frequently queried columns (user_id, email, last_updated)
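The natural-key vs surrogate-key point above can be made concrete with a small sketch: surrogate keys are assigned once per natural key and reused on every subsequent load. The `User`/`DimUser` shapes are illustrative assumptions:

```scala
case class User(email: String, name: String)
case class DimUser(userSk: Long, email: String, name: String)

object Keys {
  // Assign surrogate keys: known natural keys keep their existing surrogate,
  // new ones get the next free value. Assumes one incoming row per email.
  def assign(existing: Map[String, Long], incoming: Seq[User]): Seq[DimUser] = {
    var next = if (existing.isEmpty) 1L else existing.values.max + 1
    incoming.map { u =>
      val sk = existing.getOrElse(u.email, { val k = next; next += 1; k })
      DimUser(sk, u.email, u.name)
    }
  }
}
```

Facts then join on the compact `userSk` rather than the wider, mutable email string.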


1) Generate different test-case scenarios to check the validity of each method

2) Scala package structure: model, service, scd_type, main

3) Prepare a JAR file for execution

4) Use Jenkins for CI/CD

5) Use Linux commands to create a shell script that executes the Spark JAR file

---------------------------------------------------------------------------------------------------------------------------

4. Stock Market Historical Analyzer with SCD Type 3

Track limited stock history (current price, previous price, last change date) using batch loads.


1) Fetch historical stock data from the Alpha Vantage API

2) Apply a schema to the data

3) Data quality: check for invalid, null, or missing records

4) Implement SCD3 with current- and previous-value columns

5) Use Spark SQL for transformations

6) Optimize using caching and persist

7) Create a Delta table with optimized partitioning by date

8) Scala package structure: api, delta/postgres, main logic

9) Generate different test-case scenarios to check the validity of each method

10) Prepare a JAR file for execution

11) Use Jenkins for CI/CD

12) Use Linux commands to create a shell script that executes the Spark JAR file
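Unlike SCD2, SCD3 keeps limited history in extra columns rather than extra rows: one row per symbol, with the previous value shifting sideways on each change. A sketch with an assumed `StockRow` shape:

```scala
import java.time.LocalDate

// Hypothetical SCD3 row: history lives in columns, not rows.
case class StockRow(symbol: String, currentPrice: BigDecimal,
                    previousPrice: Option[BigDecimal], lastChange: Option[LocalDate])

object Scd3 {
  // On a price change, the current value shifts into previousPrice and the
  // change date is recorded; anything older than one step is discarded.
  def applyUpdate(row: Option[StockRow], symbol: String, price: BigDecimal,
                  asOf: LocalDate): StockRow = row match {
    case Some(r) if r.currentPrice == price => r   // unchanged
    case Some(r) => StockRow(symbol, price, Some(r.currentPrice), Some(asOf))
    case None    => StockRow(symbol, price, None, None)
  }
}
```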

----------------------------------------------------------------------------------------------------------------------------

5. Customer Orders Data Warehouse

Build a mini data warehouse combining all three SCD types with batch processing.


1) Customer data (SCD2: track address changes over time)

2) Product prices (SCD1: keep only the current price)

3) Payment methods (SCD3: current and previous method)

4) Ingest from multiple JSON files via API

5) Handle missing and null values as data quality checks

6) Implement daily incremental loads with Delta merge operations

7) Scala package structure:

com.datawarehouse.orders

    ├── domain

    │   ├── customer (Customer models + SCD2)

    │   ├── product (Product models + SCD1)

    │   ├── payment (Payment models + SCD3)

    │   ├── dimension (all dimension tables)

    │   └── fact (OrderFact, InventorySnapshot)

    ├── etl

    │   ├── extract

    │   ├── transform

    │   └── load

    ├── shared (common utilities)

    └── app (main application)

8) Generate different test-case scenarios to check the validity of each method

9) Prepare a JAR file for execution

10) Use Jenkins for CI/CD

11) Use Linux commands to create a shell script that executes the Spark JAR file
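The `fact` package above ties the warehouse together: fact rows reference dimensions by surrogate key, and orders that cannot be resolved against a dimension are routed to a reject set rather than silently dropped. A sketch, with all class and field names as illustrative assumptions:

```scala
// Hypothetical source order and resulting fact row.
case class Order(orderId: Int, customerEmail: String, productId: Int, amount: BigDecimal)
case class OrderFact(orderId: Int, customerSk: Long, productSk: Long, amount: BigDecimal)

object FactBuilder {
  // Resolve natural keys to surrogate keys; orders missing a dimension row
  // are returned separately for data-quality reporting.
  def build(orders: Seq[Order],
            customerSks: Map[String, Long],
            productSks: Map[Int, Long]): (Seq[OrderFact], Seq[Order]) = {
    val (ok, rejected) = orders.partition(o =>
      customerSks.contains(o.customerEmail) && productSks.contains(o.productId))
    val facts = ok.map(o =>
      OrderFact(o.orderId, customerSks(o.customerEmail), productSks(o.productId), o.amount))
    (facts, rejected)
  }
}
```

In Spark this becomes a join against the dimension tables, with the anti-join side feeding the reject path.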

----------------------------------------------------------------------------------------------------------------------------

6. Application Log Batch Processor

Process application logs in batches with a focus on Spark optimization techniques.


1) Read JSON logs from files or a REST endpoint

2) Handle data quality issues: missing, null, and bad records

3) Apply various optimization techniques: predicate pushdown, column pruning, and salting to handle any data skew

4) Include various Spark transformations and actions

5) Implement CDC (change data capture)

6) Store records in PostgreSQL

7) Execute PostgreSQL scripts, temp views, stored procedures, and user-defined functions

8) Generate different test-case scenarios to check the validity of each method

9) Scala package structure: api, schema, metrics, json_parser

10) Prepare a JAR file for execution

11) Use Linux commands to create a shell script that executes the Spark JAR file

12) Use Jenkins for CI/CD
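The salting technique in step 3 works by appending a random suffix to hot keys so their rows spread across several partitions instead of one; after the skewed aggregation you strip the salt and combine partial results. A minimal sketch (separator and bucket count are arbitrary choices):

```scala
import scala.util.Random

object Salting {
  private val Sep = "#" // assumed not to occur in real keys

  // Hot keys get a random bucket suffix; other keys pass through untouched.
  def salt(key: String, hotKeys: Set[String], buckets: Int, rnd: Random): String =
    if (hotKeys.contains(key)) s"$key$Sep${rnd.nextInt(buckets)}" else key

  // Recover the original key after the first-stage aggregation.
  def unsalt(salted: String): String = salted.split(Sep)(0)
}
```

In the Spark job the same idea is a `withColumn` that concatenates `rand()`-derived salt onto the skewed join/group key, followed by a second aggregation on the unsalted key.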


