Project Documentation
PART 1 — IAM Roles
Role 1 — Lambda Role
IAM → Roles → Create role
→ Trusted entity type: AWS Service
→ Service: Lambda
→ Click Next
→ Skip adding managed policies for now → click Next
→ Role name: lambda-pipeline-execution-role
→ Create role
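Choosing Lambda as the trusted entity makes the console attach a trust policy that lets the Lambda service assume the role. It is equivalent to this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```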
Now add permissions. Search for your new role and click it:
→ Permissions tab → Add permissions → Create inline policy
→ Click JSON tab → paste this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartCrawler",
        "glue:GetCrawler",
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::e-commerce-data-kpc",
        "arn:aws:s3:::e-commerce-data-kpc/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    }
  ]
}
```

→ Click Next
→ Policy name: lambda-pipeline-policy
→ Create policy
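The same two console steps can also be scripted. A minimal boto3 sketch, assuming the inline policy JSON above has been saved locally as lambda-pipeline-policy.json (a hypothetical filename):

```python
import json
import boto3

iam = boto3.client('iam')

# Trust policy equivalent to picking "Lambda" as the trusted entity
trust = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'lambda.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}

iam.create_role(
    RoleName='lambda-pipeline-execution-role',
    AssumeRolePolicyDocument=json.dumps(trust),
)

# Inline policy: the JSON document shown above, read from a local file
with open('lambda-pipeline-policy.json') as f:
    iam.put_role_policy(
        RoleName='lambda-pipeline-execution-role',
        PolicyName='lambda-pipeline-policy',
        PolicyDocument=f.read(),
    )
```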
Role 2 — Glue Role
IAM → Roles → Create role
→ Trusted entity type: AWS Service
→ Service: Glue
→ Click Next
→ Search "AWSGlueServiceRole" → check it → Next
→ Role name: glue-pipeline-service-role
One Role — Two Policies Inside It
```
glue-pipeline-service-role
├── AWSGlueServiceRole  ← managed policy (attached during role creation)
└── glue-s3-access      ← inline policy (you add this after)
```

→ Create role
→ Permissions tab → Add permissions → Create inline policy
→ JSON tab → paste this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::e-commerce-data-kpc",
        "arn:aws:s3:::e-commerce-data-kpc/*"
      ]
    }
  ]
}
```
→ Policy name: glue-s3-access
→ Create policy
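To confirm the one-role-two-policies shape from outside the console, a quick read-only boto3 check (assuming credentials with IAM read access):

```python
import boto3

iam = boto3.client('iam')
role = 'glue-pipeline-service-role'

# Managed policies attached to the role (expect AWSGlueServiceRole)
attached = iam.list_attached_role_policies(RoleName=role)
print([p['PolicyName'] for p in attached['AttachedPolicies']])

# Inline policies embedded in the role (expect glue-s3-access)
inline = iam.list_role_policies(RoleName=role)
print(inline['PolicyNames'])
```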
PART 2 — S3 Folders

S3 → e-commerce-data-kpc → Create folder

Create these 4 folders one by one:

| Folder name        |
|--------------------|
| raw                |
| etl_processed_data |
| athena-results     |
| glue-scripts       |

Then upload `orders.json` into the `raw/` folder and the `etl_ecommerce.py` script into the `glue-scripts/` folder.

PART 3 — Glue Setup

Create Database
Glue → Databases → Add database
→ Name: ecommerce_db
→ Create

Create Crawler
Glue → Crawlers → Create crawler
→ Name: ecommerce-crawler
→ Next
→ Add data source → S3
→ S3 path: s3://e-commerce-data-kpc/raw/
→ Add → Next
→ IAM Role: glue-pipeline-service-role
→ Next
→ Target database: ecommerce_db
→ Next → Create crawler

Create ETL Job
Glue → ETL Jobs → Script editor
Engine: Spark
Start fresh
Job name: etl_ecommerce
IAM Role: glue-pipeline-service-role
Glue version: Glue 4.0
Worker type: G.1X
Number of workers: 2
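This post doesn't reproduce `etl_ecommerce.py` itself, so here is a minimal sketch of what a Glue Spark script for this layout could look like. The table name `raw` is an assumption (the crawler typically names the table after the S3 folder it crawled), and the transformation step is left as a placeholder:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the table the crawler created from raw/ (table name is an assumption)
orders = glueContext.create_dynamic_frame.from_catalog(
    database='ecommerce_db', table_name='raw'
)

# ...transformations would go here...

# Write the processed output as Parquet to etl_processed_data/
glueContext.write_dynamic_frame.from_options(
    frame=orders,
    connection_type='s3',
    connection_options={'path': 's3://e-commerce-data-kpc/etl_processed_data/'},
    format='parquet',
)
job.commit()
```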
Paste the ETL script in the editor and click Save.

PART 4 — Lambda Functions

Lambda 1 — s3-crawler-trigger
Lambda → Create function
Function name: s3-crawler-trigger
Runtime: Python 3.12
Permissions → Use existing role → lambda-pipeline-execution-role
Create function
Paste code → click Deploy:

```python
import boto3

CRAWLER_NAME = 'ecommerce-crawler'

def lambda_handler(event, context):
    glue = boto3.client('glue')

    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    print(f"File uploaded: s3://{bucket}/{key}")

    response = glue.get_crawler(Name=CRAWLER_NAME)
    state = response['Crawler']['State']
    print(f"Crawler state: {state}")

    if state == 'RUNNING':
        print("Crawler already running — skipping")
        return {"status": "skipped"}

    glue.start_crawler(Name=CRAWLER_NAME)
    print("Crawler started!")
    return {"status": "success"}
```

Set timeout:
Configuration → General configuration → Edit
→ Timeout: 1 min 0 sec → Save
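You can test Lambda 1 from the console with a test event shaped like a minimal S3 notification, containing only the fields the handler reads (bucket and key values here are just this project's):

```json
{
  "Records": [
    {
      "s3": {
        "bucket": { "name": "e-commerce-data-kpc" },
        "object": { "key": "raw/orders.json" }
      }
    }
  ]
}
```

Note that a successful test will actually start the crawler if it is idle.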
Lambda 2 — etl-job-trigger
Lambda → Create function
→ Function name: etl-job-trigger
→ Runtime: Python 3.12
→ Permissions → Use existing role → lambda-pipeline-execution-role
→ Create function
Paste code → click Deploy:

```python
import boto3

ETL_JOB_NAME = 'etl_ecommerce'

def lambda_handler(event, context):
    glue = boto3.client('glue')
    print(f"Event received: {event}")

    state = event['detail']['state']
    crawler_name = event['detail']['crawlerName']
    print(f"Crawler: {crawler_name} | State: {state}")

    if state != 'Succeeded':
        print(f"Crawler state was {state} — ETL not triggered")
        return {"status": "skipped"}

    response = glue.start_job_run(JobName=ETL_JOB_NAME)
    print(f"ETL job started! RunId: {response['JobRunId']}")
    return {"status": "success", "jobRunId": response['JobRunId']}
```

Set timeout:
Configuration → General configuration → Edit
→ Timeout: 1 min 0 sec → Save

PART 5 — S3 Event Notification

S3 → e-commerce-data-kpc → Properties
→ Event notifications → Create event notification

| Field           | Value                     |
|-----------------|---------------------------|
| Event name      | json-upload-trigger       |
| Prefix          | raw/                      |
| Suffix          | .json                     |
| Event types     | All object create events  |
| Destination     | Lambda function           |
| Lambda function | `s3-crawler-trigger`      |

→ Save changes

This automatically adds the S3 permission to invoke Lambda 1.

PART 6 — EventBridge Rule

EventBridge → Rules → Create rule
→ Name: on-crawler-complete
→ Event bus: default
→ Rule type: Rule with an event pattern
→ Next

In the event pattern section:
→ Event source: AWS services
→ AWS service: Glue
→ Event type: Glue Crawler State Change
→ Switch to "Edit pattern" and paste:

```json
{
  "source": ["aws.glue"],
  "detail-type": ["Glue Crawler State Change"],
  "detail": {
    "crawlerName": ["ecommerce-crawler"],
    "state": ["Succeeded"]
  }
}
```

→ Next
→ Target type: AWS service
→ Select a target: Lambda function
→ Function: etl-job-trigger
→ Next → Next → Create rule

This automatically adds the EventBridge permission to invoke Lambda 2.

PART 7 — Athena Setup

Athena → Settings → Manage
→ Query result location: s3://e-commerce-data-kpc/athena-results/
→ Save

Final Checklist

- IAM: lambda-pipeline-execution-role created with inline policy
- IAM: glue-pipeline-service-role created with AWSGlueServiceRole + inline policy
- S3: folders created (raw, etl_processed_data, athena-results, glue-scripts)
- S3: ETL script uploaded to glue-scripts/
- Glue: ecommerce_db database created
- Glue: ecommerce-crawler created → points to raw/
- Glue: etl_ecommerce job created with script
- Lambda 1: s3-crawler-trigger deployed with 1 min timeout
- Lambda 2: etl-job-trigger deployed with 1 min timeout
- S3 Event Notification → s3-crawler-trigger (auto-adds permission)
- EventBridge Rule → etl-job-trigger (auto-adds permission)
- Athena: result location set
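Once every box is checked, you can smoke-test the whole pipeline from a script instead of the console. A sketch, assuming `orders.json` sits in your working directory and default AWS credentials; the crawler usually takes a minute or two to finish:

```python
import time
import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

# 1. Upload the file: S3 notification → s3-crawler-trigger → crawler starts
s3.upload_file('orders.json', 'e-commerce-data-kpc', 'raw/orders.json')

# 2. Give Lambda 1 a moment to start the crawler, then poll until it is READY again
time.sleep(30)
while glue.get_crawler(Name='ecommerce-crawler')['Crawler']['State'] != 'READY':
    time.sleep(15)

# 3. On Succeeded, EventBridge → etl-job-trigger should have started the ETL job
runs = glue.get_job_runs(JobName='etl_ecommerce', MaxResults=1)
print(runs['JobRuns'][0]['JobRunState'] if runs['JobRuns'] else 'no run yet')
```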