The batch versus streaming ingestion decision is the single most architecture-defining choice a data engineer makes for a new pipeline. It dictates which AWS services are even on the table, what the cost structure looks like, how operations teams will alert on failure, and how downstream analytics consumers will see the data. Community study guides for the DEA-C01 exam — Tutorials Dojo, ExamCert.App, Digital Cloud Training, and the popular Medium walkthroughs — all flag the same pain point: candidates underestimate the nuance of choosing batch versus streaming ingestion under latency requirements. The exam plants scenario questions where the wrong answer is "Kinesis Data Firehose" because the requirement says "real-time" (Firehose is near-real-time), or "Lambda" because the requirement is "serverless" (Lambda has a 15-minute ceiling that breaks long ETL jobs). Getting batch vs streaming ingestion right means knowing the latency tiers, the service families, the cost dimensions, and the canonical traps cold.
This study note is built for the data engineer perspective. It covers what data ingestion means in modern AWS architectures, the batch vs streaming dichotomy, latency tiers from sub-second to daily, ETL vs ELT trade-offs, the AWS service decision tree, push vs pull models, fan-in and fan-out patterns, schema-on-read vs schema-on-write, idempotency and exactly-once semantics, the cost dimensions of each pattern, and the exam traps that catch most candidates. By the end the batch vs streaming ingestion decision should be a structured choice you can defend in any architecture review or exam scenario.
What Is Data Ingestion in AWS Data Engineering?
Data ingestion is the act of moving data from a source system into a destination where downstream consumers can analyze it. Sources include operational databases (RDS, DynamoDB, on-premises Oracle), SaaS applications (Salesforce, ServiceNow, Stripe), streaming sources (IoT devices, application event logs, clickstreams), and file shares (NFS, SMB, HDFS, S3 buckets in other accounts). Destinations include the data lake (S3), the data warehouse (Redshift), the search index (OpenSearch), the operational store (DynamoDB, Aurora), and downstream streams (another Kinesis stream, an SQS queue, a Lambda consumer).
The ingestion layer is the boundary between the operational world and the analytical world. Get it wrong and every downstream pipeline pays the cost — late data, missing records, schema drift, duplicate rows, exploding S3 bills, alert fatigue from flaky jobs. Get it right and downstream pipelines are boring, predictable, and cheap.
The Two Fundamental Patterns
Every ingestion pipeline lands in one of two patterns or a hybrid of both. Batch ingestion moves data in chunks on a schedule — every hour, every night, every week. Streaming ingestion moves data continuously as events occur — record-by-record or in micro-batches measured in seconds. The choice is driven by the latency requirement of downstream consumers, the throughput characteristics of the source, the cost budget, and the operational complexity the team can afford.
Batch ingestion is scheduled, bounded, and high-throughput; streaming ingestion is continuous, unbounded, and low-latency. A batch job has a defined start time, a defined end time, processes a known volume of data, and reports success or failure as a single unit. A streaming job runs forever, consumes an open-ended event stream, and is measured in records per second and end-to-end latency rather than total runtime. On the DEA-C01 exam, a scenario that says "ingest yesterday's transactions" or "process the daily file drop" is batch; a scenario that says "react within seconds" or "score events as they arrive" is streaming.
Batch Ingestion — Scheduled Jobs and Micro-Batch
Batch ingestion is the older, simpler, and still dominant pattern in enterprise data engineering. Most analytical workloads do not need sub-second freshness — yesterday's sales report does not change because of a transaction five minutes ago. Batch is cheaper per GB ingested, easier to reason about, and easier to recover when something fails.
Classic Batch — Hourly, Daily, Weekly
Classic batch jobs run on a fixed schedule. Examples: at 02:00 every day, AWS Glue reads yesterday's transaction file from an SFTP server via AWS Transfer Family, transforms it with PySpark, and writes Parquet partitions to the S3 data lake. At the top of every hour, an EMR step copies the last hour of CDC records from DMS-replicated S3 to Redshift via COPY. Once a week, a SageMaker Processing job re-aggregates customer behavior features for the modeling team.
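For concreteness, here is a minimal PySpark sketch of the nightly transform step — the bucket names, column names, and date argument are hypothetical, and a Glue job would wrap the same logic in a GlueContext:

```python
# Minimal sketch of a nightly batch transform (hypothetical paths and columns).
# A Glue job would wrap the same logic in a GlueContext; shown here as plain PySpark.
import sys
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-transactions-batch").getOrCreate()
# Only rewrite the partitions this run touches, so re-running a date is a clean backfill.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

run_date = sys.argv[1]  # e.g. "2024-06-01", passed in by the scheduler

raw = (spark.read
       .option("header", "true")
       .csv(f"s3://example-raw-zone/transactions/ingest_date={run_date}/"))

curated = (raw
           .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
           .withColumn("txn_date", F.to_date("txn_timestamp"))
           .dropDuplicates(["transaction_id"]))       # idempotent re-runs

(curated.write
 .mode("overwrite")
 .partitionBy("txn_date")
 .parquet("s3://example-curated-zone/transactions/"))
```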
The key characteristics: predictable resource usage, easy to monitor (success or failure of a known job), cheap relative to streaming for the same data volume, and easy to backfill (just re-run with a different date range). The downside: data is at least one batch interval stale, which is wrong for fraud detection, real-time personalization, IoT alerting, and operational dashboards that demand sub-minute freshness.
Micro-Batch — The Hybrid Sweet Spot
Micro-batch sits between classic batch and pure streaming. Instead of one large job per hour, a micro-batch pipeline processes small chunks (say, every 60 seconds) using the same batch tooling. AWS Glue Streaming jobs, Spark Structured Streaming on EMR, and even scheduled Lambda invocations every minute fall into this category. Latency drops to seconds-to-minutes while keeping batch's operational simplicity.
Micro-batch is the pragmatic answer for most "near-real-time" requirements. It is also the underlying engine for Kinesis Data Firehose, which buffers records for 60-900 seconds before writing to S3 in micro-batches.
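As a sketch of what micro-batch looks like in code, the following Spark Structured Streaming job (the same API surface Glue Streaming and EMR expose) processes whatever landed in the last interval every 60 seconds; paths and schema are illustrative:

```python
# Micro-batch sketch: Spark Structured Streaming with a 60-second processing-time trigger.
# Paths and schema are illustrative; a Glue Streaming job exposes the same Spark APIs.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("micro-batch-ingest").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .schema(schema)
          .json("s3://example-landing-zone/events/"))   # new files picked up each trigger

query = (events.writeStream
         .format("parquet")
         .option("path", "s3://example-curated-zone/events/")
         .option("checkpointLocation", "s3://example-curated-zone/_checkpoints/events/")
         .trigger(processingTime="60 seconds")          # the micro-batch interval
         .start())

query.awaitTermination()
```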
When Batch Beats Streaming
Pick batch when downstream consumers are reports, dashboards refreshed hourly or slower, machine learning training pipelines, monthly reconciliations, regulatory submissions, or any workload where freshness measured in hours is acceptable. Batch is also right when source data arrives in batches — a daily file drop or a nightly database extract — because converting it to a stream first adds zero value and significant complexity.
Streaming Ingestion — Continuous Event Flow
Streaming ingestion treats data as an unbounded sequence of events that must be processed as they arrive. The shift in mindset is from "process this finite chunk" to "react to every event with bounded latency forever."
Sub-Second Latency and Unbounded Datasets
Streaming pipelines target sub-second to single-digit-second end-to-end latency from event creation to consumer visibility. The dataset has no end — IoT sensors emit forever, clickstream events flow forever, transaction logs append forever. Operational metrics shift from "did the job succeed" to "what is my iterator age, what is my consumer lag, what is my message-in-flight count."
Real Use Cases for Streaming
Real-time fraud scoring: a transaction enters Kinesis Data Streams, a Lambda or Flink consumer scores it against a deployed ML model, and an approve-or-decline decision returns within 200 milliseconds. IoT telemetry: thousands of devices push readings to Kinesis, an aggregator computes rolling averages, and an alarm fires when a temperature crosses a threshold. Clickstream personalization: a user clicks a product, the event flows through MSK, Flink updates the user's session profile, and the next page render uses the updated profile.
True real-time (sub-second) latency requires Kinesis Data Streams or Amazon MSK with custom consumers — not Kinesis Data Firehose, not Lambda alone reading from a queue, and never Glue scheduled jobs. Kinesis Data Firehose buffers records for 60-900 seconds by default (and even its newer zero-buffering option still introduces latency on the delivery path), making it near-real-time at best. Lambda is suitable as a real-time consumer of Kinesis or MSK, but you must size concurrency, batch size, and parallelization factor correctly to maintain sub-second iterator age. The DEA-C01 exam plants this trap repeatedly: a question says "real-time" and includes a Firehose answer; the right answer is Data Streams plus a custom consumer.
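A hedged sketch of those consumer-tuning knobs, wiring a Lambda function to a Kinesis stream via an event source mapping — the stream ARN, function name, and chosen values are placeholders, not recommendations:

```python
# Sketch: attaching a Lambda consumer to a Kinesis stream with explicit tuning knobs.
# Stream/function names and the chosen values are illustrative.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-transactions",
    FunctionName="score-transaction",
    StartingPosition="LATEST",
    BatchSize=100,                      # records per invocation
    MaximumBatchingWindowInSeconds=0,   # do not wait to fill a batch (lowest latency)
    ParallelizationFactor=10,           # up to 10 concurrent invocations per shard
    BisectBatchOnFunctionError=True,    # isolate poison records on failure
    MaximumRetryAttempts=2,
)
```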
ETL vs ELT — Transform Before or After Load
The transformation timing decision sits inside the broader ingestion architecture and shapes which AWS services you reach for.
ETL — Transform Before Load
In ETL (Extract, Transform, Load), the transformation happens between the source and the destination. AWS Glue, EMR, and Lambda are common transformation engines that read raw data, apply schema enforcement, joins, aggregations, and validations, then write clean curated data to the destination. The destination is "presentation-ready" — Redshift gold tables, Athena-queryable Parquet partitions in S3.
ETL fits when the destination has expensive compute (Redshift), when downstream consumers expect a clean schema, when raw data must be discarded for compliance, or when the transformation is too complex to push to query time.
ELT — Load Then Transform
In ELT (Extract, Load, Transform), raw data lands first in the destination, and transformation runs as a downstream step. Redshift COPY loads raw data, then materialized views or scheduled SQL jobs transform it. S3 holds the raw zone, then Glue or Athena CTAS produces curated zones. EMR Hudi/Iceberg tables let you upsert raw data and build curated views with SQL.
ELT fits when storage is cheap (S3 always, Redshift RA3 with managed storage), when compute is elastic and only paid when querying, when data scientists want raw data access for exploration, and when downstream transformations are simpler than the orchestration of a full ETL job.
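As an illustration of the load-then-transform step, the sketch below issues an Athena CTAS statement through boto3 to build a curated table from a raw one; the database, table, and bucket names are hypothetical:

```python
# ELT sketch: raw data already sits in S3 and the Glue Data Catalog; the transform
# runs afterward as an Athena CTAS. Database, table, and bucket names are hypothetical.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE analytics.daily_sales_curated
WITH (format = 'PARQUET',
      external_location = 's3://example-curated-zone/daily_sales/',
      partitioned_by = ARRAY['sale_date'])
AS
SELECT order_id,
       CAST(amount AS DECIMAL(18,2)) AS amount,
       CAST(sale_ts AS DATE)         AS sale_date
FROM raw.sales_events
WHERE sale_ts >= DATE '2024-01-01'
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "raw"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```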
The Modern Lake House Hybrid
Most production AWS pipelines are hybrid. Raw data lands in S3 raw zone via DataSync, AppFlow, or Kinesis Firehose (E + L). Glue or EMR builds curated Parquet partitions (T). Redshift Spectrum queries the curated zone, or Redshift COPY pulls a subset into hot tables for sub-second BI queries (E + L again). The boundaries between ETL and ELT blur — what matters for the exam is recognizing which service plays which role.
Latency Tiers — Real-Time, Near-Real-Time, Batch
The latency requirement is the first filter on AWS service selection. Memorize three tiers.
Real-Time (under 1 second)
End-to-end from event creation to consumer visibility in under one second. Services: Kinesis Data Streams with custom consumers, Amazon MSK with custom Kafka consumers, Amazon Managed Service for Apache Flink for stream processing, Lambda with provisioned concurrency, MemoryDB for state. Use cases: fraud scoring, real-time bidding, online recommendations, IoT alerting.
Near-Real-Time (seconds to minutes)
End-to-end latency in the seconds-to-minutes range. Services: Kinesis Data Firehose (60-900 second buffer), Glue Streaming jobs, Lambda triggered by S3 PUT events, EventBridge Pipes with batching. Use cases: operational dashboards, log aggregation, anomaly detection, marketing event funnels.
Batch (hours to days)
End-to-end latency in the hours-to-days range. Services: AWS Glue scheduled jobs, EMR scheduled clusters, AWS Batch, scheduled Step Functions workflows, EventBridge Scheduler triggering Lambda or Glue, DMS full-load tasks. Use cases: nightly reporting, regulatory ETL, ML training data refresh, daily reconciliation.
Kinesis Data Firehose means near-real-time with a 60-second minimum buffering interval — never pick Firehose when the question requires real-time. Firehose buffers records by size (1-128 MB) or by time (60-900 seconds) before writing to S3, Redshift, OpenSearch, or HTTP endpoints. Even with the minimum 60-second buffer, end-to-end latency from PutRecord to S3 visibility is at least one minute. Candidates burn questions by reading "managed delivery to S3" and choosing Firehose, missing that the scenario specified "process events within 5 seconds." The answer in that case is Kinesis Data Streams plus a Lambda consumer or Managed Service for Apache Flink, with Firehose only acceptable as a tee to S3 archive.
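A minimal sketch of how those buffering hints are expressed when creating a Firehose delivery stream to S3 — the role, bucket, and stream names are placeholders:

```python
# Sketch: a Firehose delivery stream to S3 with explicit buffering hints.
# Role, bucket, and stream names are placeholders; values mirror the limits above.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="example-clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-raw-zone",
        "Prefix": "clickstream/",
        "BufferingHints": {
            "SizeInMBs": 64,           # flush at 64 MB ...
            "IntervalInSeconds": 60,   # ... or after 60 seconds, whichever comes first
        },
        "CompressionFormat": "GZIP",
    },
)
```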
Service Selection Decision Tree
The AWS data engineering ingestion landscape is wide. A structured decision tree filters it down to the right service in seconds.
Step 1 — Is It Streaming or Batch?
If events are continuous and downstream needs sub-minute reaction: streaming. If events arrive in scheduled chunks or analytical lag is acceptable: batch.
Step 2 — For Streaming, Pick the Right Service Family
Sub-second latency with Kafka ecosystem (existing on-premises Kafka, Kafka Connect, Schema Registry expectations): MSK. Sub-second latency with AWS-native simplicity: Kinesis Data Streams. Near-real-time delivery to S3/Redshift/OpenSearch with no custom consumer code: Kinesis Data Firehose. Stateful event-time stream processing with windowing: Managed Service for Apache Flink. Lightweight event routing with filtering: EventBridge Pipes.
Step 3 — For Batch, Pick by Source Type and Volume
Database CDC and full-load: AWS DMS. SaaS applications (Salesforce, ServiceNow, Stripe): AppFlow. File shares (NFS/SMB/HDFS): DataSync. Petabyte offline transfer: Snowball. Transformation-heavy ETL: AWS Glue or EMR. Lightweight serverless processing: Lambda triggered by S3 events or EventBridge Scheduler.
Step 4 — Consider Cost and Operational Overhead
Self-managed Kafka on EC2: cheapest per byte at very high throughput, highest operational burden. MSK Provisioned: managed Kafka, brokers you size. MSK Serverless: zero broker management, on-demand pricing. Kinesis Data Streams: per-shard pricing, on-demand mode for unpredictable workloads. Kinesis Data Firehose: per-record pricing, fully managed. Glue: DPU-hours billing, scales for batch. Lambda: per-invocation plus duration, cheapest for low-volume event-driven.
Push vs Pull Ingestion Models
Beyond batch vs streaming, ingestion architectures split by who initiates the data movement.
Push Models — Source Drives Delivery
In a push model, the source actively delivers data to the destination. Producers call PutRecord on Kinesis Data Streams, write to MSK topics, post to API Gateway, or trigger S3 PUT events that fire Lambda. Push is ideal for streaming and for source systems where the data engineer cannot run a pull agent.
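A minimal push-model sketch: the producer itself calls PutRecord, with the stream name and payload shape purely illustrative:

```python
# Push-model sketch: the producer delivers each event to Kinesis itself.
# Stream name and payload shape are illustrative.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"device_id": "sensor-0042", "temperature_c": 71.3, "ts": "2024-06-01T12:00:00Z"}

kinesis.put_record(
    StreamName="example-iot-telemetry",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],   # same device always hashes to the same shard
)
```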
Pull Models — Destination Drives Extraction
In a pull model, the destination polls or scans the source on a schedule. Glue crawlers walk an S3 prefix, AppFlow pulls Salesforce records, DMS replicates RDS into S3, AWS Transfer Family receives uploaded files. Pull is the default for batch and for source systems with stable APIs.
Hybrid Patterns Are Common
Many real architectures combine both: DynamoDB Streams (push from DDB) into Lambda (pull from stream) into Kinesis Firehose (push to S3). Recognizing the hybrid nature is part of architecting at the DEA-C01 level.
Fan-In and Fan-Out Patterns
Streaming architectures rarely have one producer and one consumer. The fan-in and fan-out patterns are foundational.
Fan-In — Many Producers to One Stream
Fan-in collapses many small producers into one consolidated stream. Thousands of IoT devices write to one Kinesis Data Stream, where shards distribute load by partition key. Hundreds of microservices publish events to one MSK topic. Fan-in simplifies downstream consumption and centralizes monitoring.
Fan-Out — One Stream to Many Consumers
Fan-out lets multiple independent consumers process the same stream without interference. Kinesis Data Streams supports Enhanced Fan-Out (EFO), giving each consumer a dedicated 2 MB/s read pipe per shard. MSK consumer groups let independent applications read the same topic at their own pace. Without fan-out, consumers compete for shared shard read throughput and slow each other down.
Use Enhanced Fan-Out (EFO) on Kinesis Data Streams when more than two consumers read the same stream — without EFO, consumers share the 2 MB/s read throughput per shard. Enhanced Fan-Out gives each registered consumer a dedicated 2 MB/s pipe per shard with sub-200 ms propagation latency, eliminating the shared read bottleneck. EFO costs more per consumer but is the right answer when the stream feeds multiple independent applications. The DEA-C01 exam plants scenarios where four consumers read the same stream and one falls behind — the fix is EFO, not adding shards. Adding shards raises total stream throughput, but consumers still share each shard's 2 MB/s read limit, so the per-consumer contention remains.
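A short sketch of what enabling EFO looks like — each independent application registers its own named consumer against the stream; the ARNs and names are placeholders:

```python
# Sketch: registering an Enhanced Fan-Out consumer so it gets a dedicated 2 MB/s
# pipe per shard. ARNs and names are placeholders.
import boto3

kinesis = boto3.client("kinesis")

stream_arn = "arn:aws:kinesis:us-east-1:123456789012:stream/example-transactions"

# Each independent application registers its own consumer name.
consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName="fraud-scoring-app",
)
consumer_arn = consumer["Consumer"]["ConsumerARN"]

# A KCL 2.x application, Managed Flink, or a Lambda event source mapping configured
# with this consumer ARN then reads via SubscribeToShard instead of shared GetRecords.
print(kinesis.list_stream_consumers(StreamARN=stream_arn)["Consumers"])
```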
Schema-on-Read vs Schema-on-Write
When does the schema get enforced? The choice has cost and reliability implications.
Schema-on-Write — Enforce at Ingestion
Schema-on-write enforces the schema at ingestion time. Records that do not conform are rejected before they reach the destination. Redshift COPY with strict typing, Glue Schema Registry validating Kafka producers, and DMS with column-level mapping all enforce schema-on-write. Benefits: clean downstream data, fail-fast on producer bugs, predictable analytical queries. Drawbacks: schema evolution requires coordinated producer and consumer changes; rejected records need a dead-letter strategy.
Schema-on-Read — Defer to Query Time
Schema-on-read accepts raw data into the lake and applies schema at query time. S3 raw zone with Athena (Glue Data Catalog applies schema for SELECT), Iceberg tables with schema evolution, and Hudi tables all defer enforcement. Benefits: fast ingestion, no producer-consumer coupling, schema can evolve without breaking historical data. Drawbacks: query-time errors when raw data is malformed, harder to alert on producer drift.
When to Choose Each
Schema-on-write fits regulated workloads, low-latency analytics, and stable mature pipelines. Schema-on-read fits exploratory data science, schema-flexible sources (JSON APIs, semi-structured logs), and lake architectures where the cost of rejecting valid data outweighs the cost of late-detected schema bugs.
Idempotency and Exactly-Once Semantics
Reliable data pipelines require careful handling of retries and duplicates.
Idempotency — Same Operation, Same Result
An idempotent operation produces the same final state regardless of how many times it is applied. Inserting a record keyed by transaction ID is idempotent if duplicate IDs are deduplicated. S3 PutObject is idempotent (writing the same key twice leaves one object), and Glue job bookmarks make re-runs idempotent by tracking already-processed data; Kinesis PutRecord is not — producer retries can write duplicate records, which is why consumer-side deduplication matters.
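A sketch of consumer-side deduplication, assuming a hypothetical DynamoDB table keyed by transaction_id: a conditional write turns duplicate deliveries into no-ops.

```python
# Sketch of consumer-side deduplication: a conditional write keyed by transaction_id
# makes reprocessing the same record a no-op. Table and key names are illustrative.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("example-processed-transactions")

def apply_once(record: dict) -> bool:
    """Insert the record only if its transaction_id has not been seen before."""
    try:
        table.put_item(
            Item=record,
            ConditionExpression="attribute_not_exists(transaction_id)",
        )
        return True          # first time this transaction was processed
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False     # duplicate delivery; safely ignored
        raise
```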
Exactly-Once — The Hard Guarantee
Exactly-once semantics guarantee each record is processed exactly one time, even under failure. Kinesis Data Streams plus the Kinesis Client Library (KCL) with checkpointing approximates exactly-once via at-least-once delivery plus consumer-side deduplication. MSK with Kafka transactions and idempotent producers offers exactly-once at the Kafka level. Managed Service for Apache Flink with checkpointing provides exactly-once for stream processing.
Why It Matters
Without exactly-once, a transient consumer crash plus restart can double-count revenue, reapply a charge, or duplicate a fraud alert. The DEA-C01 exam asks scenarios like "consumer crashed mid-batch, what happens to the records" — the answer hinges on whether the architecture has checkpointing, idempotent producers, or only at-least-once delivery.
Cost Dimensions Comparison
Service selection is also a cost optimization exercise.
Per-Shard / Per-PUT Pricing — Kinesis Data Streams
Kinesis Data Streams charges per shard-hour ($0.015/hr in us-east-1) plus per million PUT records. On-demand mode replaces shard-hour with a per-GB ingested rate. Costs scale with throughput; idle streams still incur shard-hour charges in provisioned mode.
Per-Record Pricing — Kinesis Data Firehose
Kinesis Data Firehose charges per GB ingested ($0.029/GB in us-east-1) with no minimum. Format conversion, dynamic partitioning, and VPC delivery add per-GB surcharges. Cheapest path to S3 for low to moderate volume; cost compounds for high-throughput streams.
DPU-Hours — AWS Glue
Glue ETL jobs charge per DPU-hour ($0.44/DPU-hour for standard Spark workers). Glue Streaming and Glue Studio jobs follow the same model. Glue crawlers charge per DPU-hour with a 10-minute minimum. Cost scales with data volume and transformation complexity.
Per-Invocation — Lambda
Lambda charges per request plus per GB-second of duration. For event-driven ingestion at low volume (under millions of events per day), Lambda is the cheapest path. Above that volume, Kinesis or Glue Streaming wins on cost-per-event.
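A back-of-the-envelope sketch that turns the prices quoted above into monthly figures; the Kinesis per-million-PUT price is left as a parameter because it is not listed in this section, and the workload numbers are purely illustrative:

```python
# Back-of-the-envelope cost sketch using the us-east-1 prices quoted above.
# The per-million-PUT price for Kinesis Data Streams is left as a parameter
# (not stated in this section); all workload numbers are illustrative.
HOURS_PER_MONTH = 730

def kds_provisioned_monthly(shards: int, put_millions: float,
                            shard_hour_usd: float = 0.015,
                            per_million_put_usd: float = 0.0) -> float:
    return shards * HOURS_PER_MONTH * shard_hour_usd + put_millions * per_million_put_usd

def firehose_monthly(gb_ingested: float, per_gb_usd: float = 0.029) -> float:
    return gb_ingested * per_gb_usd

def glue_monthly(dpu_hours: float, per_dpu_hour_usd: float = 0.44) -> float:
    return dpu_hours * per_dpu_hour_usd

# Example: roughly 1 TB/month at a steady ~0.4 MB/s, which fits in one shard.
print(f"KDS (1 shard):   ${kds_provisioned_monthly(1, put_millions=100):.2f}")
print(f"Firehose (1 TB): ${firehose_monthly(1000):.2f}")
print(f"Glue (20 DPU-h): ${glue_monthly(20):.2f}")
```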
Kinesis Data Streams equals real-time custom consumers, Firehose equals near-real-time managed delivery, Glue equals batch ETL, Lambda equals event-driven serverless. Memorize this four-tuple for the DEA-C01 exam. If the scenario word is "real-time" or "sub-second": Data Streams. If the scenario word is "near-real-time" or "managed delivery to S3/Redshift/OpenSearch": Firehose. If the scenario word is "scheduled batch" or "DPU" or "PySpark": Glue. If the scenario word is "event-driven" or "S3 PUT trigger" or "lightweight": Lambda. Combinations are common in production (Kinesis to Firehose to S3, Lambda fan-out to multiple Glue jobs), but the dominant term in the requirement narrows the answer fast.
Plain-Language Explanation — Batch vs Streaming Ingestion
The batch vs streaming choice is the kind of trade-off where naming alone misses the engineering reality. Three concrete analogies make the structure stick.
Analogy 1 — The Restaurant Walk-In Delivery vs Live Orders
Picture a busy restaurant. Once a day, a refrigerated truck pulls up to the back door and unloads the day's produce, proteins, and dry goods — that delivery is batch ingestion. The kitchen knows it is coming, has staff ready to receive, has space to stage everything, and can process the entire shipment as one unit. If the truck is two hours late, no one panics; tomorrow's prep just shifts. Now picture a customer at the counter ordering a coffee. The barista takes the order, pulls a shot, steams milk, and hands the cup back within ninety seconds — that ordering flow is streaming ingestion. Orders arrive continuously, each one is processed individually, and a delay of even sixty seconds creates a visible queue.
The walk-in delivery uses a forklift, pallet jacks, large refrigerators — the heavy infrastructure of batch (Glue, EMR, Snowball). The barista station uses a single espresso machine, a milk steamer, a cup rack — the lightweight per-event infrastructure of streaming (Kinesis, Lambda, MSK). Trying to bring the truck up to the counter for each customer is absurd; trying to batch a hundred drink orders before pulling shots also fails. The two ingestion patterns exist because both data flow patterns exist in the real world.
Analogy 2 — The Postal Mail Truck vs Phone Call
Picture two ways to send the same business message. Option one: the postal mail truck — you write the letter, drop it in a mailbox, the postal service collects mail twice a day, sorts it overnight, delivers tomorrow. End-to-end latency: 24-48 hours. Option two: a phone call — you dial, the recipient picks up, you speak, they hear immediately. End-to-end latency: under a second.
The postal system is batch ingestion — high throughput per delivery cycle (thousands of letters in one truck), low cost per piece, predictable delivery windows, easy to recover (lost letter? resend tomorrow). The phone call is streaming ingestion — minimum throughput per call (one conversation), higher cost per minute, sub-second latency, harder to recover (line dropped mid-sentence? you have to redial and remember where you stopped). The choice between mail and phone is not about preference; it is about whether the message must be acknowledged immediately. A monthly bill goes by mail. A medical emergency goes by phone. Picking the wrong channel wastes money or risks lives. The DEA-C01 exam works the same way.
Analogy 3 — The Quarterly Inventory Count vs Live Cash Register
Picture a retail store with two ways of tracking sales. The first is a quarterly inventory count — every three months, the store closes for a weekend, employees physically count every item on the shelves and in the stockroom, and the system reconciles to the official inventory. That is batch ingestion — high accuracy at the moment of measurement, large effort, infrequent, and the data is stale for almost the entire quarter. The second is the live cash register — every time a customer buys an item, the barcode scan immediately deducts inventory and updates the dashboard. That is streaming ingestion — every event captured continuously, dashboards always current, but more infrastructure to maintain (every register networked, every transaction logged).
A small store might run quarterly counts only — cheap, simple, and stale data is acceptable for slow-moving merchandise. A national chain with millions of SKUs runs both — cash registers for live tracking and quarterly counts for reconciliation. The DEA-C01 exam asks scenario questions where the right answer is the cash register pattern (Kinesis) and a wrong answer offers the quarterly count (Glue scheduled). Recognizing which pattern matches the requirement is the core skill.
Common Exam Traps for Batch vs Streaming Ingestion
The DEA-C01 exam plants a consistent set of traps. Memorize all five.
Trap 1 — Firehose for Real-Time Requirements
A scenario says "real-time fraud detection within 1 second" and offers Kinesis Data Firehose as an answer. Wrong. Firehose has a minimum 60-second buffer interval and is near-real-time at best. The right answer is Kinesis Data Streams plus a Lambda or Flink consumer.
Trap 2 — Lambda for Long-Running Batch ETL
A scenario describes a 4-hour transformation job over 100 GB of data and offers Lambda as a serverless answer. Wrong. Lambda has a 15-minute hard timeout. The right answer is AWS Glue or EMR for transformations of that size and duration.
Trap 3 — Glue Crawler Treating Each File as a Separate Table
A scenario describes a single S3 prefix with files at different schema versions and asks why the crawler creates dozens of tables. The cause is the crawler's schema heuristics deciding the files are not similar enough to belong to one table. The fix is the crawler TableGroupingPolicy=CombineCompatibleSchemas setting and a clean partition layout — not removing the crawler.
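A sketch of applying that grouping policy to an existing crawler via boto3 — the crawler name is a placeholder:

```python
# Sketch: applying the table-grouping policy so one crawler run produces one table
# for compatible schema versions under a prefix. The crawler name is a placeholder.
import json
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="example-transactions-crawler",
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)
```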
Trap 4 — Push vs Pull Confusion in Hybrid Architectures
A scenario describes Kinesis Data Streams feeding Lambda which writes to DynamoDB. The exam asks which side is push and which is pull. Producers push to Kinesis (PutRecord); Lambda is invoked by the Kinesis trigger (push from Kinesis perspective, but Lambda's poller pulls from the stream under the hood). The architecture is push end-to-end from the producer's view; pull internally on the Lambda side.
Trap 5 — Confusing Batch with Always Slow
Candidates assume batch always means hours of latency. Wrong. Micro-batch with Glue Streaming or Spark Structured Streaming on EMR delivers seconds-to-minutes latency while keeping batch's operational simplicity. The exam plants scenarios where micro-batch is the right answer for "near-real-time" requirements, with both pure-streaming (Kinesis Data Streams + Flink) and pure-batch (Glue scheduled hourly) as distractors.
Key Numbers and Must-Memorize Facts
Latency Tiers
- Real-time: under 1 second end-to-end (Kinesis Data Streams + custom consumer, MSK + custom consumer, Managed Flink)
- Near-real-time: seconds to minutes (Kinesis Data Firehose 60-900s buffer, Glue Streaming, Lambda S3 trigger)
- Batch: hours to days (Glue scheduled, EMR scheduled, DMS full-load)
Kinesis Data Streams Throughput
- 1 MB/s write per shard
- 2 MB/s read per shard (shared) or 2 MB/s read per consumer per shard (Enhanced Fan-Out)
- 1000 records/second write per shard
Kinesis Data Firehose Buffering
- Buffer size: 1-128 MB
- Buffer interval: 60-900 seconds (minimum 60 seconds)
- Output destinations: S3, Redshift (via S3), OpenSearch, HTTP endpoints, Splunk, Snowflake
Lambda Limits for Ingestion
- 15-minute hard timeout
- 10 GB max memory
- 6 MB synchronous payload, 256 KB asynchronous payload
Glue ETL Pricing
- $0.44 per DPU-hour (standard worker)
- 1-minute minimum billing per job on Glue 2.0 and later (crawlers bill with a 10-minute minimum)
- G.1X: 1 DPU, G.2X: 2 DPU, G.4X: 4 DPU, G.8X: 8 DPU per worker
DEA-C01 exam priority — Batch vs Streaming Ingestion Patterns. This topic carries weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.
FAQ — Batch vs Streaming Ingestion Top Questions
Q1 — How do I decide between Kinesis Data Streams and Kinesis Data Firehose?
The decision hinges on three factors: latency, custom consumer code, and destination flexibility. Pick Kinesis Data Streams when the latency requirement is sub-second, when you need custom consumer logic (filtering, enrichment, branching to multiple destinations with different rules), or when you need long retention (up to 365 days). Pick Kinesis Data Firehose when latency of 60+ seconds is acceptable, when the destination is S3/Redshift/OpenSearch/HTTP and you do not want to write consumer code, and when you want managed buffering, format conversion (JSON to Parquet), and dynamic partitioning. The DEA-C01 exam plants this distinction repeatedly — read the latency requirement first, then pick.
Q2 — When is Lambda inappropriate for data ingestion?
Lambda is inappropriate when any of these apply: the processing duration exceeds 15 minutes (Lambda's hard timeout); the memory requirement exceeds 10 GB; the payload exceeds 6 MB synchronously or 256 KB asynchronously; the workload is sustained high throughput where per-invocation cost beats per-shard or per-DPU cost; or the consumer needs sticky state across invocations (Lambda is stateless). For long ETL, use Glue or EMR. For very high throughput streaming, use Kinesis with Flink. For lightweight event-driven micro-tasks, Lambda remains the right choice.
Q3 — What is the difference between micro-batch and streaming?
Micro-batch processes small chunks (60 seconds to several minutes) using batch tooling — Glue Streaming jobs, Spark Structured Streaming, scheduled Lambda. Streaming processes record-by-record (or in tiny per-shard batches) using purpose-built streaming engines — Kinesis Data Streams + KCL consumer, MSK consumer, Managed Flink. Micro-batch latency floor is the batch interval (typically 60 seconds); streaming latency floor is sub-second. Operationally, micro-batch is simpler — each batch succeeds or fails as a unit, retry semantics are clear, idempotency is easy. Streaming is more complex — checkpointing, watermarks, late-arriving data, exactly-once semantics. The exam asks scenario questions where "near-real-time" plus "minimal operational complexity" points to micro-batch (Glue Streaming) and "real-time" plus "stateful processing" points to Flink.
Q4 — How do I handle backpressure in a streaming pipeline?
Backpressure happens when consumers cannot keep up with producers. Symptoms: rising iterator age in Kinesis, growing consumer lag in MSK, throttled PutRecord calls. Fixes depend on the bottleneck. If the consumer is the bottleneck: add Lambda concurrency, increase Flink parallelism, scale out KCL workers, or use Enhanced Fan-Out for parallel consumer paths. If the stream is the bottleneck: add shards (KDS), add partitions (MSK), or switch to on-demand capacity. If the producer is bursting: add a buffering layer (SQS in front of Lambda, MSK as a buffer in front of consumers). The DEA-C01 exam tests whether candidates understand iterator age and consumer lag as the diagnostic signals — read those first, then pick the fix.
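A small sketch of reading that diagnostic signal for a Kinesis stream before choosing a fix; the stream name and the one-minute threshold are illustrative:

```python
# Sketch: reading the backpressure signal (iterator age) for a Kinesis stream
# before deciding on a fix. Stream name and thresholds are illustrative.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "example-transactions"}],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Maximum"],
)

worst_ms = max((p["Maximum"] for p in resp["Datapoints"]), default=0)
if worst_ms > 60_000:   # consumers are more than a minute behind
    print("Backpressure: scale the consumer (concurrency/parallelism) or add EFO.")
else:
    print(f"Healthy: max iterator age {worst_ms:.0f} ms over the last 15 minutes.")
```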
Q5 — What is exactly-once and why is it hard?
Exactly-once semantics guarantee each record is processed one time, even with retries and failures. It is hard because at-least-once (the default in most streaming systems) plus consumer crash equals duplicate processing. True exactly-once requires either idempotent operations (consumer can safely apply the same record twice with no side effect) or transactional coordination (consumer and destination commit together, or both roll back). Kinesis with KCL plus a deduplication store (DynamoDB by sequence number) approximates exactly-once. MSK with Kafka transactions and idempotent producers offers exactly-once at the Kafka layer. Managed Flink with checkpointing and end-to-end exactly-once sinks (S3, Kafka) provides true exactly-once for stream processing. The exam plants scenarios like "consumer restarted mid-batch, prevent duplicates" — the answer hinges on whether the architecture has these guarantees.
Q6 — When should I pick AWS Glue over EMR for batch ingestion?
Pick Glue when the workload is standard ETL (read source, transform, write target) with no specialized framework requirement, when you want serverless with no cluster management, when DPU-based pricing matches your usage profile, or when you want native integration with the Glue Data Catalog and Data Quality. Pick EMR when you need custom Spark configuration beyond Glue's options, when the workload uses Hive/HBase/Presto/Trino directly, when you need long-running cluster economics for very high volume, or when open table formats (Iceberg, Hudi, Delta Lake) need EMR's mature integration. For most DEA-C01 scenarios, Glue wins on simplicity and managed ops; EMR wins when the requirement explicitly calls for custom Spark code or non-Spark frameworks.
Q7 — How do I choose between push and pull ingestion models?
Push when the source is event-driven (IoT, application events, S3 PUT notifications), when you do not control the source enough to deploy a pull agent, or when latency requires the source to deliver immediately. Pull when the source has a stable API (SaaS, RDBMS), when you control the schedule, or when batch is acceptable. The DEA-C01 exam asks scenario questions like "ingest from Salesforce daily" (pull, AppFlow), "react to S3 uploads" (push, S3 event notification + Lambda), or "stream from on-premises Kafka" (push, MSK Connect or self-managed connector). Recognizing the source type points at the ingestion model fast.
Further Reading — Official AWS Documentation
The authoritative AWS sources for batch vs streaming ingestion patterns are: the Build Modern Data Streaming Architectures on AWS whitepaper (covers Kinesis, MSK, Flink, fan-in/fan-out, real-time vs near-real-time distinctions); the Storage Best Practices for Data and Analytics Applications whitepaper (covers ingestion methods, S3 lifecycle, data lake patterns); the Big Data Analytics Options on AWS whitepaper (covers Glue, Athena, Redshift, EMR, MSK service comparison); the AWS Glue Best Practices whitepaper (covers Glue ETL operational patterns, job bookmarks, partition handling); the Kinesis Data Streams Developer Guide; the Amazon Data Firehose Developer Guide; the AWS Glue Developer Guide; and the AWS DMS User Guide.
The AWS Big Data Blog has multiple deep-dive posts on ingestion architecture decisions, lambda architecture vs kappa architecture, and real customer case studies. The AWS Well-Architected Data Analytics Lens covers the data ingestion pillar in depth. The Skill Builder course Exam Prep Standard Course: AWS Certified Data Engineer – Associate (DEA-C01) has a Domain 1 module that walks through every ingestion service with hands-on scenarios. Finally, the AWS Solutions Library has reference architectures for streaming analytics, real-time fraud detection, and batch ETL pipelines that map directly to the DEA-C01 exam scenarios.