
DynamoDB, Aurora, and Vector Stores for ML Serving

4,000 words · ≈ 20 min read

DEA-C01 Domain 2 Task 2.1 data store selection: DynamoDB single-digit ms NoSQL, Aurora MySQL/PostgreSQL with pgvector, MemoryDB Redis HNSW/IVF vector indexes, OLTP vs OLAP boundaries, and DynamoDB Streams CDC patterns for ML serving.


DynamoDB, Aurora, and vector stores form the operational data layer that AWS data engineers wire into pipelines whenever a workload needs sub-millisecond reads, transactional writes, or similarity search over embeddings. On the DEA-C01 exam Domain 2 Task 2.1 routinely tests a single decision: given a workload description, which store is right? The trap is that DynamoDB, Aurora, and vector stores all look like databases, but they have different latency profiles, different consistency models, different scaling primitives, and entirely different cost shapes. A data engineer who picks Aurora for a 100 K writes-per-second IoT ingestion endpoint will burn money and miss SLAs; a data engineer who picks DynamoDB for a complex 10-table relational join will hit a brick wall on query expressivity.

This guide walks DynamoDB, Aurora, and the vector-store family (Aurora pgvector, MemoryDB, OpenSearch k-NN) through the lens of Data Engineer / MLOps work — choosing data stores, wiring CDC into pipelines, exporting to data lakes, and feeding ML serving paths. It covers DynamoDB capacity modes and Streams, Aurora MySQL versus PostgreSQL versus Aurora Serverless v2, pgvector for embedding search, MemoryDB for sub-millisecond vector lookup, OpenSearch k-NN for hybrid keyword-plus-vector retrieval, the OLTP-versus-OLAP boundary that defines when DynamoDB or Aurora is wrong, and the canonical DEA-C01 traps planted around DynamoDB analytics, vector index types, and zero-ETL integration.

Data Store Selection Framework — OLTP, OLAP, Streaming, Vector

Every DEA-C01 store-selection question can be answered by mapping the workload onto one of four patterns.

OLTP — Transactional Reads And Writes

Online Transaction Processing means short-lived reads and writes with single-digit millisecond latency, predictable per-request keys, and high concurrency. Examples: shopping cart updates, user profile reads, IoT device state, order placement. DynamoDB and Aurora are both OLTP stores; DynamoDB scales horizontally without limit while Aurora scales vertically and via read replicas with stricter relational semantics.

OLAP — Analytical Aggregations

Online Analytical Processing means long-running reads that aggregate over millions or billions of rows, joins across many tables, and column-wise scans. Redshift, Athena, and EMR are OLAP engines. Running an OLAP query against DynamoDB is the canonical wrong-answer trap.

Streaming — Continuous Event Flow

Streaming workloads land events at high rates with sub-second freshness expectations. Kinesis Data Streams, MSK, and DynamoDB Streams are the streaming primitives. DynamoDB Streams in particular doubles as a CDC source feeding Lambda or Kinesis pipelines.

Vector — Similarity Search Over Embeddings

Vector workloads store high-dimensional float arrays (embeddings from a model like Bedrock Titan or OpenAI) and serve approximate-nearest-neighbor (ANN) queries to find the K most similar vectors. Aurora pgvector, MemoryDB vector search, and OpenSearch k-NN are the AWS options. RAG (retrieval-augmented generation) and recommender systems are the canonical use cases.

Amazon DynamoDB — Key-Value And Document NoSQL

DynamoDB is the AWS-native fully managed NoSQL service and the default OLTP store on the exam.

Latency And Throughput Profile

DynamoDB delivers single-digit millisecond P99 read and write latency at any scale — the AWS marketing claim is operationally true because the storage engine is partition-sharded across SSDs and the API is designed to route every request to a single partition. There is no cluster to size, no minimum cost beyond per-request billing.

Capacity Modes — On-Demand vs Provisioned

DynamoDB tables run in one of two capacity modes. On-demand auto-scales reads and writes per request and bills per request unit; ideal for unpredictable workloads or new applications without traffic history. Provisioned reserves read and write capacity units (RCU and WCU) per second, costs less per request at sustained load, and supports auto-scaling rules. The DEA-C01 trap is the cost calculation: a workload with steady predictable traffic costs roughly 5x to 7x more on on-demand than equivalent provisioned. Switch from on-demand to provisioned once traffic stabilizes.
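
A minimal boto3 sketch of that switch, assuming a hypothetical orders table and capacity numbers sized from observed steady traffic (DynamoDB allows changing billing mode roughly once per 24 hours):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Move a table from on-demand (PAY_PER_REQUEST) to provisioned capacity
    dynamodb.update_table(
        TableName="orders",                    # hypothetical table name
        BillingMode="PROVISIONED",
        ProvisionedThroughput={
            "ReadCapacityUnits": 500,          # sized from observed steady traffic
            "WriteCapacityUnits": 200,
        },
    )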

Partition Keys And Hot Partitions

Every DynamoDB item has a partition key (and optionally a sort key forming a composite key). DynamoDB hashes the partition key to assign the item to a partition. A workload that writes 90 percent of items to the same partition key creates a hot partition that throttles regardless of provisioned capacity — the canonical anti-pattern. Choose high-cardinality partition keys (user_id, device_id, order_id) and avoid low-cardinality keys (region, status, date).

Global Tables — Multi-Region Active-Active

DynamoDB Global Tables replicate writes asynchronously across regions with last-writer-wins conflict resolution. Use cases: globally distributed user bases needing low-latency reads worldwide, disaster recovery without manual failover. The exam plants a trap by suggesting Global Tables for transactional consistency across regions — wrong, the model is eventual.

DynamoDB As An OLTP Source — Not An Analytics Store

DynamoDB is excellent at point lookups and range scans within a partition. It is poor at full-table aggregations, multi-table joins, and ad hoc analytics. The right pattern: DynamoDB serves the application's transactional reads, and downstream pipelines export data to S3 (via DynamoDB export to S3 or DynamoDB Streams) for analytics in Athena, Redshift, or EMR.

DynamoDB is a key-value and document NoSQL database with single-digit millisecond P99 latency, partition-sharded storage, and per-request billing — designed for OLTP workloads with high-cardinality access patterns, not for ad hoc analytics. The DEA-C01 exam plants this as both a positive and negative scenario: positive ("the application needs sub-10ms reads at 100K writes per second" => DynamoDB), negative ("the marketing team needs to run aggregation queries on the customer table" => not DynamoDB, export to S3 + Athena or Redshift instead). The right architecture pattern is DynamoDB for the transactional path plus DynamoDB Streams, export to S3, or a zero-ETL integration for the analytical path. Treating DynamoDB as a general-purpose database is the canonical wrong answer.

DynamoDB Streams — CDC Source For Pipelines

DynamoDB Streams is the change data capture feed that turns DynamoDB into a streaming source.

How It Works

When enabled, DynamoDB Streams captures every item-level write (insert, update, delete) into a 24-hour ordered log per partition. Consumers — typically Lambda functions or Kinesis Data Streams via the DynamoDB-to-Kinesis connector — read the stream and react to changes. Stream view types control what each record contains: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, or NEW_AND_OLD_IMAGES.
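
A short boto3 sketch, assuming a hypothetical orders table, that enables Streams with full before-and-after images:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Turn on DynamoDB Streams with NEW_AND_OLD_IMAGES so consumers see both versions
    dynamodb.update_table(
        TableName="orders",
        StreamSpecification={
            "StreamEnabled": True,
            "StreamViewType": "NEW_AND_OLD_IMAGES",  # or KEYS_ONLY / NEW_IMAGE / OLD_IMAGE
        },
    )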

Lambda Trigger Pattern

The most common DEA-C01 pattern: DynamoDB Streams triggers a Lambda function that fans out item changes to downstream destinations — writing to S3 for analytics, sending SNS notifications, updating an OpenSearch index for full-text search. Lambda handles retries, batching (up to 10 K records per invocation), and parallelization (one concurrent invocation per shard).
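
A hedged sketch of the Lambda side of this pattern, forwarding new item images to a Firehose delivery stream bound for S3; the table, stream, and delivery-stream names are hypothetical:

    import json
    import boto3

    firehose = boto3.client("firehose")

    def handler(event, context):
        records = []
        for record in event["Records"]:
            if record["eventName"] in ("INSERT", "MODIFY"):
                # NewImage is present when the stream view type includes new images
                new_image = record["dynamodb"].get("NewImage", {})
                records.append({"Data": (json.dumps(new_image) + "\n").encode()})
        # Firehose caps PutRecordBatch at 500 records per call, so chunk the batch
        for i in range(0, len(records), 500):
            firehose.put_record_batch(
                DeliveryStreamName="orders-cdc-to-s3",   # hypothetical delivery stream
                Records=records[i : i + 500],
            )
        return {"forwarded": len(records)}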

Kinesis Data Streams For DynamoDB

For high-volume tables or longer retention, DynamoDB can publish item changes directly to a Kinesis Data Stream (separate from DynamoDB Streams). This gives you the full Kinesis ecosystem — multi-day retention, multiple consumers via Enhanced Fan-Out, integration with Managed Service for Apache Flink — at the cost of Kinesis billing on top of DynamoDB.
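
A minimal boto3 sketch of wiring that up, assuming the Kinesis stream already exists (the ARN is hypothetical):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Publish item-level changes from the table to an existing Kinesis Data Stream
    dynamodb.enable_kinesis_streaming_destination(
        TableName="orders",
        StreamArn="arn:aws:kinesis:us-east-1:123456789012:stream/orders-changes",
    )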

DynamoDB Export To S3

For point-in-time analytics rather than continuous CDC, DynamoDB supports a one-shot export to S3 in DynamoDB JSON or ION format with no impact on table performance. Pair with Glue or Athena to run analytics, then schedule daily exports for historical snapshots.
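
A sketch of the one-shot export with boto3; the table ARN, bucket, and prefix are hypothetical, and point-in-time recovery must already be enabled on the table:

    import boto3

    dynamodb = boto3.client("dynamodb")

    response = dynamodb.export_table_to_point_in_time(
        TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        S3Bucket="analytics-landing-zone",
        S3Prefix="dynamodb/orders/",
        ExportFormat="DYNAMODB_JSON",          # or ION
    )
    # The export runs asynchronously with no impact on table capacity
    print(response["ExportDescription"]["ExportStatus"])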

Zero-ETL Integration

The newer DynamoDB zero-ETL integration with OpenSearch and Redshift replicates table changes continuously without writing pipeline code. The exam guide explicitly mentions zero-ETL — DynamoDB to OpenSearch for search, DynamoDB to Redshift for analytics.

Amazon Aurora — Relational With Cloud-Native Distributed Storage

Aurora is the MySQL- and PostgreSQL-compatible managed relational database that combines familiar SQL semantics with cloud-native storage.

Architecture — Decoupled Storage And Compute

Aurora's underlying storage is a distributed, log-structured volume replicated across three Availability Zones with six copies of every page. Compute nodes (writer and up to 15 read replicas) read from the shared storage layer, so adding read replicas does not duplicate storage cost. Failover swaps a read replica into the writer role in 30 seconds or less.

Aurora MySQL vs Aurora PostgreSQL

Aurora MySQL is MySQL 8.0 compatible and runs MySQL workloads with auto-scaling storage and faster recovery. Aurora PostgreSQL is PostgreSQL 14/15/16 compatible and is the path for workloads needing PostgreSQL features (advanced types, extensions like pgvector, JSONB indexes). Pick by application compatibility — neither is universally faster.

Aurora Serverless v2

Aurora Serverless v2 auto-scales compute capacity in 0.5 ACU (Aurora Capacity Unit) increments based on load, scaling from 0.5 ACU minimum to 128 ACU. Pricing is per-ACU-second. Use case: variable-load applications, dev/test, multi-tenant SaaS where each tenant's database has unpredictable traffic. Aurora Serverless v2 supports the same features as provisioned Aurora — read replicas, Global Database, pgvector — without manual capacity management.
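
A sketch of creating a Serverless v2 cluster with boto3; the identifiers, credentials, scaling bounds, and engine version are hypothetical and will vary by Region:

    import boto3

    rds = boto3.client("rds")

    # Cluster-level Serverless v2 scaling bounds (in ACUs)
    rds.create_db_cluster(
        DBClusterIdentifier="tenant-db",
        Engine="aurora-postgresql",
        EngineVersion="15.4",
        MasterUsername="app_admin",
        MasterUserPassword="change-me-please",
        ServerlessV2ScalingConfiguration={"MinCapacity": 0.5, "MaxCapacity": 16},
    )

    # Instances that should scale with Serverless v2 use the db.serverless class
    rds.create_db_instance(
        DBInstanceIdentifier="tenant-db-writer",
        DBClusterIdentifier="tenant-db",
        Engine="aurora-postgresql",
        DBInstanceClass="db.serverless",
    )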

Aurora Global Database

Aurora Global Database replicates a primary cluster to up to five secondary regions with typically sub-second replication lag (an RPO of about one second) and cross-region failover in under a minute (RTO). Use cases: global read scaling, regional DR. Unlike DynamoDB Global Tables, which is active-active, Aurora Global Database is single-writer with read-only secondaries.

Aurora Zero-ETL Integration With Redshift

Aurora MySQL and Aurora PostgreSQL support zero-ETL integration that replicates transactional changes to Redshift continuously. The replication is managed — you do not run a Glue or DMS job. The exam plants this as the right answer for "near-real-time analytics on transactional data without ETL pipeline overhead."

Aurora vs RDS

RDS is the older managed service that runs MySQL, PostgreSQL, MariaDB, Oracle, SQL Server. Aurora is AWS-native, MySQL/PostgreSQL only, and offers better performance, faster failover, and decoupled storage. RDS is the right answer when the workload requires Oracle, SQL Server, or MariaDB; Aurora is the right answer otherwise.

Aurora is the AWS-native relational store with decoupled compute-storage architecture and is the default DEA-C01 answer for OLTP workloads needing SQL semantics; DynamoDB is the answer when the workload needs NoSQL key-value access at scale without join requirements. Aurora supports up to 15 read replicas sharing a single storage volume, Aurora Serverless v2 for variable load, and zero-ETL integration with Redshift for near-real-time analytics. The exam plants the trap by suggesting RDS for new PostgreSQL workloads — Aurora PostgreSQL is the better answer because of cloud-native storage. RDS is the right answer only when Oracle, SQL Server, or MariaDB compatibility is required.

Aurora PostgreSQL pgvector — Vector Search In SQL

The pgvector extension turns Aurora PostgreSQL into a vector database for ML retrieval workflows.

What pgvector Does

pgvector adds a vector column type to PostgreSQL and supports <-> (L2 distance), <=> (cosine distance), and <#> (inner product) operators for similarity search. Combined with PostgreSQL indexes (HNSW or IVFFlat), you can run approximate-nearest-neighbor queries directly in SQL: SELECT * FROM documents ORDER BY embedding <=> '[0.1, 0.2, ...]' LIMIT 10.
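
The same pattern sketched with psycopg2, assuming a hypothetical documents table with an embedding column and that CREATE EXTENSION vector has already been run on the cluster:

    import psycopg2

    conn = psycopg2.connect(
        host="my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",  # hypothetical
        dbname="app", user="app", password="change-me",
    )
    cur = conn.cursor()

    # HNSW index using cosine distance (pgvector also supports IVFFlat)
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw "
        "ON documents USING hnsw (embedding vector_cosine_ops);"
    )
    conn.commit()

    # Top-10 nearest documents to a query embedding (string literal cast to vector)
    query_embedding = "[0.1, 0.2, 0.3]"
    cur.execute(
        "SELECT id, title FROM documents ORDER BY embedding <=> %s::vector LIMIT 10;",
        (query_embedding,),
    )
    print(cur.fetchall())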

When pgvector Is The Right Answer

pgvector is correct when the application already runs on Aurora PostgreSQL and the vector workload is a feature of the existing data model (product table with embedding column, document table with chunk embeddings). It avoids the operational overhead of running a separate vector database. Trade-off: pgvector is not as fast as MemoryDB or as scalable as OpenSearch k-NN at very large vector counts.

When pgvector Is Wrong

For workloads with hundreds of millions of vectors, sub-millisecond latency requirements, or complex hybrid keyword-plus-vector search, OpenSearch k-NN or MemoryDB is the better choice. The exam tests this by giving scenario sizes — for "100 K vectors with cosine search alongside transactional data" pgvector wins; for "1 B vectors with sub-ms p99" MemoryDB or OpenSearch wins.

Amazon MemoryDB — Durable Redis With Vector Search

MemoryDB is the Redis-compatible, durable in-memory database that AWS positions as both a cache replacement and a low-latency vector store.

MemoryDB vs ElastiCache

ElastiCache for Redis is a non-durable cache where data may be lost on node failure. MemoryDB is durable — writes are committed to a transactional log replicated across multiple AZs before being acknowledged, so MemoryDB can serve as a primary database, not just a cache. Both are Redis-compatible; MemoryDB is the right answer when durability matters.

Vector Search With HNSW

MemoryDB supports vector indexes using the HNSW (Hierarchical Navigable Small World) algorithm. HNSW builds a multi-layer graph where each node has a few long-range edges and many short-range edges, enabling sub-millisecond ANN search at very high recall. MemoryDB vector search is the right answer for AI applications needing real-time embedding lookup at production-grade latency.
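
A heavily hedged sketch of that lookup path using redis-py raw commands; the index name, key prefix, dimension, and endpoint are hypothetical, and the FT.* syntax shown follows the Redis vector search API that MemoryDB exposes (a cluster-aware client may be required for multi-shard clusters):

    import numpy as np
    import redis

    r = redis.Redis(
        host="my-cluster.abc123.memorydb.us-east-1.amazonaws.com",  # hypothetical
        port=6379, ssl=True,
    )

    # Create an HNSW index over a FLOAT32 embedding field stored on hash keys doc:*
    r.execute_command(
        "FT.CREATE", "doc_idx", "ON", "HASH", "PREFIX", "1", "doc:",
        "SCHEMA", "embedding", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32", "DIM", "768", "DISTANCE_METRIC", "COSINE",
    )

    # Store one embedding as raw little-endian float32 bytes, then run a K=5 ANN query
    vec = np.random.rand(768).astype(np.float32).tobytes()
    r.hset("doc:1", mapping={"embedding": vec})
    results = r.execute_command(
        "FT.SEARCH", "doc_idx", "*=>[KNN 5 @embedding $qv]",
        "PARAMS", "2", "qv", vec, "DIALECT", "2",
    )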

Vector Index Types — HNSW vs IVF

The DEA-C01 exam guide explicitly mentions vector index types. HNSW is graph-based, fastest at query time, with a larger memory footprint and slower index builds. IVF (Inverted File Index) clusters vectors into Voronoi cells and searches only the nearest cells — smaller memory footprint, faster to build, lower recall than HNSW at the same query speed. The trade-off rule: HNSW for low-latency online serving with a stable corpus, IVF for memory-constrained or rapidly changing corpora.

Memorize the AWS vector store decision tree. Aurora PostgreSQL pgvector is correct when vectors live alongside transactional data in an existing PostgreSQL workload, up to ~10M vectors. MemoryDB is correct when sub-millisecond p99 latency over millions of vectors with HNSW indexes is required. OpenSearch Service k-NN is correct when hybrid keyword-plus-vector search at hundreds of millions of vectors is needed (RAG pipelines, semantic search). Bedrock Knowledge Bases is the managed RAG layer that internally chooses one of these stores. Vector index types: HNSW = graph-based, fastest queries, larger memory; IVF = clustered, smaller memory, lower recall. DEA-C01 Task 2.1.8 explicitly tests vector index types — memorize HNSW vs IVF trade-off.

Amazon OpenSearch Service — k-NN And Hybrid Vector Search

OpenSearch Service is the managed Elasticsearch fork, and its k-NN plugin enables vector search for ML retrieval.

k-NN Plugin

OpenSearch k-NN adds a knn_vector field type and supports HNSW, IVF, and Lucene-engine variants. Queries combine vector similarity with keyword filtering — match on text fields plus knn on embedding fields — making OpenSearch the canonical choice for hybrid retrieval. RAG pipelines built on Bedrock often use OpenSearch as the retrieval store.
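
A hedged sketch of a hybrid query with opensearch-py, assuming a hypothetical documents index created with index.knn enabled, a knn_vector embedding field, and authentication omitted for brevity:

    from opensearchpy import OpenSearch

    client = OpenSearch(
        hosts=[{"host": "search-mydomain.us-east-1.es.amazonaws.com", "port": 443}],  # hypothetical
        use_ssl=True,
    )

    query_embedding = [0.12, -0.03, 0.44]      # truncated; real vectors match the field dimension

    body = {
        "size": 5,
        "query": {
            "bool": {
                "must": [
                    {"match": {"text": "quarterly revenue guidance"}},            # keyword part
                    {"knn": {"embedding": {"vector": query_embedding, "k": 5}}},  # vector part
                ]
            }
        },
    }
    hits = client.search(index="documents", body=body)
    for hit in hits["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("title"))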

When OpenSearch k-NN Is Right

Hybrid keyword-plus-vector search, very large vector counts (100 M+), aggregation alongside similarity (top-K vectors filtered by category), and existing Elasticsearch/OpenSearch usage. OpenSearch is heavier operationally than MemoryDB (cluster sizing, index management) but more flexible.

OpenSearch Serverless

OpenSearch Serverless auto-scales compute and storage and bills per OCU (OpenSearch Compute Unit). Its vector search collection type is optimized for embedding workloads. The exam may surface "managed serverless vector store with hybrid search" as the trigger for the OpenSearch Serverless vector engine.

Common Exam Traps For DynamoDB, Aurora, And Vector Stores

Memorize all five.

Trap 1 — DynamoDB For Analytics

A scenario describes complex aggregation queries over a large customer table. Wrong answer: DynamoDB. Right answer: export to S3 (DynamoDB export or Streams to Kinesis to S3) and query with Athena or Redshift. DynamoDB excels at point lookups, not analytical scans.

Trap 2 — Aurora For 100K Writes Per Second

A scenario describes IoT ingestion at 100K writes per second. Aurora maxes out around 200K writes per second on the largest writer instance with significant tuning; DynamoDB scales horizontally without limit. Right answer for high-throughput key-value writes: DynamoDB.

Trap 3 — pgvector For Billion-Vector Workloads

A scenario describes a RAG pipeline over a billion document chunks. pgvector technically supports it but performance degrades. Right answer: OpenSearch k-NN or MemoryDB depending on hybrid-search requirements.

Trap 4 — DynamoDB Global Tables For Strong Consistency

A scenario asks for strongly-consistent multi-region writes. DynamoDB Global Tables is eventually consistent across regions. Right answer for strong cross-region consistency: there isn't one in DynamoDB — use Aurora Global Database with manual writer-region routing, or accept eventual consistency with conflict resolution.

Trap 5 — RDS Over Aurora For New PostgreSQL Workload

A scenario describes a new application needing PostgreSQL. The trap suggests RDS for PostgreSQL because it is "more standard." Right answer: Aurora PostgreSQL — same compatibility, faster, more durable, better failover, lower cost at scale.

DynamoDB hot partitions throttle a workload regardless of provisioned capacity if the partition key has low cardinality, and the right answer is almost never "increase capacity." The trap question describes a DynamoDB table where 90 percent of writes target a single partition key (status="active", region="us-east-1", or a date string). Adding capacity does not help because DynamoDB allocates capacity per partition. The right answer is to redesign the partition key — adding a random suffix (status#00 through status#99), using a composite key with high-cardinality data, or using write sharding with a write-side dispatcher. The DEA-C01 exam plants this as a performance question with cost-control distractors. Never pick "increase RCU/WCU" or "switch to on-demand" — pick "redesign partition key for higher cardinality."
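
A minimal sketch of the write-sharding fix, assuming a hypothetical events table keyed on pk/sk: writes spread a low-cardinality status value across 100 suffixed partition keys, and reads fan back across the same suffixes.

    import random
    import boto3
    from boto3.dynamodb.conditions import Key

    N_SHARDS = 100
    table = boto3.resource("dynamodb").Table("events")   # hypothetical table

    def put_event(status: str, event_id: str, payload: dict) -> None:
        shard = random.randrange(N_SHARDS)
        # status#00 .. status#99 spreads writes across 100 logical partitions
        table.put_item(Item={"pk": f"{status}#{shard:02d}", "sk": event_id, **payload})

    def query_all_shards(status: str):
        # Reads must aggregate across every suffix (or use a GSI shaped for the read path)
        for shard in range(N_SHARDS):
            resp = table.query(KeyConditionExpression=Key("pk").eq(f"{status}#{shard:02d}"))
            yield from resp["Items"]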

DynamoDB And Aurora In Data Engineering Pipelines

Beyond store selection, the DEA-C01 exam tests how these stores integrate with data engineering pipelines.

Pattern 1 — DynamoDB Streams To S3 Data Lake

DynamoDB → Streams → Lambda → Firehose → S3 (Parquet) → Glue Catalog → Athena. The canonical CDC pipeline that lands transactional changes into the analytics data lake without affecting the production table.

Pattern 2 — Aurora Zero-ETL To Redshift

Aurora MySQL or PostgreSQL → zero-ETL → Redshift. Continuous replication of transactional data into the analytics warehouse with no Glue jobs, no DMS replication instances. Limitation: not all data types replicate (large objects, certain custom types).

Pattern 3 — DynamoDB Export To S3 For Daily Analytics

DynamoDB → on-demand export → S3 (DynamoDB JSON) → Glue ETL to Parquet → Athena/Redshift. The right pattern when continuous CDC is overkill and a daily snapshot suffices.

Pattern 4 — Aurora Read Replica For Reporting

Aurora writer → Aurora read replica → BI tool (QuickSight). Use a read replica as the reporting endpoint to avoid impacting transactional latency. Trade-off: replica lag exists, queries see slightly stale data.

Pattern 5 — Vector Pipeline For RAG

S3 documents → Glue/Lambda chunking → Bedrock Titan embeddings → OpenSearch k-NN or pgvector → application retrieval at query time. The DEA-C01 angle is the data engineering plumbing, not the LLM inference (which is MLA-C01 territory).

Plain-Language Explanation: DynamoDB, Aurora, And Vector Stores

Three concrete analogies to make these stores intuitive.

Analogy 1 — The Restaurant That Decides Counter Vs Dining Room Vs Sommelier

DynamoDB is the fast-food counter. You walk up, say "burger combo number three, partition key 47, sort key 02," the counter staff knows exactly which slot in which warming rack to grab from, and you have your food in seven seconds. Ten thousand customers a minute? Add more registers; each register handles its own queue independently. But the counter cannot make you a multi-course tasting menu. Aurora is the sit-down dining room — slower per visit, but the kitchen can prepare a five-course meal with wine pairings, dietary restrictions, and table-side flambé. The dining room scales by adding tables (read replicas) but ultimately the head chef (writer instance) is one person. The vector stores are the sommelier: you describe the wine you want with vague language ("something earthy, low tannin, similar to that pinot from last spring"), and the sommelier returns the closest five matches from the cellar — pgvector is a sommelier who happens to also work the dining room, MemoryDB is a sommelier with the cellar mapped in their head and zero hesitation, OpenSearch k-NN is a sommelier with a card catalog who can also filter by region, vintage, and price simultaneously. The DEA-C01 trap is sending a five-course party to the fast-food counter (DynamoDB for analytics) or asking the dining room to handle a thousand orders a second (Aurora for IoT ingestion).

Analogy 2 — The Library With Stacks, Reading Rooms, And Concierge

DynamoDB is the library's call-number lookup system: you give a Dewey-Decimal number, the librarian fetches that book in five seconds. Lightning fast for known-key retrieval, useless for "find me books similar to Walden in mood." Aurora is the reading room with topic-organized stacks, where the librarian can run complex cross-referenced queries — find every book published by a Boston press between 1850 and 1880 that cites Emerson — but it takes a few minutes and the librarian does the work. Vector stores are the concierge with embeddings: you bring a sample paragraph and ask "what else reads like this?" and the concierge returns five matching books based on semantic similarity, not keyword overlap. pgvector is the concierge who also moonlights as the reading-room librarian (one person, two jobs, fine for a small library). MemoryDB is the concierge with an eidetic memory — every book embedding cached in their brain, instant recall. OpenSearch k-NN is the concierge with both an embedding card and a topic taxonomy, returning "books semantically similar to this paragraph that are also tagged 19th-century-American-literature." DynamoDB Streams is the bell that rings every time a book is checked in or out, so a clerk in another room can update the inventory log.

Analogy 3 — The Postal System With Express, Bulk Mail, And Tracking Drones

DynamoDB is the express courier with global addressing: every package has a unique tracking number (partition key), the courier hashes the number to a depot, and the package arrives in milliseconds — but a million packages a second works only if no single depot is overloaded (hot partition trap). Aurora is the regional postal hub with branch offices (read replicas) sharing one warehouse (the storage volume) — the hub processes complex routing logic (joins, transactions, foreign keys) and scales by adding branches, but only the hub can issue dispatch orders (writes go to the writer instance). Aurora Serverless v2 is the postal hub with rented loading docks that appear and disappear based on demand. DynamoDB Streams is the conveyor belt that captures every package movement and feeds it to downstream systems — Lambda routers that fan out updates to inventory dashboards, S3 data lakes, OpenSearch indexes. Aurora zero-ETL to Redshift is the dedicated freight train that copies every transaction directly to the analytics warehouse without manual loading. The vector stores are the package-similarity search — you describe a package's contents in fuzzy terms ("electronic device about this size, looks expensive") and the system returns the K most-similar packages it has seen, pgvector being the search built into the postal hub itself, MemoryDB being a separate hyper-fast lookup desk, OpenSearch being a search desk that also filters by ZIP code, weight class, and declared value at the same time.

Key Numbers And Must-Memorize Facts

DynamoDB

  • Single-digit millisecond P99 latency at any scale
  • 400 KB max item size
  • 2048-byte partition key limit (sort keys are limited to 1024 bytes)
  • 24-hour DynamoDB Streams retention
  • On-demand: per-request billing, no capacity planning
  • Provisioned: per-second RCU/WCU, ~5x cheaper at sustained load
  • Global Tables: eventually consistent multi-region, last-writer-wins
  • Hot partition: ~3000 RCU / 1000 WCU per partition limit

Aurora

  • Up to 15 read replicas sharing one storage volume
  • Storage auto-scales to 128 TiB
  • 6-way replication across 3 AZs
  • 30-second automated failover
  • Aurora Serverless v2: 0.5 ACU minimum, 128 ACU maximum, 0.5 ACU increments
  • Aurora Global Database: <1 second cross-region replication lag (RPO ~1 second), failover RTO under 1 minute
  • Zero-ETL to Redshift: continuous, managed

Vector Stores

  • pgvector: HNSW or IVFFlat indexes, up to ~10M vectors practical
  • MemoryDB vector search: HNSW only, sub-millisecond p99
  • OpenSearch k-NN: HNSW + IVF + Lucene, billions of vectors, hybrid search
  • HNSW: graph-based, fast queries, larger memory, slower build
  • IVF: clustered, smaller memory, faster build, lower recall

Use DynamoDB export to S3 for periodic analytics; use DynamoDB Streams for continuous CDC; use zero-ETL to OpenSearch or Redshift for managed continuous replication. The DEA-C01 exam tests this decision tree directly. Pattern: "the team needs daily reporting on yesterday's transactions" => DynamoDB export to S3 + Glue + Athena. "The team needs sub-second updates to a search index" => DynamoDB Streams + Lambda + OpenSearch (or zero-ETL to OpenSearch). "The team needs near-real-time analytics in Redshift without ETL code" => zero-ETL to Redshift. Never pick "scan the DynamoDB table from Athena via federated query" for production analytics — the per-query latency and DynamoDB read cost are wrong. Always export first or stream changes; query downstream.

DEA-C01 exam priority — DynamoDB, Aurora, and Vector Stores. This topic carries weight on the DEA-C01 exam. Master the trade-offs, decision boundaries, and the cost/performance triggers each AWS service exposes — the exam will test scenarios that hinge on knowing which service is the wrong answer, not just which is right.

FAQ — DynamoDB, Aurora, And Vector Stores Top Questions

Q1 — When should I choose DynamoDB versus Aurora for a new application?

Choose DynamoDB when access patterns are key-value or simple key-and-range, throughput is high or unpredictable, sub-10ms latency is required, and the application does not need joins or transactions across many tables. Choose Aurora when the application needs SQL semantics, multi-table joins, foreign-key relationships, complex transactions, or a relational schema that fits a hierarchy of business entities. The decision rule: if you can express the workload as GET item by key or query items by partition key + range condition 90 percent of the time, DynamoDB is right. If the workload routinely joins three or more tables in a single query, Aurora is right. The DEA-C01 exam plants this with scenarios that describe access patterns explicitly — read carefully for "key-based lookup" (DynamoDB) versus "multi-table join" (Aurora).

Q2 — How do I run analytics on DynamoDB without affecting the production table?

Two patterns: continuous and periodic. Continuous: enable DynamoDB Streams, attach a Lambda or use the DynamoDB-to-Kinesis option, write changes to Firehose to S3 in Parquet, register the table in the Glue Data Catalog, query with Athena or load into Redshift. The pipeline runs in near-real-time and never touches the production table. Periodic: use DynamoDB on-demand export to S3 for a point-in-time snapshot in DynamoDB JSON, transform to Parquet with Glue, query with Athena. The export does not consume table read capacity. For both patterns, do NOT run analytical queries against the live DynamoDB table — that consumes RCU, can throttle production reads, and is expensive at scale. The DEA-C01 exam plants this as a "production OLTP plus analytics" scenario with the wrong-answer "use Athena federated query against DynamoDB" — that works for ad hoc but is wrong for production analytics.

Q3 — When should I use Aurora Serverless v2 versus provisioned Aurora?

Use Aurora Serverless v2 when the workload is variable or unpredictable — dev/test environments, multi-tenant SaaS where each tenant database has unpredictable load, applications with burst patterns. Use provisioned Aurora when the workload is steady and predictable — known production traffic with stable peaks. The trade-off is cost: Serverless v2 bills per ACU-second (scaling in 0.5 ACU increments) and is roughly 1.5x to 2x the equivalent provisioned cost at steady load, but saves money during idle periods because it scales down to the 0.5 ACU minimum. For a 24/7 production OLTP at fixed RPS, provisioned wins on cost; for a development environment used 8 hours per day, Serverless v2 wins. Aurora Serverless v2 supports the same features as provisioned (read replicas, Global Database, pgvector) so capability is not the deciding factor.

Q4 — How do I choose between pgvector, MemoryDB vector search, and OpenSearch k-NN?

Decision tree by workload size and integration: pgvector is right when vectors live alongside relational data in an existing PostgreSQL application, vector count is under ~10M, and operational simplicity matters (one database, one connection pool). MemoryDB vector search is right when sub-millisecond p99 latency is required, vector count is in the tens of millions, and HNSW recall trade-offs are acceptable. OpenSearch k-NN is right when vector count exceeds 100M, hybrid keyword-plus-vector search is needed (RAG with metadata filtering), or the team already runs OpenSearch for log analytics. For RAG pipelines specifically, OpenSearch is the canonical choice because it supports the "filter by document type, then rank by semantic similarity" pattern in a single query. Vector index types: HNSW for fastest queries with stable corpus, IVF for memory-constrained or rapidly-changing corpora.

Q5 — Is DynamoDB Global Tables strongly consistent across regions?

No. DynamoDB Global Tables replicates writes asynchronously across regions and uses last-writer-wins conflict resolution based on a timestamp attribute. Within a single region, DynamoDB supports strong consistency on reads if you request it (ConsistentRead=true); across regions, the model is eventually consistent and you must design the application to tolerate it. For workloads that genuinely need strongly-consistent multi-region writes, the answer is not DynamoDB — consider Aurora Global Database with all writes routed to the primary region, accepting the cross-region write latency. The DEA-C01 exam plants this trap by describing a "globally distributed user base needing strong consistency" — the right answer is to question the consistency requirement (most globally distributed apps actually tolerate eventual consistency with conflict resolution) or to use Aurora Global Database with regional read replicas and primary-region writes.

Q6 — How does DynamoDB zero-ETL integration with OpenSearch work?

DynamoDB zero-ETL to OpenSearch creates a managed replication that syncs DynamoDB table changes to an OpenSearch index continuously without pipeline code. AWS handles the transformation, error retries, and backfill from the existing table. You configure the integration in the DynamoDB console, point at an OpenSearch destination, and let AWS keep them in sync. Use cases: full-text search over DynamoDB items, dashboards that need OpenSearch aggregations on transactional data, AI applications that pair DynamoDB key lookups with OpenSearch vector search. The integration is similar in spirit to Aurora zero-ETL to Redshift — both eliminate the Glue/DMS pipeline overhead. The DEA-C01 exam may surface this as the right answer for "near-real-time search index over DynamoDB without writing or managing pipeline code."

Q7 — What is a hot partition in DynamoDB and how do I avoid it?

A hot partition is when a disproportionate share of reads or writes targets one DynamoDB partition (defined by the hashed partition key value), exceeding the per-partition throughput limit (~3000 RCU / 1000 WCU). The result is throttled requests even when the table-level provisioned capacity is far from exhausted. Causes: low-cardinality partition keys (status, region, date string), single popular item read by every user, time-series data keyed by current date. Avoidance: choose high-cardinality partition keys (user_id, device_id, order_id), add a random suffix to common keys (status#42 where 42 is a random 0-99 bucket), use composite keys that distribute load (user_id + timestamp instead of timestamp alone). The DEA-C01 exam plants this with a performance question — wrong answers suggest scaling capacity; the right answer redesigns the partition key.

Further Reading — Official AWS Documentation

The authoritative AWS sources are the DynamoDB Developer Guide (data modeling, capacity modes, Streams, Global Tables, export to S3), the Aurora User Guide (architecture, Aurora Serverless v2, Global Database, zero-ETL), the Aurora PostgreSQL pgvector documentation (vector type, HNSW and IVFFlat indexes, similarity operators), the MemoryDB Vector Search guide (HNSW configuration, vector type, FT.SEARCH commands), the OpenSearch Service k-NN documentation (knn_vector field, HNSW vs IVF, hybrid queries), and the AWS Big Data Blog series on zero-ETL integrations. The AWS Well-Architected Data Analytics Lens covers DynamoDB and Aurora in the data store selection guidance. The Skill Builder DEA-C01 Exam Prep Standard Course has dedicated modules for Domain 2 covering data store selection and vector index types — review them before sitting the exam.
