Production Generative AI on AWS
Copyright & Disclaimer
Production Generative AI on AWS — A Field Guide for the Developer–Professional Exam
Edition 1.0, Demo (Chapters 1 and 2 only), published 2026.
Copyright © 2026 Pongo Tech OÜ. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law.
Published by MineCloudCraft Press, an imprint of Pongo Tech OÜ.
Independent Publication
This is an independent publication and is not affiliated with, endorsed by, sponsored by, or otherwise authorized by Amazon Web Services, Inc., Amazon.com, Inc., or any of their subsidiaries or affiliates. AWS®, Amazon Web Services®, Amazon Bedrock™, Amazon SageMaker™, and all other AWS service names, marks, and logos are trademarks of Amazon.com, Inc. or its affiliates. References to these marks in this guide are made for educational and informational purposes only.
The AWS Certified Generative AI Developer — Professional examination (AIP-C01) is administered by Amazon Web Services. This guide is a study aid prepared independently and does not represent official certification material.
No Warranty
The information in this guide is provided “as is” and without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, or non-infringement. AWS services, pricing, regional availability, and feature sets change frequently — always verify current details against official AWS documentation before making implementation or examination decisions. Neither the authors nor the publisher shall be liable for any loss of profit, business interruption, or any other commercial damages arising from use of this material.
Trademarks
All trademarks referenced in this work are the property of their respective owners. Use of a trademark in this guide does not imply endorsement of the guide by the trademark holder.
How to Use This Guide
The AWS Certified Generative AI Developer — Professional exam (AIP-C01) is among the first professional-level certifications focused entirely on building production GenAI systems on AWS. This is not a survey course in machine learning, nor a prompt-engineering quiz. The exam tests your ability to make architectural and operational decisions under realistic constraints — latency budgets, compliance regimes, cost ceilings, accuracy thresholds, and vendor trade-offs.
This guide is structured to mirror that decision-making muscle. Every chapter follows the same rhythm: concept first, then services, then decision framework, then practice. You will not find encyclopedic dumps of every AWS feature. You will find the features the exam actually probes, organized so you can recall them under time pressure. Along the way you’ll meet a small set of recurring visual primitives — mental-model figures, service spotlight cards, comparison tables, decision lists, and code-compare panes — that earn their place by carrying the highest-leverage decisions of each chapter.
Who this guide is for
AIP-C01 is a professional-level certification. The content of this book is correspondingly intermediate to advanced and assumes you have already built or shipped at least one cloud application on AWS — you have used IAM, VPCs, S3, Lambda, CloudWatch, and at least one compute or container service in anger — and that you have hands-on familiarity with at least one large language model and a basic understanding of how vector retrieval and embeddings work.
If you are entirely new to AWS or to generative AI, start with the AWS Cloud Practitioner and AI Practitioner certifications first; this guide will move faster than is comfortable for absolute beginners. We will not stop to explain what an S3 bucket is, what an IAM role looks like, what a VPC endpoint does, or what a foundation model is — the exam doesn’t and neither do we.
You are likely:
- A backend or platform engineer with several years of AWS experience, now integrating foundation models into production systems.
- An ML engineer with classical-ML background expanding into generative workloads.
- A solutions architect needing to defend GenAI design choices to security, finance, and product stakeholders.
- A consultant who needs to walk into client conversations with current, exam-grade fluency in the AWS GenAI stack.
Treat anything that lands as “new to me, but I’ve seen the building blocks” as the target reading level. If a chapter feels like it’s assuming things you’ve never touched (a VPC endpoint policy, a SageMaker endpoint, a Lambda execution role), pause — build the missing block in the console for half an hour — then come back. The exam is hands-on by design; the book is hands-on by design.
How to read this book
Each chapter is self-contained and can be read in isolation, but the parts build on each other. Domain 1 establishes vocabulary. Domain 2 covers integration patterns that the rest of the book assumes. Domains 3, 4, and 5 each focus on a non-functional concern: safety, efficiency, and quality. The pattern within each chapter is consistent:
Three callout types appear throughout. Read them — they carry the highest density of exam-relevant material:
A nine-week study plan
The plan below assumes about 10–12 hours per week of focused study. If you have more, accelerate; if you have less, stretch the schedule and prioritize Weeks 1, 4, and 9.
| Week | Focus | Outcome |
|---|---|---|
| 1 | Read Part I (Domain 1, Chapters 1–6) — foundation models, RAG, vector stores, prompts | Architect a RAG solution on paper |
| 2 | Hands-on lab: Bedrock + Knowledge Bases + OpenSearch Serverless | End-to-end RAG demo working |
| 3 | Read Part II (Domain 2, Chapters 7–11) — agents, deployment, FM APIs, MLOps | Build a working Bedrock Agent with tool calls |
| 4 | Read Part III (Domain 3, Chapters 12–15) — guardrails, encryption, governance, responsible AI | Configure Bedrock Guardrails + CloudTrail logging |
| 5 | Read Part IV (Domain 4, Chapters 16–18) — cost, latency, monitoring | Build a CloudWatch dashboard for an LLM workload |
| 6 | Read Part V (Domain 5, Chapters 19–20) — evaluation, troubleshooting | Implement an LLM-as-a-judge eval pipeline |
| 7 | Walk Back Matter A (Exam Strategy) + B (Glossary, spaced repetition) | Internalize the five-pass MCQ procedure |
| 8 | Practice exams · review missed questions · re-read flagged callouts | Score ≥ 80% on practice consistently |
| 9 | Final review · Cheat Sheets (Back Matter C) · book the exam | Pass on first attempt |
What this guide is not
This is not an AWS service catalog, not a Python tutorial, and not a substitute for hands-on practice. Where source material includes long code listings, we have summarized the conceptual takeaway and pointed you to AWS documentation for current SDK syntax. Treat the official AWS documentation as canonical for any code you intend to ship. The Back Matter sections (Exam Strategy, Glossary, Cheat Sheets) are where the book’s decisions distill into something you can re-read in the fifteen minutes before the test — budget time for them.
A note from the author
I wrote this guide for myself first. I was preparing for the AIP-C01 exam in early 2026 and could not find a study resource that combined the depth I needed with the structure my brain wanted — concepts, then services, then decision frameworks, then drills, with no padding. So I built one. I sat the exam, I passed, and the material in your hands is the same material I used to get there.
This is not an official AWS publication, and it is not a replacement for hands-on practice or for the AWS documentation. It is an opinionated, decision-oriented study guide — a real candidate’s playbook rather than a marketing document. If it helped me pass, I’m confident it can help you pass too.
Good luck. Now: open Part I, brew something hot, and let’s begin.
The AIP-C01 Exam, In One Chapter
The AWS Certified Generative AI Developer — Professional exam (code: AIP-C01) is a 180-minute, scenario-driven examination consisting of 100 questions — 85 scored and 15 unscored. You won’t be told which are which; treat them all as scored. It’s delivered through Pearson VUE testing centers and as an online proctored exam, and it carries a recommended prerequisite of two or more years of hands-on experience designing and operating GenAI workloads on AWS.
You do not need to memorize service quotas or current pricing. You do need to be able to look at a multi-paragraph scenario — complete with constraints, red herrings, and competing priorities — and pick the architecture that satisfies the requirements at the lowest reasonable cost and operational burden.
Domain weightings
Five domains are weighted as follows. The percentages are official AWS guidance; treat them as your study budget allocator.
| # | Domain | Weight | Tasks |
|---|---|---|---|
| 1 | Foundation Model Integration & Data Management | 27% | 1.1 – 1.6 |
| 2 | Implementation & Integration | 26% | 2.1 – 2.5 |
| 3 | Security, Compliance & Governance | 20% | 3.1 – 3.4 |
| 4 | Operational Efficiency & Optimization | 14% | 4.1 – 4.3 |
| 5 | Evaluation & Troubleshooting | 13% | 5.1 – 5.2 |
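One way to use the weights as a study budget is to split your available hours in proportion to them. The sketch below does that; the function name and the 99-hour total (nine weeks at roughly eleven hours each) are illustrative assumptions, not AWS guidance.

```python
# Domain weights as listed in the table above (percent).
DOMAIN_WEIGHTS = {
    "Foundation Model Integration & Data Management": 27,
    "Implementation & Integration": 26,
    "Security, Compliance & Governance": 20,
    "Operational Efficiency & Optimization": 14,
    "Evaluation & Troubleshooting": 13,
}


def allocate_study_hours(total_hours: float) -> dict:
    """Split total_hours across domains in proportion to their weight."""
    total_weight = sum(DOMAIN_WEIGHTS.values())
    return {
        domain: round(total_hours * weight / total_weight, 1)
        for domain, weight in DOMAIN_WEIGHTS.items()
    }


if __name__ == "__main__":
    # Hypothetical budget: 9 weeks at roughly 11 hours per week.
    for domain, hours in allocate_study_hours(99).items():
        print(f"{hours:5.1f} h  {domain}")
```

Treat the result as a starting point and shift hours toward whichever domain your practice scores flag as weakest.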
Question formats
The exam uses three question formats. Knowing the difference matters because the marking rules differ.
- Multiple choice — one stem, four options, exactly one correct answer. The most common format.
- Multiple response — one stem, five or more options, two or three correct answers. Partial credit is not awarded; you must select the exact correct subset.
- Scenario / case study — a long preamble (architecture diagram, customer requirements, constraints) followed by 2–4 dependent questions. Read the preamble carefully before starting; the same setup feeds multiple questions.
Passing standard
The reported passing score is approximately 750 / 1000, but AWS uses a scaled scoring model with statistical equating. Aim for ≥ 80% on practice exams to be comfortable on test day. Score reports break results down by domain (“Meets Competencies / Needs Improvement”); use those to direct your final review week.
Time budget
180 minutes for 100 questions = ~1:48 per question average. In practice, scenario questions consume 3–5 minutes each, while plain multiple-choice items can be answered in under a minute. Use the in-exam Flag for Review feature liberally; first-pass anything you cannot answer in 90 seconds, finish the easy questions, and circle back. Back Matter A (Exam Strategy) walks the full five-pass pacing plan in detail.
What the exam loves to test
Across all five domains, expect heavy emphasis on these decision pivots. These are the seams where two services overlap and the “right” answer depends on a single qualifier in the question:
- Bedrock vs. SageMaker JumpStart vs. self-hosted — managed convenience vs. customization vs. control.
- RAG vs. fine-tuning vs. continued pre-training — data freshness, knowledge depth, cost.
- Amazon Bedrock Knowledge Bases vs. Kendra vs. raw OpenSearch — managed vs. document ACLs vs. flexibility.
- On-demand vs. provisioned throughput vs. batch inference — latency, cost, predictability.
- Bedrock Agents vs. custom orchestration with the Converse API — managed vs. flexible.
- Bedrock Guardrails vs. application-level filters vs. Amazon Comprehend — safety surface area.
Each of these pivots gets its own decision framework in the chapters that follow.
Foundation Model
Integration & Data Management
Domain 1 of the AIP-C01 covers the end-to-end lifecycle of a foundation-model workload — from translating a business problem into an architecture, through choosing and configuring the model, building data pipelines, indexing into vector stores, retrieving relevant context, and governing the prompts that drive it all. This is where every GenAI application begins.
Analyze Requirements & Design GenAI Solutions
1.1 · Is this even a GenAI problem?
The most expensive mistake in generative AI is using a foundation model where a regular expression would do. Before you reach for Bedrock, work through a four-question checklist. If you cannot answer yes to at least three, your problem belongs to traditional ML, classic search, or simple business logic. The exam punishes over-engineering.
- Does the task require language understanding, generation, or transformation? — summarization, drafting, translation, intent extraction. If the answer is “classification with structured features,” reach for Amazon SageMaker or Amazon Comprehend instead.
- Is the input variable, unstructured, or open-ended? — free-form support tickets, PDFs, conversational queries. Foundation models excel at variability.
- Can the system tolerate probabilistic output? — if every response must be 100% accurate (think: tax calculations, medical dosages), you need a deterministic system underneath, with the LLM only as an orchestrator.
- Is sufficient context obtainable? — either through prompts, retrieval, or fine-tuning. If the answer lives in a database the model cannot reach, no amount of prompting will help.
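The checklist above can be expressed as a small gate function. This is a minimal sketch: the class and field names are hypothetical, and the threshold of three comes from the rule stated at the start of this section.

```python
from dataclasses import dataclass


@dataclass
class FitCheck:
    needs_language_capability: bool       # understanding, generation, or transformation
    input_is_unstructured: bool           # variable, free-form, open-ended input
    tolerates_probabilistic_output: bool  # occasional wrong answers are acceptable
    context_is_obtainable: bool           # via prompts, retrieval, or fine-tuning


def is_genai_fit(check: FitCheck, min_yes: int = 3) -> bool:
    """Return True when at least min_yes of the four answers are yes."""
    answers = [
        check.needs_language_capability,
        check.input_is_unstructured,
        check.tolerates_probabilistic_output,
        check.context_is_obtainable,
    ]
    return sum(answers) >= min_yes


# Free-form ticket summarization where the needed context is out of reach: still three yeses.
print(is_genai_fit(FitCheck(True, True, True, False)))
```

A tax calculator with structured inputs would typically answer no to the first three questions, which is the signal to stay with deterministic logic.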
Common GenAI use case categories
The AIP-C01 exam organizes GenAI use cases into five categories. Memorize the canonical AWS service for each — questions often hide the use case in a verb (“summarize”, “classify”, “generate”) rather than name it directly.
| Category | Examples | Default service |
|---|---|---|
| Text generation | Email drafting, content creation, conversational assistants | Amazon Bedrock (Anthropic Claude, Meta Llama, Amazon Nova) |
| Code generation | Code completion, refactoring, test generation | Amazon Q Developer (Bedrock under the hood) |
| Summarization & extraction | Long-document summaries, structured field extraction | Bedrock + structured output (tool use / JSON mode) |
| Image & multimodal | Image creation, visual Q&A, document understanding | Bedrock (Stable Diffusion, Amazon Titan Image, Nova Canvas) |
| Search & knowledge | Natural-language Q&A over private data | Bedrock Knowledge Bases · Amazon Kendra · custom RAG on OpenSearch |
1.2 · Functional & non-functional requirements
Once you have decided GenAI is appropriate, the requirements analysis you would do for any system splits into two buckets: functional and non-functional. The exam is unusually fond of latency / cost / compliance trade-offs, so each row in the non-functional table below is a likely question seam.
Functional requirements
What the solution must accomplish. For foundation-model workloads this includes: input modalities (text, images, audio, documents); expected output format and grounding requirements; required domain knowledge; integration points; and the interaction pattern (synchronous chat, streaming, batch). One requirement matters most: does the app need multi-turn conversational state? If yes, you need session management. If no, a stateless InvokeModel call works.
Non-functional requirements
Latency, throughput, cost, availability, security, and compliance. These usually determine the architecture rather than influence it.
| Concern | Diagnostic question | AWS lever |
|---|---|---|
| Latency | Is end-user response time under 2 s critical? | Smaller model · Bedrock provisioned throughput · response streaming |
| Throughput | Sustained TPS at peak? | Provisioned throughput · SageMaker auto-scaling · async inference |
| Cost | Cost per inference / monthly ceiling? | Model right-sizing · Bedrock batch · prompt & semantic caching |
| Availability | RTO / RPO requirements? | Multi-region deployment · Bedrock cross-region inference |
| Security | Data classification & access control? | IAM · AWS KMS · VPC endpoints · Bedrock Guardrails |
| Compliance | HIPAA / PCI / GDPR / SOC 2? | Regional deployment · CloudTrail · AWS Config · data residency |
| Data volume | How much context / how much training data? | S3 · OpenSearch · Kendra · SageMaker training |
| Accuracy | Quality threshold & tolerance for errors? | Bedrock Guardrails · human-in-loop · automated evaluation |
1.3 · The five canonical AWS GenAI patterns
Almost every architecture the exam asks you to design is a composition of these five patterns. Memorize their trigger conditions; the exam phrases scenarios specifically so that exactly one pattern fits.
Walk down the tree from the top. Stop at the first “yes.” The exam writes scenarios so that exactly one branch fits cleanly.
Pattern 1 · Direct API integration
The simplest pattern: your application calls InvokeModel or Converse on Amazon Bedrock, the foundation model returns a response, and your code post-processes it. Conversation state lives in your application layer (DynamoDB, ElastiCache, or memory). Compute is typically AWS Lambda or a container on Amazon ECS / EKS.
Use this pattern when the model’s training data is sufficient knowledge for the task — summarization, translation, brainstorming, generic Q&A, classification of free-text input. The moment you need your data in the response, you graduate to RAG.
InvokeModel vs. Converse: the same call, two eras. The older InvokeModel API requires a model-specific JSON body. Anthropic, Meta, and Cohere expect different shapes, so swapping models means rewriting the request. The newer Converse API normalizes that: one request shape, one response shape, across the Bedrock-hosted chat models that support it. Use Converse for any new conversational work and treat InvokeModel as legacy there; embedding and image models are still called through InvokeModel.
```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Anthropic-specific body shape (different for Llama, Cohere, Titan, etc.)
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [
        {"role": "user", "content": "Summarize CRISPR in 3 lines."}
    ],
}

resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps(body),
    contentType="application/json",
)

# Parse model-specific response shape
data = json.loads(resp["body"].read())
print(data["content"][0]["text"])
```
```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Same shape works for Claude, Llama, Nova,
# Cohere, Mistral: swap modelId only.
resp = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[
        {"role": "user",
         "content": [{"text": "Summarize CRISPR in 3 lines."}]}
    ],
    inferenceConfig={"maxTokens": 512},
)

# Uniform response shape
text = resp["output"]["message"]["content"][0]["text"]
print(text)

# resp["stopReason"] tells you why generation ended
# resp["usage"] gives input/output token counts
```
Converse also covers streaming (converse_stream) and tool use without per-provider hacks.
Converse API
Provider-agnostic API that accepts multi-turn message history, tool use, and streaming in one uniform shape across Bedrock-hosted models. The history itself is supplied by the caller on every request. Replaces the older InvokeModel for all conversational workloads.
Use when
You want a single integration that works across Anthropic, Meta, Cohere, Mistral, and Amazon Nova models without rewriting your client code per provider.
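Because Converse is stateless, the application has to resend the full message history on every call. The sketch below shows one way to hold that history in memory; the `Conversation` class is a hypothetical helper, and `client` is assumed to be a `boto3.client("bedrock-runtime")` object or anything else exposing a compatible `converse` method.

```python
class Conversation:
    """Keeps the message history that each Converse call needs to receive."""

    def __init__(self, client, model_id: str, system_prompt: str):
        self.client = client
        self.model_id = model_id
        self.system = [{"text": system_prompt}]
        self.messages = []

    def ask(self, user_text: str) -> str:
        # Append the user turn, then send the whole history.
        self.messages.append({"role": "user", "content": [{"text": user_text}]})
        resp = self.client.converse(
            modelId=self.model_id,
            system=self.system,
            messages=self.messages,
            inferenceConfig={"maxTokens": 512},
        )
        reply = resp["output"]["message"]
        # Keep the assistant turn so the next call sees it.
        self.messages.append(reply)
        return reply["content"][0]["text"]
```

In a Lambda-based deployment the `messages` list would more likely be loaded from and saved to DynamoDB or ElastiCache per session, since in-process memory does not survive across invocations.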
Pattern 2 · Retrieval-Augmented Generation (RAG)
RAG is the dominant enterprise pattern. The workflow: embed the user query, search a vector store for relevant chunks, splice the retrieved text into the prompt, then call the foundation model. The model now “knows” about your private data without ever being trained on it.
RAG is the right answer when the source-of-truth changes more frequently than you can retrain a model (most enterprise data), when you must cite sources, or when the corpus is too large to fit in any context window. Two implementation paths:
- Managed RAG — Amazon Bedrock Knowledge Bases handles ingestion, chunking, embedding, and retrieval. Lowest operational overhead. Choose this unless you need something it does not do.
- Custom RAG — you build the pipeline yourself using Amazon OpenSearch Service (or Aurora pgvector), Bedrock embedding models, and your own orchestration. Choose this when you need fine control over chunking, hybrid search, custom metadata filtering, or non-AWS vector databases.
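The retrieve-then-splice workflow can be sketched end to end without any AWS service. In the toy example below, a bag-of-words counter stands in for an embedding model and a Python list stands in for the vector store; both are assumptions made so the example runs locally, and a real pipeline would use a Bedrock embedding model and a vector index instead.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list) -> str:
    """Splice retrieved chunks into the prompt ahead of the question."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only these sources and cite them by number.\n"
        f"{sources}\n\nQuestion: {query}"
    )


CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "Premium plans include 24/7 phone support.",
    "Our office is closed on public holidays.",
]
top = retrieve("How long do refunds take?", CHUNKS, k=1)
prompt = build_prompt("How long do refunds take?", top)
print(prompt)
```

The resulting `prompt` string is what would be sent to the foundation model as the user message. Numbering the sources is one simple way to support the citation requirement mentioned above.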
Pattern 3 · Agents and tool use
An agent extends a foundation model with the ability to take actions. The model decides which function to call, supplies arguments, the function executes (a Lambda function, an API call, a database query), the result is fed back, and the agent decides what to do next. This loop continues until the agent has enough information to respond.
Amazon Bedrock Agents wraps this with managed orchestration: you define an instruction, register action groups (Lambda functions or OpenAPI schemas), optionally attach knowledge bases, and Bedrock handles the reasoning loop. For more flexibility, build a custom agent on top of the Bedrock Converse API’s tool-use feature.
Amazon Bedrock Agents
Managed orchestration of multi-step tool-using workflows. Handles prompt construction, tool selection, parameter inference, and conversation state. Integrates natively with Lambda action groups and Knowledge Bases.
Use when
The task requires multi-step actions across systems (lookup → decide → act → confirm) and you do not want to hand-roll the orchestration loop in your application code.
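For comparison, the hand-rolled alternative is a loop around Converse tool use. The sketch below follows the Converse request and response shapes (`toolConfig`, `toolUse`, `toolResult`, and a `stopReason` of `tool_use`); the `run_agent` helper, the `get_order` tool, and its schema are hypothetical, and `client` is assumed to behave like a `bedrock-runtime` client.

```python
# Hypothetical tool definition in the Converse toolConfig shape.
TOOL_CONFIG = {
    "tools": [{
        "toolSpec": {
            "name": "get_order",
            "description": "Look up an order by ID.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            }},
        }
    }]
}


def run_agent(client, model_id, user_text, tools, tool_config=TOOL_CONFIG, max_turns=5):
    """Loop until the model stops requesting tools or max_turns is reached.

    tools maps a tool name to a callable that takes the tool input dict
    and returns a JSON-serializable dict.
    """
    messages = [{"role": "user", "content": [{"text": user_text}]}]
    for _ in range(max_turns):
        resp = client.converse(modelId=model_id, messages=messages, toolConfig=tool_config)
        reply = resp["output"]["message"]
        messages.append(reply)
        if resp["stopReason"] != "tool_use":
            # Final answer: return the first text block, if any.
            return next((b["text"] for b in reply["content"] if "text" in b), "")
        # Run every requested tool and feed the results back as a user turn.
        results = []
        for block in reply["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                output = tools[call["name"]](call["input"])
                results.append({"toolResult": {
                    "toolUseId": call["toolUseId"],
                    "content": [{"json": output}],
                }})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_turns")
```

The `max_turns` cap is a design choice worth keeping in any custom loop, because a model that keeps requesting tools would otherwise run (and bill) indefinitely.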
Pattern 4 · Fine-tuning & custom models
Fine-tuning adapts a foundation model’s weights using your data. Reach for it when prompting and RAG cannot achieve the consistency you need — specialized output format, domain vocabulary, or brand voice. Bedrock supports continued pre-training and fine-tuning for select models; SageMaker JumpStart offers more model choice and full training control.
Pattern 5 · Multi-model & ensemble
A small classifier routes traffic to a large generation model only when needed; an embedding model handles search, while a generation model handles synthesis; multiple models propose answers and a judge selects the best. Bedrock’s unified API makes this trivial to compose — you can call Claude, Titan, and Stable Diffusion from one application without managing three SDKs.
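The routing variant can be sketched as a classifier that picks a model tier before any generation call is made. Everything named below is an assumption for illustration: the model IDs are placeholders, and the keyword heuristic stands in for what would normally be a small classifier model.

```python
# Placeholder model IDs; substitute the IDs enabled in your account.
MODEL_BY_TIER = {
    "simple": "example.small-model-v1",
    "complex": "example.large-model-v1",
}

COMPLEX_HINTS = ("analyze", "compare", "step by step", "explain why")


def classify_complexity(prompt: str) -> str:
    """Toy heuristic standing in for a small classifier model."""
    text = prompt.lower()
    if len(text.split()) > 40 or any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    return "simple"


def pick_model(prompt: str) -> str:
    """Return the model ID to pass as modelId for this prompt."""
    return MODEL_BY_TIER[classify_complexity(prompt)]


print(pick_model("Translate hello to French."))
print(pick_model("Compare these two contracts and analyze the risk."))
```

Because the Converse request shape is the same for both tiers, only the `modelId` argument changes between the cheap path and the expensive one.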
1.4 · Cost optimization at design time
Cost is decided at the architecture stage, not at the bill stage. Three levers dominate.
- Right-size the model. The largest model is rarely the right model. Try the smallest capable model first, measure quality, escalate only if needed. A per-token cost differential of 10× or more between the smallest tier (such as Nova Micro) and the largest (such as Claude Opus) is typical.
- Pick the right inference mode. On-demand pricing for variable workloads, provisioned throughput for predictable steady traffic, batch inference for offline workloads (up to 50% cheaper). Mismatched mode is the #1 cause of bill shock.
- Cache aggressively. Bedrock prompt caching reuses repeated system prompts at a discount. Application-level caching of identical queries can eliminate 20–60% of inference calls. Semantic caching (vector-similarity match on prior queries) extends this further.
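Application-level caching of identical queries is the simplest of the three levers to prototype. The sketch below is a minimal in-memory version keyed on the model ID plus a whitespace- and case-normalized prompt; the class name is hypothetical, and a production cache would also need expiry and a shared store such as ElastiCache.

```python
import hashlib


class ResponseCache:
    """Exact-match cache keyed on model ID plus a normalized prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model_id: str, prompt: str) -> str:
        # Collapse whitespace and case so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model_id}\n{normalized}".encode()).hexdigest()

    def get_or_call(self, model_id: str, prompt: str, call_model) -> str:
        """Return a cached response, or call call_model(prompt) and store it."""
        key = self._key(model_id, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call_model(prompt)
        return self._store[key]
```

Tracking `hits` and `misses` makes it possible to measure the actual hit rate for a workload rather than assuming one. A semantic cache would replace the hash lookup with a vector-similarity search over prior prompts.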
1.5 · The Well-Architected GenAI lens, in summary
AWS Well-Architected gains one dimension for GenAI: Responsible AI. It threads across the existing six pillars rather than replacing them.
| Pillar | What to verify in a GenAI workload |
|---|---|
| Operational Excellence | Model versioning, prompt versioning, evaluation pipelines, automated rollback on quality regression. |
| Security | Data classification at ingest, KMS encryption, VPC isolation, IAM least-privilege per action group, Guardrails for input/output safety. |
| Reliability | Cross-region inference fallback, retry & backoff for throttling, graceful degradation when a model is unavailable. |
| Performance Efficiency | Right-sized models, response streaming, dynamic batching, semantic caching, parallel tool execution. |
| Cost Optimization | Token budgets, model cascading, batch inference, prompt caching, monitoring per-request cost. |
| Sustainability | Reuse of cached responses, batch over real-time when possible, choosing efficient models, regional placement. |
| Responsible AI | Bias evaluation, transparency & citation, opt-out, harm mitigation, fairness across user segments. |
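The "retry & backoff for throttling" item in the Reliability row can be illustrated with a generic wrapper. This is a sketch under stated assumptions: `ThrottledError` is a hypothetical stand-in for whatever throttling exception your SDK raises, and the delays are arbitrary defaults.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by the SDK."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(ThrottledError,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

With boto3 you may not need a wrapper at all: the client accepts a `botocore.config.Config(retries={"max_attempts": ..., "mode": "adaptive"})` object that enables built-in retry handling. A wrapper like this is mainly useful when a retry should also switch to a fallback model or Region.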
Chapter summary
Designing GenAI is choosing the right tool, the right pattern, and the right model for the constraints in front of you.
- GenAI fit first — validate that GenAI is the right tool before designing. Not every problem needs a foundation model.
- NFRs → AWS levers — map latency, cost, and compliance onto specific services. Compliance usually eliminates options first.
- Five core patterns — Direct API, RAG, Agents, Fine-Tuning, Multi-Model. Almost every workload is one of these.
- Pattern triggers — RAG when knowledge is external or fresh; Agents when the system must take action; Fine-Tuning is the last resort, not the first.
- Cost is architectural — right-size the model, pick the right inference mode, cache. Decided at design time, not after launch.
- Well-Architected + Responsible AI — weave the Responsible AI thread across all six pillars.
The exam rewards architecture-time decisions; it punishes ‘biggest model on every problem’.
Review Questions
Five scenario MCQs. Reveal the explanation only after you commit to an answer — the cognitive cost of guessing-then-checking is what builds exam memory.
- A. Deploy a fine-tuned model on SageMaker trained on all insurance documents.
- B. Use Amazon Bedrock Knowledge Bases with OpenSearch Serverless, applying metadata filtering by plan type, with VPC endpoints and KMS encryption.
- C. Use Amazon Kendra with an S3 data source and document-level access control lists, integrated with a Bedrock foundation model for response generation.
- D. Provide all insurance documents inside the system prompt of a Bedrock foundation model.
Answer & explanation
Correct: C. Amazon Kendra provides built-in document-level ACLs that map directly to the per-patient access requirement, and its native S3 connector handles PDF ingestion. Combined with a Bedrock model for natural-language response, this pattern satisfies access control, freshness, and compliance.
Why not B? Bedrock Knowledge Bases supports metadata filtering, but document-level user-aware ACLs require additional custom logic; Kendra offers this natively. Why not A? Fine-tuning bakes data into weights and cannot enforce per-user access. Why not D? Quarterly insurance corpora exceed any context window and offer no per-user filtering.
- A. Bedrock on-demand inference with parallel Lambda fan-out.
- B. Bedrock batch inference processing all reports as a single batch job.
- C. SageMaker endpoint with provisioned capacity.
- D. Bedrock with provisioned throughput.
Answer & explanation
Correct: B. Bedrock batch inference is purpose-built for high-volume offline jobs at a meaningful per-token discount versus synchronous inference. With no real-time requirement, every other option pays a real-time premium that batch avoids.
- A. An Amazon Bedrock Agent with action groups for order management and returns, plus a Knowledge Base for product information.
- B. A RAG pipeline on OpenSearch containing both product and order data.
- C. A fine-tuned model trained on product catalog and historical orders.
- D. A direct Bedrock InvokeModel call with the order API documented in the system prompt.
Answer & explanation
Correct: A. The requirement combines knowledge retrieval (product catalog) with multi-step actions (order lookup, returns) — exactly the agent pattern. Bedrock Agents support both action groups and Knowledge Bases natively in one runtime. RAG alone cannot act, fine-tuning cannot reach a live order system, and stuffing API specs into a prompt does not give the model a way to actually call them.
- A. Fine-tune a small Bedrock model on thousands of input/output examples.
- B. Build a RAG pipeline that retrieves the schema and includes it in every prompt.
- C. Use the Bedrock Converse API with tool use / structured output to enforce the schema.
- D. Continue pre-training a model on schema documentation.
Answer & explanation
Correct: C. Structured output via tool use enforces the schema at decoding time without retraining or retrieval overhead. Fine-tuning is overkill for format-only adaptation; RAG injects context but does not enforce structure; continued pre-training is even heavier than fine-tuning and inappropriate here.
- A. Bedrock provisioned throughput with a one-month commitment.
- B. Bedrock on-demand inference.
- C. A self-managed model on a SageMaker real-time endpoint.
- D. SageMaker serverless inference with a custom container.
Answer & explanation
Correct: B. On-demand pricing matches unpredictable, low-volume workloads with zero commitment and no capacity management. Provisioned throughput requires a commitment that does not match the volume profile; SageMaker options shift operational burden onto a team that explicitly cannot absorb it.
Select & Configure Foundation Models
2.1 · The Bedrock model landscape
Amazon Bedrock is a managed surface for foundation models from Anthropic, Meta, Amazon (Nova and Titan), Mistral, Cohere, AI21 Labs, DeepSeek, Writer, Luma, and Stability AI. You do not provision GPUs. You do not manage weights. You call an API. The exam expects you to know each family’s sweet spot well enough to pick from four choices under time pressure.
| Family | Sweet spot | Watch out for |
|---|---|---|
| Anthropic Claude (Sonnet · Haiku · Opus) | Long-context reasoning, tool use, instruction following, code, structured output. Default choice for agents and complex RAG. | Higher per-token cost than Haiku-tier models; Opus tier is slow. |
| Amazon Nova (Micro · Lite · Pro · Premier) | AWS-native, lowest cost for tiered workloads, native multimodal (Lite/Pro), tight Bedrock integration. | Newer ecosystem; some advanced reasoning still trails Claude/GPT-class peers. |
| Amazon Titan (Text · Embeddings · Image) | Embeddings (amazon.titan-embed-text-v2:0) are a default for RAG on AWS. Image generation when staying in-house. | For new generation work, AWS is positioning the Nova family alongside Titan Text. |
| Meta Llama | Open-weight reasoning, customer wants portability or self-hosting on SageMaker. | Capabilities lag closed-weight peers at the same parameter count. |
| Cohere Command / Embed | Multilingual embeddings, retrieval, classification, RAG with strong non-English support. | Smaller community for tool-use patterns. |
| Mistral | Cost-effective European hosting, fast small-model inference, function calling. | Smaller context windows on lower tiers. |
| Stability AI | Image generation (SD3, SDXL) when output must be stylistically tunable. | Not a chat model; do not pick it for text tasks. |
2.2 · The selection trade-off — one mental model, four axes
Every model selection question collapses into four axes. The exam phrases scenarios so that one axis dominates — identify it, and the answer falls out.
2.3 · Inference parameters that actually matter
Bedrock’s Converse API exposes the same handful of inference parameters across every provider. Most candidates can name them. Few can predict what changes when you turn each one. The exam loves this gap.
| Parameter | Effect | When to change it |
|---|---|---|
| temperature | Scales the logit distribution. Low (0–0.3) → near-deterministic, repeatable. High (0.7–1.0) → creative, variable. | Use 0–0.2 for extraction, classification, function-calling. Use 0.7+ for brainstorming, marketing copy, creative drafts. |
| top_p (topP in Converse) | Nucleus sampling: keep the smallest set of top-ranked tokens whose cumulative probability reaches p. Low p = narrower vocabulary. | Combine with low temperature for tight, factual output. Rarely tune both at once; pick one knob. |
| top_k | Keep only the top k next-token candidates. Hard cap on diversity. Not part of the Converse inferenceConfig; pass it through additionalModelRequestFields for models that support it. | Useful for very narrow domains (SQL, JSON only). Most providers default to a sensible value; leave it alone unless you have a measured reason. |
| max_tokens (maxTokens in Converse) | Hard cap on response length. Generation stops at max_tokens or a stop sequence, whichever comes first. | Always set it. It caps cost and stops runaway loops. Tune to the realistic 95th percentile of your task. |
| stopSequences | List of strings that, if emitted, terminate generation immediately. | Use for structured output (for example "\n\n" or a closing delimiter) and when chaining prompts. |
| system | Persistent role / persona / rules above the conversation. | Always set it. The system prompt is where guardrails, tone, and output schema live. |
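To see what temperature, top_k, and top_p do to a next-token distribution, it helps to apply them to a handful of made-up logits. The sketch below is a simplified illustration, not any provider's actual sampler: the logits are invented, and the order of operations (temperature, then top-k, then top-p) is an assumption that may differ between models.

```python
import math
from typing import Optional


def next_token_distribution(logits: dict, temperature: float = 1.0,
                            top_k: Optional[int] = None,
                            top_p: Optional[float] = None) -> dict:
    """Apply temperature, then top-k, then top-p, and renormalize."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: all mass on the best token.
        best = max(logits, key=logits.get)
        return {best: 1.0}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}  # stable softmax
    total = sum(exp.values())
    ranked = sorted(((tok, e / total) for tok, e in exp.items()),
                    key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break  # smallest prefix whose mass reaches top_p
        ranked = kept
    norm = sum(p for _, p in ranked)
    return {tok: p / norm for tok, p in ranked}


LOGITS = {"the": 2.0, "a": 1.0, "zebra": -1.0}  # invented values
print(next_token_distribution(LOGITS, temperature=0.2))
print(next_token_distribution(LOGITS, temperature=1.0, top_p=0.9))
```

Lowering the temperature concentrates probability on the leading token, while top_k and top_p remove the tail outright, which is why tuning both at once is seldom necessary.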
Configuring inference with the Converse API
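
    # Deterministic extraction: low temperature, narrow top_p, hard token ceiling.
    import json

    import boto3

    bedrock = boto3.client("bedrock-runtime")

    # Placeholder input; in practice this is the support ticket to parse.
    ticket_text = "Order 1234 from Acme Corp arrived damaged on 3 May."

    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
        system=[{"text": "Extract entities as JSON. "
                         "Output ONLY valid JSON, no prose."}],
        messages=[{
            "role": "user",
            "content": [{"text": ticket_text}],
        }],
        inferenceConfig={
            "temperature": 0.0,         # tight
            "topP": 0.1,                # narrow vocab
            "maxTokens": 512,           # hard ceiling
            "stopSequences": ["\n\n"],  # stop at the first blank line
        },
    )
    # Raises json.JSONDecodeError if the model returned anything but JSON.
    data = json.loads(resp["output"]["message"]["content"][0]["text"])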

    # Creative drafting: higher temperature, wide top_p.
    import boto3

    bedrock = boto3.client("bedrock-runtime")

    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        system=[{"text": "You are a senior product marketer. "
                         "Write in a direct, energetic voice."}],
        messages=[{
            "role": "user",
            "content": [{"text": "Draft 3 launch taglines for a "
                                 "developer-focused vector database."}],
        }],
        inferenceConfig={
            "temperature": 0.8,  # exploratory
            "topP": 0.95,
            "maxTokens": 800,
        },
    )
    print(resp["output"]["message"]["content"][0]["text"])

… n times and rank with a separate evaluator.
2.4 · Inference modes: on-demand, provisioned, batch
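Once you have picked a model, you pick how Bedrock serves it. The three modes (on-demand, provisioned throughput, batch) map cleanly to three traffic shapes, and the exam will give you the traffic shape and ask which mode fits. Cross-region inference, the last row of the table, is a routing option layered on top of on-demand, not a fourth mode.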
| Mode | Best for | Pricing | Watch out for |
|---|---|---|---|
| On-demand | Unpredictable, low-to-moderate volume; prototyping; bursty production traffic. | Pay per input + output token. No commitment. | Subject to account-level token-per-minute (TPM) and request-per-minute (RPM) quotas; throttling under bursts. |
| Provisioned throughput | Steady, high-volume production; latency-sensitive workloads needing capacity guarantees; custom (fine-tuned) models. | Hourly commitment per “model unit.” 1-month or 6-month terms. | You pay for the unit whether you use it or not. Wrong for spiky traffic. |
| Batch inference | Offline scoring, bulk summarization, embedding back-catalogs, eval datasets. | ~50% discount on input + output tokens. Async, hours-scale latency. | Not for real-time. Inputs/outputs are S3 files, not API responses. |
| Cross-region inference | Production workloads that need higher effective throughput than a single region’s quota allows. | Same as on-demand — routing happens transparently. | Data may transit additional regions; check residency rules before enabling. |
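The pricing columns above reduce to simple arithmetic. The sketch below compares the monthly bill for the three modes. Every price, the 730-hour month, and the idea that a single model unit covers the load are placeholder assumptions, not published AWS rates; substitute current figures before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def on_demand_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Pay-per-token cost for one month of traffic."""
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)


def compare_modes(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k,
                  model_units, hourly_rate_per_unit, needs_realtime,
                  batch_discount=0.5):
    """Return {mode: monthly cost} for the modes that fit the workload."""
    od = on_demand_cost(input_tokens, output_tokens,
                        price_in_per_1k, price_out_per_1k)
    costs = {
        "on-demand": od,
        # Provisioned throughput is billed per hour whether or not it is used.
        "provisioned": model_units * hourly_rate_per_unit * HOURS_PER_MONTH,
    }
    if not needs_realtime:
        # Batch is only an option when hours-scale latency is acceptable.
        costs["batch"] = od * (1 - batch_discount)
    return costs


# Hypothetical workload: 2B input tokens and 400M output tokens per month.
costs = compare_modes(
    input_tokens=2_000_000_000, output_tokens=400_000_000,
    price_in_per_1k=0.003, price_out_per_1k=0.015,   # placeholder prices
    model_units=1, hourly_rate_per_unit=20.0,        # placeholder rate
    needs_realtime=False,
)
cheapest = min(costs, key=costs.get)
```

Running the same comparison with needs_realtime=True drops batch from the candidates, which mirrors how the exam scenarios use the latency requirement to rule modes out before cost is even considered.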
2.5 · Squeezing cost without losing capability
Once a model is in production, three levers move the needle far more than swapping providers: prompt caching, model cascading, and distillation.
Prompt caching
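Bedrock can cache long, repeated prompt prefixes (system instructions, retrieved-context windows, few-shot examples). A cache hit bills the cached prefix tokens at a steep discount; tokens after the breakpoint are billed at the normal rate. For RAG workloads where a long shared prefix dominates the prompt, that works out to roughly 60–90% off input-token cost with no change in output quality. Mark cache breakpoints explicitly via the cachePoint content block in Converse, and place them after the static part of the prompt so the per-request text stays outside the cached prefix.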
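As an illustration of where a cache breakpoint goes, the sketch below builds the keyword arguments for a Converse call with a cachePoint block placed after the static system instructions. The helper name and the sample strings are hypothetical, and whether a given model supports caching (and its minimum prefix length) should be checked against current documentation. No network call is made here.

```python
def build_cached_request(model_id, static_instructions, user_question,
                         max_tokens=512):
    """Build Converse kwargs with a cache breakpoint after the static prefix."""
    return {
        "modelId": model_id,
        "system": [
            {"text": static_instructions},        # long, identical on every call
            {"cachePoint": {"type": "default"}},  # everything above is cacheable
        ],
        "messages": [
            # The per-request question stays outside the cached prefix.
            {"role": "user", "content": [{"text": user_question}]},
        ],
        "inferenceConfig": {"temperature": 0.0, "maxTokens": max_tokens},
    }


request = build_cached_request(
    model_id="anthropic.claude-3-5-haiku-20241022-v1:0",
    static_instructions="You answer questions about the returns policy. ...",
    user_question="Can I return an opened item?",
)
# With a configured client: resp = bedrock.converse(**request)
```

Keeping the breakpoint directly after the unchanging text means a change to the user question never invalidates the cached prefix.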
Model cascading (router pattern)
Send every request first to the cheapest model that might succeed. If a confidence check fails (low logprob, schema-validation error, explicit “I am not sure”), retry on a larger model. A common pattern: Haiku → Sonnet → Opus, with ~80% of traffic terminating at Haiku.
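The router pattern can be sketched in a few lines. This is one possible shape, not a Bedrock feature: the invoke callable, the tier names, and the choice of schema validation as the confidence check are all assumptions, and the fake invoke function stands in for a real model call.

```python
import json


def is_valid_entity_json(text, required_keys=("entities",)):
    """Confidence check: output must be a JSON object with the expected keys."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


def cascade(prompt, tiers, invoke, accept):
    """Try each tier in order; return (model_id, output) for the first accepted output.

    If no tier passes the check, the last tier's output is returned and the
    caller decides how to handle it.
    """
    result = None
    for model_id in tiers:
        output = invoke(model_id, prompt)
        result = (model_id, output)
        if accept(output):
            break
    return result


# Stand-in for a real model call: the small tier hedges, the larger one answers.
def fake_invoke(model_id, prompt):
    if model_id == "small-tier":
        return "I am not sure"
    return '{"entities": ["Acme Corp"]}'


model_used, answer = cascade(
    "Extract entities from: Acme Corp shipped late.",
    tiers=["small-tier", "mid-tier", "large-tier"],  # hypothetical tier names
    invoke=fake_invoke,
    accept=is_valid_entity_json,
)
```

Passing invoke and accept in as callables keeps the routing logic independent of any particular SDK, so the same loop works with a different confidence check per task.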
Distillation & fine-tuning for a smaller model
Frontier model meets quality, cost does not? Distill. Collect (input, frontier-output) pairs, then fine-tune a smaller model via Bedrock Custom Models or SageMaker. You trade one-time training cost for ongoing inference savings.
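Collecting the (input, frontier-output) pairs is the part of distillation that is easy to show in code. The sketch below writes them as JSON Lines. The prompt/completion field names and the teacher callable are assumptions: the schema a customization job expects depends on the base model and method, so confirm it before uploading a dataset.

```python
import json


def build_distillation_records(inputs, teacher, min_output_chars=1):
    """Pair each input with the teacher model's output, skipping empty answers."""
    records = []
    for text in inputs:
        output = teacher(text)
        if len(output.strip()) < min_output_chars:
            continue  # an empty teacher answer teaches the student nothing
        records.append({"prompt": text, "completion": output})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in for a call to the frontier model.
def fake_teacher(text):
    return "" if not text else f"Summary: {text[:40]}"


records = build_distillation_records(
    ["Ticket 1: printer offline since Monday.", "", "Ticket 3: invoice mismatch."],
    teacher=fake_teacher,
)
dataset = to_jsonl(records)
```

In practice the filter step matters as much as the collection step: the student model inherits whatever errors are left in the teacher outputs.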
Chapter summary
Model selection on AWS is a two-axis choice: family from the workload, tier from the dominant constraint.
- Two-axis selection — family from workload type; tier from dominant constraint (latency, cost, capability, or context).
- Mid-tier default — Sonnet / Nova Pro / Llama 70B. Step down or up only when a measurable signal forces it.
- Platform — Bedrock for serverless API access; SageMaker JumpStart when you need full control over the endpoint.
- Inference parameters: set temperature, maxTokens, and system on every call. Tune top_p only when low temperature alone is not tight enough.
- Throughput modes: on-demand for spiky; provisioned for steady-and-large; batch for offline. Cross-region profiles lift the ceiling without re-architecting.
- Cost optimization order — prompt caching → model cascading → distillation. Stop at the first rung that meets your budget.
The exam rewards picking the smallest model that meets the bar; it punishes ‘always Opus’.
Review Questions
Five scenario MCQs. Reveal the explanation only after you commit to an answer — the cognitive cost of guessing-then-checking is what builds exam memory.
- Use Amazon Textract to extract text and tables from the scanned documents, then use a Bedrock foundation model to structure the extracted data.
- Use a multimodal foundation model through Bedrock to directly process the scanned images and extract structured data.
- Train a custom model on SageMaker using labeled invoice samples.
- Use Amazon Comprehend to extract entities from scanned invoices.
Correct: A. Textract is purpose-built for OCR + table extraction on scanned documents and outperforms general multimodal models on precise structured extraction. Pairing it with a Bedrock FM to format the structured output is the canonical pipeline. (B) works but is less reliable than a specialist OCR. (C) needs labeled data and training effort that is unjustified when Textract exists. (D) cannot process images directly.
- Set temperature to 1.0 and Top-P to 1.0 for maximum consistency.
- Set temperature to 0 and use a stop sequence after the classification label.
- Use fine-tuning to ensure consistent outputs.
- Set Top-K to 1 and maximum tokens to 1000.
Correct: B. Temperature 0 makes decoding effectively greedy: the model picks the most likely next token at each step, so repeated calls return the same label. A stop sequence terminates generation right after the label so trailing tokens cannot reintroduce variability. (A) maximizes randomness, the opposite of the requirement. (C) may improve quality but does not guarantee determinism. (D) Top-K of 1 also forces greedy decoding, but a 1000-token ceiling with no stop sequence lets the model keep generating past the label, so it does not address determinism end-to-end.
- Continued pre-training on a large corpus of company documentation.
- Fine-tuning using the 500 example responses through Amazon Bedrock.
- Parameter-efficient fine-tuning (LoRA) on Amazon SageMaker.
- Prompt engineering with carefully selected few-shot examples drawn from the 500 responses.
Correct: D, then B if insufficient. Start with prompt engineering — cheapest, fastest, often sufficient for capturing voice and terminology. If quality plateaus, escalate to Bedrock fine-tuning with the 500 examples. (A) requires far more data and compute than the budget permits. (C) saves compute over full fine-tuning, but adds SageMaker complexity. Do not pay that price until prompt engineering has visibly failed.
- Anthropic Claude Opus on on-demand inference with temperature=0.7, results stored in S3.
- Anthropic Claude Sonnet via batch inference, temperature=0, prompt caching enabled, batch outputs versioned in S3 and (input-hash, output) cached in DynamoDB.
- Amazon Nova Micro on provisioned throughput with a 6-month commitment.
- A self-managed Llama 405B endpoint on SageMaker with autoscaling.
Correct: B. Steady, predictable, non-real-time volume is the textbook batch-inference shape — ~50% token discount with no operational change. temperature=0 plus a hash-keyed cache in DynamoDB gives replayable output for audit. Mid-tier Sonnet has the reasoning depth for 200-page documents at a fraction of Opus cost. (A) over-specs and uses creative temperature for an extraction task. (C) under-sizes capability. (D) imposes operational burden the scenario does not justify.
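The (input-hash, output) cache named in option B can be sketched as follows. A plain dict stands in for the DynamoDB table, and the key layout (model ID, prompt, and inference config hashed together) is an assumption about what a replayable audit trail would need, not a prescribed design.

```python
import hashlib
import json


def cache_key(model_id, prompt, inference_config):
    """Stable SHA-256 key over everything that can change the output."""
    payload = json.dumps(
        {"model": model_id, "prompt": prompt, "config": inference_config},
        sort_keys=True,        # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_invoke(store, invoke, model_id, prompt, inference_config):
    """Return the stored output for this exact request, invoking only on a miss."""
    key = cache_key(model_id, prompt, inference_config)
    if key in store:
        return store[key]
    output = invoke(model_id, prompt, inference_config)
    store[key] = output
    return output


# Stand-in for a real model call.
def fake_invoke(model_id, prompt, inference_config):
    return f"[{model_id}] summary of: {prompt[:30]}"


store = {}  # a dict here; a DynamoDB table keyed on the hash in the scenario
config = {"temperature": 0, "maxTokens": 1024}
first = cached_invoke(store, fake_invoke, "model-x", "Contract 17 full text", config)
replay = cached_invoke(store, fake_invoke, "model-x", "Contract 17 full text", config)
```

Including the inference config in the hash means a change to temperature or maxTokens produces a new entry instead of silently replaying an output generated under different settings.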
- Anthropic Claude Opus on provisioned throughput.
- Amazon Nova Pro via batch inference with prompt caching.
- Anthropic Claude Haiku (or Nova Lite) on on-demand inference, with maxTokens capped and a tight system prompt.
- A SageMaker JumpStart Llama 70B endpoint with auto-scaling.
Correct: C. Latency-dominated workloads point to small-tier models — Haiku, Nova Lite — on on-demand inference, which keeps the cold-path short. Capping maxTokens trims tail latency. (A) Opus on provisioned throughput is high-capacity but high-latency per token; over-spec’d for short replies. (B) batch is offline; it cannot meet a 1.5s SLA at all. (D) JumpStart adds endpoint operations the scenario does not justify, and 70B is overkill for short conversational replies.
You’ve reached the demo’s edge.
If the first two chapters earned the time you spent on them, the rest of the book is built the same way: decision-oriented, service-by-service, grounded in the exam’s twenty task statements. Below is what each format gets you and where to pick one up.
How the full edition continues
- Part I — Foundation Models (Chapters 1–6): you’ve seen 2 of 6. The remaining four cover Data Pipelines, Vector Stores, Retrieval (RAG), and Prompt Engineering.
- Part II — Implementation (Chapters 7–11): Bedrock vs SageMaker selection, Knowledge Bases, Agents, model evaluation, and deployment patterns.
- Part III — Security & Governance (Chapters 12–15): IAM scoping, Guardrails, PII redaction, and compliance.
- Part IV — Optimization (Chapters 16–18): cost, performance, monitoring.
- Part V — Evaluation & Troubleshooting (Chapters 19–20): metrics, drift, incident response.
- Back matter: 9-week study plan, glossary, exam-day cheat sheets.
Pick the format that fits
| Format | Best for | Delivery | Price | |
|---|---|---|---|---|
| Digital HTML | Reading on desktop or tablet: same experience as this demo, with all 20 chapters and back matter. | Single self-contained .html file. No DRM. Unlimited devices. | $29 | Buy Digital → |
| PDF | Offline study, print-friendly, annotation in any PDF reader. | Letter-size PDF with page numbers, running headers, and recto chapter starts. | $29 | Buy PDF → |
| Kindle | Reading on Kindle devices or the Kindle app — reflowable text. | EPUB delivered through Amazon Kindle Direct. | $19.99 | Get on Kindle → |
| Paperback | Physical reference — 6×9″ trim, perfect-bound, ships globally via Amazon. | Printed via Amazon KDP; same content as the digital editions. | $44.99 | Order Paperback → |
Was the demo useful?
If a service comparison felt thin, a decision table missed a corner case, or you spotted a fact that needs updating — write to press@minecloudcraftpress.com. Field reports from candidates studying for the exam are the only thing that keeps this guide honest. Replies usually land within 48 hours.
MineCloudCraft Press is the publishing arm of MineCloudCraft, an independent practice covering consultancy, training, and mentoring for teams building production AI on AWS.