Copyright & Disclaimer

Production Generative AI on AWS — A Field Manual for the Developer–Professional Exam
Edition 1.0, Demo (Chapters 1 and 2 only), published 2026.

Copyright © 2026 Pongo Tech OÜ. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law.

Published by MineCloudCraft Press, an imprint of Pongo Tech OÜ.

Independent Publication

This is an independent publication and is not affiliated with, endorsed by, sponsored by, or otherwise authorized by Amazon Web Services, Inc., Amazon.com, Inc., or any of their subsidiaries or affiliates. AWS®, Amazon Web Services®, Amazon Bedrock™, Amazon SageMaker™, and all other AWS service names, marks, and logos are trademarks of Amazon.com, Inc. or its affiliates. References to these marks in this guide are made for educational and informational purposes only.

The AWS Certified Generative AI Developer — Professional examination (AIP-C01) is administered by Amazon Web Services. This guide is a study aid prepared independently and does not represent official certification material.

No Warranty

The information in this guide is provided “as is” and without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, or non-infringement. AWS services, pricing, regional availability, and feature sets change frequently — always verify current details against official AWS documentation before making implementation or examination decisions. Neither the authors nor the publisher shall be liable for any loss of profit, business interruption, or any other commercial damages arising from use of this material.

Trademarks

All trademarks referenced in this work are the property of their respective owners. Use of a trademark in this guide does not imply endorsement of the guide by the trademark holder.

Set in

Source Serif 4 · Inter Tight · JetBrains Mono · Caveat

Format

Demo (Chapters 1–2)

How to Use This Guide

A study guide is only as good as the discipline you bring to it. This one is built for the candidate who has eight to ten weeks, a full-time job, and a healthy distrust of certification fluff.

The AWS Certified Generative AI Developer — Professional exam (AIP-C01) is among the first professional-level certifications focused entirely on building production GenAI systems on AWS. This is not a survey course in machine learning, nor a prompt-engineering quiz. The exam tests your ability to make architectural and operational decisions under realistic constraints — latency budgets, compliance regimes, cost ceilings, accuracy thresholds, and vendor trade-offs.

This guide is structured to mirror that decision-making muscle. Every chapter follows the same rhythm: concept first, then services, then decision framework, then practice. You will not find encyclopedic dumps of every AWS feature. You will find the features the exam actually probes, organized so you can recall them under time pressure. Along the way you’ll meet a small set of recurring visual primitives — mental-model figures, service spotlight cards, comparison tables, decision lists, and code-compare panes — that earn their place by carrying the highest-leverage decisions of each chapter.

Who this guide is for

AIP-C01 is a professional-level certification. The content of this book is correspondingly intermediate to advanced and assumes you have already built or shipped at least one cloud application on AWS — you have used IAM, VPCs, S3, Lambda, CloudWatch, and at least one compute or container service in anger — and that you have hands-on familiarity with at least one large language model and a basic understanding of how vector retrieval and embeddings work.

If you are entirely new to AWS or to generative AI, start with the AWS Cloud Practitioner and AI Practitioner certifications first; this guide will move faster than is comfortable for absolute beginners. We will not stop to explain what an S3 bucket is, what an IAM role looks like, what a VPC endpoint does, or what a foundation model is — the exam doesn’t and neither do we.

You are likely:

A backend or platform engineer with several years of AWS experience, now integrating foundation models into production systems.
An ML engineer with classical-ML background expanding into generative workloads.
A solutions architect needing to defend GenAI design choices to security, finance, and product stakeholders.
A consultant who needs to walk into client conversations with current, exam-grade fluency in the AWS GenAI stack.

Treat anything that lands as “new to me, but I’ve seen the building blocks” as the target reading level. If a chapter feels like it’s assuming things you’ve never touched (a VPC endpoint policy, a SageMaker endpoint, a Lambda execution role), pause — build the missing block in the console for half an hour — then come back. The exam is hands-on by design; the book is hands-on by design.

How to read this book

Each chapter is self-contained and can be read in isolation, but the parts build on each other. Domain 1 establishes vocabulary. Domain 2 covers integration patterns that the rest of the book assumes. Domains 3, 4, and 5 each focus on a non-functional concern: safety, efficiency, and quality. The pattern within each chapter is consistent:

Chapter Anatomy

What every chapter contains

Open · Learning objectives

Core · Concepts & AWS services

Pattern · Decision framework

Drill · Practice MCQs

Close · Summary & recap

Three callout types appear throughout. Read them — they carry the highest density of exam-relevant material:

A nine-week study plan

The plan below assumes about 10–12 hours per week of focused study. If you have more, accelerate; if you have less, stretch the schedule and prioritize Weeks 1, 4, and 9.

Recommended Weekly Schedule
Week	Focus	Outcome
1	Read Part I (Domain 1, Chapters 1–6) — foundation models, RAG, vector stores, prompts	Architect a RAG solution on paper
2	Hands-on lab: Bedrock + Knowledge Bases + OpenSearch Serverless	End-to-end RAG demo working
3	Read Part II (Domain 2, Chapters 7–11) — agents, deployment, FM APIs, MLOps	Build a working Bedrock Agent with tool calls
4	Read Part III (Domain 3, Chapters 12–15) — guardrails, encryption, governance, responsible AI	Configure Bedrock Guardrails + CloudTrail logging
5	Read Part IV (Domain 4, Chapters 16–18) — cost, latency, monitoring	Build a CloudWatch dashboard for an LLM workload
6	Read Part V (Domain 5, Chapters 19–20) — evaluation, troubleshooting	Implement an LLM-as-a-judge eval pipeline
7	Walk Back Matter A (Exam Strategy) + B (Glossary, spaced repetition)	Internalize the five-pass MCQ procedure
8	Practice exams · review missed questions · re-read flagged callouts	Score ≥ 80% on practice consistently
9	Final review · Cheat Sheets (Back Matter C) · book the exam	Pass on first attempt

What this guide is not

This is not an AWS service catalog, not a Python tutorial, and not a substitute for hands-on practice. Where source material includes long code listings, we have summarized the conceptual takeaway and pointed you to AWS documentation for current SDK syntax. Treat the official AWS documentation as canonical for any code you intend to ship. The Back Matter sections (Exam Strategy, Glossary, Cheat Sheets) are where the book’s decisions distill into something you can re-read in the fifteen minutes before the test — budget time for them.

A note from the author

I wrote this guide for myself first. I was preparing for the AIP-C01 exam in early 2026 and could not find a study resource that combined the depth I needed with the structure my brain wanted — concepts, then services, then decision frameworks, then drills, with no padding. So I built one. I sat the exam, I passed, and the material in your hands is the same material I used to get there.

This is not an official AWS publication, and it is not a replacement for hands-on practice or for the AWS documentation. It is an opinionated, decision-oriented study guide — a real candidate’s playbook rather than a marketing document. If it helped me pass, I’m confident it can help you pass too.

Good luck. Now: open Part I, brew something hot, and let’s begin.

End of Preface

Exam Overview

The AIP-C01 Exam, In One Chapter

Format, scoring, domain weightings, and a realistic look at what the question paper actually feels like.

The AWS Certified Generative AI Developer — Professional exam (code: AIP-C01) is a 180-minute, scenario-driven examination consisting of 75 questions — 65 scored and 10 unscored. You won’t be told which are which; treat them all as scored. It’s delivered through Pearson VUE testing centers and as an online proctored exam, and it carries a recommended prerequisite of two or more years of hands-on experience designing and operating GenAI workloads on AWS.

You do not need to memorize service quotas or current pricing. You do need to be able to look at a multi-paragraph scenario — complete with constraints, red herrings, and competing priorities — and pick the architecture that satisfies the requirements at the lowest reasonable cost and operational burden.

Domain weightings

Five domains are weighted as follows. The percentages are official AWS guidance; treat them as your study budget allocator.

AIP-C01 Domain Weighting
#	Domain	Weight	Tasks
1	Foundation Model Integration & Data Management	31%	1.1 – 1.6
2	Implementation & Integration	26%	2.1 – 2.5
3	Security, Compliance & Governance	20%	3.1 – 3.4
4	Operational Efficiency & Optimization	12%	4.1 – 4.3
5	Evaluation & Troubleshooting	11%	5.1 – 5.2

Question formats

The exam uses three question formats. Knowing the difference matters because the marking rules differ.

Multiple choice — one stem, four options, exactly one correct answer. The most common format.
Multiple response — one stem, five or more options, two or three correct answers. Partial credit is not awarded; you must select the exact correct subset.
Scenario / case study — a long preamble (architecture diagram, customer requirements, constraints) followed by 2–4 dependent questions. Read the preamble carefully before starting; the same setup feeds multiple questions.

Passing standard

The reported passing score is approximately 750 / 1000, but AWS uses a scaled scoring model with statistical equating. Aim for ≥ 80% on practice exams to be comfortable on test day. Score reports break results down by domain (“Meets Competencies / Needs Improvement”); use those to direct your final review week.

Time budget

180 minutes for 75 questions works out to about 2:24 per question on average. In practice, scenario questions consume 3–5 minutes each, while plain multiple-choice items can be answered in under a minute. Use the in-exam Flag for Review feature liberally: on the first pass, flag anything you cannot answer in 90 seconds, finish the easy questions, and circle back. Back Matter A (Exam Strategy) walks the full five-pass pacing plan in detail.

Time Budget

A realistic 180-minute pacing plan

Pass 1 · ~90 min · touch every question once, answer the easy ones, flag the hard

Pass 2 · ~50 min · work the flagged set deeply, commit to answers

Pass 3 · ~25 min · revisit still-flagged few with fresh eyes

Pass 4–5 · ~15 min · gut-check & verify nothing is blank
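
The arithmetic behind the pacing plan is easy to check. A minimal sketch, using only the numbers stated above; the constant and function names are illustrative, not part of any exam tooling:

```python
# Numbers taken from the exam overview above; names are illustrative.
TOTAL_MINUTES = 180
TOTAL_QUESTIONS = 75

# Minutes allotted to each pass of the pacing plan.
PASS_BUDGET = {"pass_1": 90, "pass_2": 50, "pass_3": 25, "pass_4_5": 15}


def average_per_question(total_minutes: int, questions: int) -> str:
    """Return the average time per question formatted as m:ss."""
    seconds = total_minutes * 60 // questions
    return f"{seconds // 60}:{seconds % 60:02d}"


print(average_per_question(TOTAL_MINUTES, TOTAL_QUESTIONS))  # 2:24
print(sum(PASS_BUDGET.values()) == TOTAL_MINUTES)            # True
```

If your practice runs show Pass 1 taking longer than 90 minutes, adjust the budget so the total still sums to 180.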

What the exam loves to test

Across all five domains, expect heavy emphasis on these decision pivots. These are the seams where two services overlap and the “right” answer depends on a single qualifier in the question:

Bedrock vs. SageMaker JumpStart vs. self-hosted — managed convenience vs. customization vs. control.
RAG vs. fine-tuning vs. continued pre-training — data freshness, knowledge depth, cost.
Amazon Bedrock Knowledge Bases vs. Kendra vs. raw OpenSearch — managed vs. document ACLs vs. flexibility.
On-demand vs. provisioned throughput vs. batch inference — latency, cost, predictability.
Bedrock Agents vs. custom orchestration with the Converse API — managed vs. flexible.
Bedrock Guardrails vs. application-level filters vs. Amazon Comprehend — safety surface area.

Each of these pivots gets its own decision framework in the chapters that follow.

End of Exam Overview

01 Part I · Chapter 1 · Task 1.1

Analyze Requirements & Design GenAI Solutions

Before you write a single line of Bedrock code, you have to translate a business problem into an architecture. This chapter teaches you how to decide whether GenAI is even the right tool, and — if it is — which of five canonical patterns to reach for.

GenAI in Real Life — The same $0.002 model call is a waste in one app and a steal in another. The four-question checklist in §1.1 exists to keep you on the right side of this line.

1.1 · Is this even a GenAI problem?

The most expensive mistake in generative AI is using a foundation model where a regular expression would do. Before you reach for Bedrock, work through a four-question checklist. If you cannot answer yes to at least three, your problem belongs to traditional ML, classic search, or simple business logic. The exam punishes over-engineering.

Does the task require language understanding, generation, or transformation? — summarization, drafting, translation, intent extraction. If the answer is “classification with structured features,” reach for Amazon SageMaker or Amazon Comprehend instead.
Is the input variable, unstructured, or open-ended? — free-form support tickets, PDFs, conversational queries. Foundation models excel at variability.
Can the system tolerate probabilistic output? — if every response must be 100% accurate (think: tax calculations, medical dosages), you need a deterministic system underneath, with the LLM only as an orchestrator.
Is sufficient context obtainable? — either through prompts, retrieval, or fine-tuning. If the answer lives in a database the model cannot reach, no amount of prompting will help.
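
The checklist reduces to a simple count. The sketch below is one way to encode the "at least three yeses" rule; the class and field names are assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class FitAnswers:
    """Yes/no answers to the four checklist questions (field names are illustrative)."""
    needs_language_capability: bool
    input_is_unstructured: bool
    tolerates_probabilistic_output: bool
    context_is_obtainable: bool


def is_genai_fit(answers: FitAnswers, threshold: int = 3) -> bool:
    """Apply the 'at least three yeses' rule from the checklist."""
    yes_count = sum([
        answers.needs_language_capability,
        answers.input_is_unstructured,
        answers.tolerates_probabilistic_output,
        answers.context_is_obtainable,
    ])
    return yes_count >= threshold


# Free-form support-ticket summarization: four yeses.
print(is_genai_fit(FitAnswers(True, True, True, True)))     # True
# Tax calculation over structured fields: only context is obtainable.
print(is_genai_fit(FitAnswers(False, False, False, True)))  # False
```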

Common GenAI use case categories

The AIP-C01 exam organizes GenAI use cases into five categories. Memorize the canonical AWS service for each — questions often hide the use case in a verb (“summarize”, “classify”, “generate”) rather than name it directly.

Canonical GenAI Use Cases & Default AWS Services
Category	Examples	Default service
Text generation	Email drafting, content creation, conversational assistants	Amazon Bedrock (Anthropic Claude, Meta Llama, Amazon Nova)
Code generation	Code completion, refactoring, test generation	Amazon Q Developer (Bedrock under the hood)
Summarization & extraction	Long-document summaries, structured field extraction	Bedrock + structured output (tool use / JSON mode)
Image & multimodal	Image creation, visual Q&A, document understanding	Bedrock (Stable Diffusion, Amazon Titan Image, Nova Canvas)
Search & knowledge	Natural-language Q&A over private data	Bedrock Knowledge Bases · Amazon Kendra · custom RAG on OpenSearch

1.2 · Functional & non-functional requirements

Once you have decided GenAI is appropriate, the requirements analysis you would do for any system splits into two buckets: functional and non-functional. The exam is unusually fond of latency / cost / compliance trade-offs, so each row in the table below is a likely question seam.

Functional requirements

What the solution must accomplish. For foundation-model workloads this includes: input modalities (text, images, audio, documents); expected output format and grounding requirements; required domain knowledge; integration points; and the interaction pattern (synchronous chat, streaming, batch). One requirement matters most: does the app need multi-turn conversational state? If yes, you need session management. If no, a stateless InvokeModel call works.
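
To make the multi-turn distinction concrete, here is a minimal in-memory session sketch. It assumes `client` is an object with a Converse-style `converse` method (for example `boto3.client("bedrock-runtime")`); the `ChatSession` class itself is hypothetical, not an AWS construct:

```python
from typing import Any


class ChatSession:
    """Keeps Converse-style message history in memory for one conversation."""

    def __init__(self, client: Any, model_id: str) -> None:
        self.client = client          # assumed: e.g. boto3.client("bedrock-runtime")
        self.model_id = model_id
        self.messages: list[dict] = []

    def ask(self, user_text: str) -> str:
        # Append the user turn, send the whole history, then store the reply.
        self.messages.append({"role": "user", "content": [{"text": user_text}]})
        resp = self.client.converse(modelId=self.model_id, messages=self.messages)
        reply = resp["output"]["message"]
        self.messages.append(reply)
        return reply["content"][0]["text"]
```

A stateless workload would skip the class entirely and send a single user message per call. In a deployed system the `messages` list would typically be persisted per session in a store such as DynamoDB or ElastiCache rather than held in process memory.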

Non-functional requirements

Latency, throughput, cost, availability, security, and compliance. These usually determine the architecture outright rather than merely influence it.

Non-Functional Requirements → AWS Mapping
Concern	Diagnostic question	AWS lever
Latency	Is end-user response time under 2 s critical?	Smaller model · Bedrock provisioned throughput · response streaming
Throughput	Sustained TPS at peak?	Provisioned throughput · SageMaker auto-scaling · async inference
Cost	Cost per inference / monthly ceiling?	Model right-sizing · Bedrock batch · prompt & semantic caching
Availability	RTO / RPO requirements?	Multi-region deployment · Bedrock cross-region inference
Security	Data classification & access control?	IAM · AWS KMS · VPC endpoints · Bedrock Guardrails
Compliance	HIPAA / PCI / GDPR / SOC 2?	Regional deployment · CloudTrail · AWS Config · data residency
Data volume	How much context / how much training data?	S3 · OpenSearch · Kendra · SageMaker training
Accuracy	Quality threshold & tolerance for errors?	Bedrock Guardrails · human-in-loop · automated evaluation

1.3 · The five canonical AWS GenAI patterns

Almost every architecture the exam asks you to design is a composition of these five patterns. Memorize their trigger conditions; the exam phrases scenarios specifically so that exactly one pattern fits.

Figure 1.1 · Mental Model

Pattern selection tree — which of the five fits the problem?

Walk down the tree from the top. Stop at the first “yes.” The exam writes scenarios so that exactly one branch fits cleanly.

Read it like this: the green branch is the cheapest pattern that satisfies the requirement. Every “no” spends more tokens, more engineering effort, or both. Always justify the next branch — the exam loves to test the over-engineered choice.

Pattern 1 · Direct API integration

The simplest pattern: your application calls InvokeModel or Converse on Amazon Bedrock, the foundation model returns a response, and your code post-processes it. Conversation state lives in your application layer (DynamoDB, ElastiCache, or memory). Compute is typically AWS Lambda or a container on Amazon ECS / EKS.

Use this pattern when the model’s training data is sufficient knowledge for the task — summarization, translation, brainstorming, generic Q&A, classification of free-text input. The moment you need your data in the response, you graduate to RAG.

InvokeModel vs. Converse: the same call, two eras. The older InvokeModel API requires a model-specific JSON body: Anthropic, Meta, and Cohere each expect a different shape, so swapping models means rewriting the request. The newer Converse API normalizes that with one request shape and one response shape across the Bedrock-hosted chat models that support it. Use Converse for any new conversational work and treat InvokeModel as legacy there; InvokeModel is still the call for embedding and image-generation models, which Converse does not cover.

InvokeModel (legacy) · Per-model JSON

import boto3, json

bedrock = boto3.client("bedrock-runtime")

# Anthropic-specific body shape — different
# for Llama, Cohere, Titan, etc.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [
        {"role": "user",
         "content": "Summarize CRISPR in 3 lines."}
    ],
}

resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps(body),
    contentType="application/json",
)

# Parse model-specific response shape
data = json.loads(resp["body"].read())
print(data["content"][0]["text"])

Couples your code to one model’s schema. Switching models means rewriting both the request body and the response parser.

Converse (recommended) · Provider-agnostic

import boto3

bedrock = boto3.client("bedrock-runtime")

# Same shape works for Claude, Llama, Nova,
# Cohere, Mistral — swap modelId only.
resp = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[
        {"role": "user",
         "content": [{"text": "Summarize CRISPR in 3 lines."}]}
    ],
    inferenceConfig={"maxTokens": 512},
)

# Uniform response shape
text = resp["output"]["message"]["content"][0]["text"]
print(text)
# resp["stopReason"] tells you why generation ended
# resp["usage"] gives input/output token counts

One call shape for every model. Native multi-turn, streaming (converse_stream), and tool use without per-provider hacks.
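
For the streaming variant mentioned above, a hedged sketch follows. It assumes `client` exposes `converse_stream` as the boto3 `bedrock-runtime` client does, and it reads only the text deltas from the event stream; the helper name `stream_text` is illustrative:

```python
from typing import Any, Iterator


def stream_text(client: Any, model_id: str, prompt: str) -> Iterator[str]:
    """Yield text fragments from converse_stream as they arrive."""
    resp = client.converse_stream(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    for event in resp["stream"]:
        # Text arrives in contentBlockDelta events; other event types are skipped.
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            yield delta["text"]
```

A caller would typically write each fragment to the user as it arrives, for example `for piece in stream_text(bedrock, model_id, "..."): print(piece, end="")`, which is what lowers perceived latency.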

Service

Amazon Bedrock Converse API

Provider-agnostic API that handles multi-turn message state, tool use, and streaming uniformly across Bedrock-hosted models. Replaces the older InvokeModel for all conversational workloads.

Use when

You want a single integration that works across Anthropic, Meta, Cohere, Mistral, and Amazon Nova models without rewriting your client code per provider.

Pattern 2 · Retrieval-Augmented Generation (RAG)

RAG is the dominant enterprise pattern. The workflow: embed the user query, search a vector store for relevant chunks, splice the retrieved text into the prompt, then call the foundation model. The model now “knows” about your private data without ever being trained on it.

RAG is the right answer when the source-of-truth changes more frequently than you can retrain a model (most enterprise data), when you must cite sources, or when the corpus is too large to fit in any context window. Two implementation paths:

Managed RAG — Amazon Bedrock Knowledge Bases handles ingestion, chunking, embedding, and retrieval. Lowest operational overhead. Choose this unless you need something it does not do.
Custom RAG — you build the pipeline yourself using Amazon OpenSearch Service (or Aurora pgvector), Bedrock embedding models, and your own orchestration. Choose this when you need fine control over chunking, hybrid search, custom metadata filtering, or non-AWS vector databases.
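
The four-step workflow (embed, search, splice, call) can be sketched without any AWS dependency. In the toy below, `embed` is a stand-in that counts words; a real pipeline would call an embedding model and a vector store instead, and the final model call is left out. All names and sample chunks are illustrative assumptions:

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Splice the retrieved chunks into the prompt as numbered context."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return ("Answer using only the numbered context and cite it.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


CHUNKS = [
    "Plan Gold covers physiotherapy up to 20 sessions per year.",
    "Plan Silver covers dental cleaning twice per year.",
    "Claims must be filed within 90 days of treatment.",
]

question = "How many physiotherapy sessions does Plan Gold cover?"
top = retrieve(question, CHUNKS, k=1)
print(build_prompt(question, top))
```

The prompt returned by `build_prompt` is what you would pass to the foundation model as the user message. The managed path performs the same steps behind a single service call.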

Pattern 3 · Agents and tool use

An agent extends a foundation model with the ability to take actions. The model decides which function to call, supplies arguments, the function executes (a Lambda function, an API call, a database query), the result is fed back, and the agent decides what to do next. This loop continues until the agent has enough information to respond.

Amazon Bedrock Agents wraps this with managed orchestration: you define an instruction, register action groups (Lambda functions or OpenAPI schemas), optionally attach knowledge bases, and Bedrock handles the reasoning loop. For more flexibility, build a custom agent on top of the Bedrock Converse API’s tool-use feature.
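
For the custom route, the loop described above can be sketched on top of Converse tool use. This is a minimal illustration, not the managed Agents implementation: `client` is assumed to behave like the boto3 `bedrock-runtime` client, and the `get_order_status` tool and its handler are hypothetical:

```python
from typing import Any, Callable


def run_agent(client: Any, model_id: str, user_text: str,
              tool_config: dict, handlers: dict[str, Callable[[dict], dict]],
              max_turns: int = 5) -> str:
    """Loop: call the model, run any requested tools, feed results back."""
    messages = [{"role": "user", "content": [{"text": user_text}]}]
    for _ in range(max_turns):
        resp = client.converse(modelId=model_id, messages=messages, toolConfig=tool_config)
        reply = resp["output"]["message"]
        messages.append(reply)
        if resp["stopReason"] != "tool_use":
            # Final answer: join the text blocks.
            return "".join(b["text"] for b in reply["content"] if "text" in b)
        results = []
        for block in reply["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                output = handlers[call["name"]](call["input"])
                results.append({"toolResult": {
                    "toolUseId": call["toolUseId"],
                    "content": [{"json": output}],
                }})
        # Tool results go back to the model as the next user turn.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_turns")


# Hypothetical tool definition in the Converse toolSpec shape.
ORDER_TOOL_CONFIG = {"tools": [{"toolSpec": {
    "name": "get_order_status",
    "description": "Look up the status of an order by ID.",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    }},
}}]}
```

The `max_turns` cap is a design choice of this sketch: it bounds cost and latency if the model keeps requesting tools. Everything inside the loop is what Bedrock Agents manages for you.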

Service

Amazon Bedrock Agents

Managed orchestration of multi-step tool-using workflows. Handles prompt construction, tool selection, parameter inference, and conversation state. Integrates natively with Lambda action groups and Knowledge Bases.

Use when

The task requires multi-step actions across systems (lookup → decide → act → confirm) and you do not want to hand-roll the orchestration loop in your application code.

Pattern 4 · Fine-tuning & custom models

Fine-tuning adapts a foundation model’s weights using your data. Reach for it when prompting and RAG cannot achieve the consistency you need — specialized output format, domain vocabulary, or brand voice. Bedrock supports continued pre-training and fine-tuning for select models; SageMaker JumpStart offers more model choice and full training control.

Pattern 5 · Multi-model & ensemble

This pattern combines several models in one workflow, each doing the job it is cheapest or best at. A small classifier routes traffic to a large generation model only when needed; an embedding model handles search while a generation model handles synthesis; or multiple models propose answers and a judge selects the best. Bedrock’s unified API makes this straightforward to compose: you can call Claude, Titan, and Stable Diffusion from one application without managing three SDKs.
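
The routing variant can be sketched as a small dispatcher. The model callables and the length-based classifier below are hypothetical stand-ins; in practice each callable would wrap a Bedrock call and the classifier would itself be a small model:

```python
from typing import Callable

# Hypothetical model callables: each takes a prompt and returns text.
ModelFn = Callable[[str], str]


def route(prompt: str, classify: Callable[[str], str],
          small_model: ModelFn, large_model: ModelFn) -> tuple[str, str]:
    """Send 'simple' prompts to the small model and everything else to the large one."""
    tier = "small" if classify(prompt) == "simple" else "large"
    model = small_model if tier == "small" else large_model
    return tier, model(prompt)


def length_classifier(prompt: str) -> str:
    """Toy stand-in for a small classifier model: short prompts count as simple."""
    return "simple" if len(prompt.split()) <= 12 else "complex"
```

Returning the chosen tier alongside the answer makes it easy to log how much traffic the large model actually receives.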

1.4 · Cost optimization at design time

Cost is decided at the architecture stage, not at the bill stage. Three levers dominate.

Right-size the model. The largest model is rarely the right model. Try the smallest capable model first, measure quality, and escalate only if needed. A per-token cost differential of 10× or more between a small model such as Nova Micro and a frontier model such as Claude Opus is typical.
Pick the right inference mode. On-demand pricing for variable workloads, provisioned throughput for predictable steady traffic, batch inference for offline workloads (up to 50% cheaper). Mismatched mode is the #1 cause of bill shock.
Cache aggressively. Bedrock prompt caching reuses repeated system prompts at a discount. Application-level caching of identical queries can eliminate 20–60% of inference calls. Semantic caching (vector-similarity match on prior queries) extends this further.
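
Application-level caching of identical queries is the simplest of these levers to illustrate. The sketch below assumes a hypothetical `call_model(model_id, prompt)` callable and keeps entries in a plain dictionary; a production version would more likely use a shared store with expiry:

```python
import hashlib
from typing import Callable


class ResponseCache:
    """Exact-match cache keyed on model ID plus normalized prompt text."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self.call_model = call_model   # hypothetical (model_id, prompt) -> text
        self.store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model_id: str, prompt: str) -> str:
        # Collapse case and whitespace so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model_id}\n{normalized}".encode()).hexdigest()

    def ask(self, model_id: str, prompt: str) -> str:
        key = self._key(model_id, prompt)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        self.store[key] = self.call_model(model_id, prompt)
        return self.store[key]
```

Tracking `hits` and `misses` lets you measure the share of inference calls the cache actually removes for your workload. Semantic caching would replace the hash lookup with a vector-similarity match.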

1.5 · The Well-Architected GenAI lens, in summary

AWS Well-Architected gains one dimension for GenAI: Responsible AI. It threads across the existing six pillars rather than replacing them.

Well-Architected GenAI Concerns by Pillar
Pillar	What to verify in a GenAI workload
Operational Excellence	Model versioning, prompt versioning, evaluation pipelines, automated rollback on quality regression.
Security	Data classification at ingest, KMS encryption, VPC isolation, IAM least-privilege per action group, Guardrails for input/output safety.
Reliability	Cross-region inference fallback, retry & backoff for throttling, graceful degradation when a model is unavailable.
Performance Efficiency	Right-sized models, response streaming, dynamic batching, semantic caching, parallel tool execution.
Cost Optimization	Token budgets, model cascading, batch inference, prompt caching, monitoring per-request cost.
Sustainability	Reuse of cached responses, batch over real-time when possible, choosing efficient models, regional placement.
Responsible AI	Bias evaluation, transparency & citation, opt-out, harm mitigation, fairness across user segments.
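
The Reliability row's "retry & backoff for throttling" can be sketched generically. The helper below is an assumption-laden illustration: `fn` is any zero-argument call, and `is_throttle` is a caller-supplied predicate that decides whether an exception represents throttling:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], is_throttle: Callable[[Exception], bool],
                      max_attempts: int = 5, base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn on throttling errors with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the last failed attempt.
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            # Delay ceiling doubles each attempt; jitter spreads out retries.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

With boto3, one plausible predicate checks for a `botocore.exceptions.ClientError` whose `response["Error"]["Code"]` is `"ThrottlingException"`. The `sleep` parameter is injectable so the helper can be tested without waiting.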

Pro Tip · Start simple, escalate only when the problem demands it

The same “walk down the tree” logic that picks a pattern also picks an implementation stack. Frameworks like LangGraph, Strands Agents, vector databases (OpenSearch Serverless, Pinecone, pgvector), and orchestrators like AWS Step Functions or Amazon Bedrock AgentCore are powerful — but each one adds operational surface, latency, and cost.

If your problem fits Pattern 1 — a single-turn or short-conversation prompt where you can hold context in DynamoDB, ElastiCache, or even Lambda memory and call Converse directly — do that. Don’t reach for a graph runtime to orchestrate one model call. Don’t stand up a vector store before you’ve confirmed the model can’t answer from its training data. Don’t wrap a Step Functions state machine around a single tool invocation.

The escalation ladder: in-process state & Converse → managed memory (DynamoDB / ElastiCache) → Bedrock Knowledge Bases for retrieval → Bedrock Agents for tool use → AgentCore or Step Functions for multi-agent / multi-step workflows → LangGraph / Strands when you’ve outgrown what managed services express. Move up the ladder only when the previous rung visibly fails. Every skipped rung is complexity you pay for in operations, debugging, and the exam. The simplest answer that meets the requirement is almost always right.

Chapter summary

Designing GenAI is choosing the right tool, the right pattern, and the right model for the constraints in front of you.

GenAI fit first — validate that GenAI is the right tool before designing. Not every problem needs a foundation model.
NFRs → AWS levers — map latency, cost, and compliance onto specific services. Compliance usually eliminates options first.
Five core patterns — Direct API, RAG, Agents, Fine-Tuning, Multi-Model. Almost every workload is one of these.
Pattern triggers — RAG when knowledge is external or fresh; Agents when the system must take action; Fine-Tuning is the last resort, not the first.
Cost is architectural — right-size the model, pick the right inference mode, cache. Decided at design time, not after launch.
Well-Architected + Responsible AI — weave the Responsible AI thread across all six pillars.

The exam rewards architecture-time decisions; it punishes “biggest model on every problem”.

Review Questions

Five scenario MCQs. Reveal the explanation only after you commit to an answer — the cognitive cost of guessing-then-checking is what builds exam memory.

Question 1

A healthcare company wants to build a chatbot that answers patient questions about their insurance coverage. The information lives in PDFs that are updated quarterly. The application must comply with HIPAA, and patients must only see information relevant to their specific plan. Which architecture best satisfies these requirements?

A. Deploy a fine-tuned model on SageMaker trained on all insurance documents.
B. Use Amazon Bedrock Knowledge Bases with OpenSearch Serverless, applying metadata filtering by plan type, with VPC endpoints and KMS encryption.
C. Use Amazon Kendra with an S3 data source and document-level access control lists, integrated with a Bedrock foundation model for response generation.
D. Provide all insurance documents inside the system prompt of a Bedrock foundation model.

Correct: C. Amazon Kendra provides built-in document-level ACLs that map directly to the per-patient access requirement, and its native S3 connector handles PDF ingestion. Combined with a Bedrock model for natural-language response, this pattern satisfies access control, freshness, and compliance.

Why not B? Bedrock Knowledge Bases supports metadata filtering, but document-level user-aware ACLs require additional custom logic; Kendra offers this natively. Why not A? Fine-tuning bakes data into weights and cannot enforce per-user access. Why not D? Quarterly insurance corpora exceed any context window and offer no per-user filtering.

Question 2

A financial-services firm needs to summarize 50,000 quarterly earnings reports overnight. Real-time output is not required and cost optimization is the primary concern. Which approach fits?

A. Bedrock on-demand inference with parallel Lambda fan-out.
B. Bedrock batch inference processing all reports as a single batch job.
C. SageMaker endpoint with provisioned capacity.
D. Bedrock with provisioned throughput.

Correct: B. Bedrock batch inference is purpose-built for high-volume offline jobs at a meaningful per-token discount versus synchronous inference. With no real-time requirement, every other option pays a real-time premium that batch avoids.
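
A batch job of this kind reads its requests from a JSONL file in S3, one record per line. The sketch below builds such a file in memory; the record ID scheme and prompt wording are illustrative assumptions, and the `modelInput` body reuses the Anthropic request shape from the InvokeModel example in §1.3:

```python
import json


def build_batch_records(reports: list[str], max_tokens: int = 512) -> str:
    """Return JSONL text: one record per report, each wrapping a model request body."""
    lines = []
    for i, report in enumerate(reports, start=1):
        record = {
            "recordId": f"REPORT{i:05d}",   # illustrative ID scheme
            "modelInput": {                  # Anthropic-style body, model-specific
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [{"role": "user",
                              "content": f"Summarize this earnings report:\n{report}"}],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

The resulting text would be uploaded to S3 and referenced when the batch job is created; check the current Bedrock documentation for the exact job parameters and per-job record limits before relying on this shape.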

Question 3

A retail company wants an assistant that answers product questions, checks order status, and processes returns. It must call the order-management API and a product-catalog database. Which architecture is most suitable?

A. An Amazon Bedrock Agent with action groups for order management and returns, plus a Knowledge Base for product information.
B. A RAG pipeline on OpenSearch containing both product and order data.
C. A fine-tuned model trained on product catalog and historical orders.
D. A direct Bedrock InvokeModel call with the order API documented in the system prompt.

Correct: A. The requirement combines knowledge retrieval (product catalog) with multi-step actions (order lookup, returns) — exactly the agent pattern. Bedrock Agents support both action groups and Knowledge Bases natively in one runtime. RAG alone cannot act, fine-tuning cannot reach a live order system, and stuffing API specs into a prompt does not give the model a way to actually call them.

Question 4

A team is choosing between RAG and fine-tuning to make a foundation model produce strict JSON output matching an internal schema. Output format consistency is the only requirement; the underlying knowledge is generic. Which option is most appropriate and cost-effective?

A. Fine-tune a small Bedrock model on thousands of input/output examples.
B. Build a RAG pipeline that retrieves the schema and includes it in every prompt.
C. Use the Bedrock Converse API with tool use / structured output to enforce the schema.
D. Continue pre-training a model on schema documentation.

Correct: C. Structured output via tool use enforces the schema at decoding time without retraining or retrieval overhead. Fine-tuning is overkill for format-only adaptation; RAG injects context but does not enforce structure; continued pre-training is even heavier than fine-tuning and inappropriate here.
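
One way to apply option C is to declare a single tool whose input schema is the target JSON schema and force the model to call it. The sketch below assumes a Converse-style `client` and a model that supports forcing a specific tool (not every model does); the schema and tool name are hypothetical. It also re-checks required fields in code as a precaution:

```python
from typing import Any

# Hypothetical internal schema for illustration.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string"},
        "priority": {"type": "integer"},
    },
    "required": ["category", "priority"],
}


def extract_structured(client: Any, model_id: str, text: str) -> dict:
    """Force the model to call one tool whose input schema is the target JSON schema."""
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": text}]}],
        toolConfig={
            "tools": [{"toolSpec": {
                "name": "record_ticket",
                "description": "Record the ticket fields.",
                "inputSchema": {"json": TICKET_SCHEMA},
            }}],
            # Require this specific tool instead of letting the model choose.
            "toolChoice": {"tool": {"name": "record_ticket"}},
        },
    )
    for block in resp["output"]["message"]["content"]:
        if "toolUse" in block:
            data = block["toolUse"]["input"]
            missing = [k for k in TICKET_SCHEMA["required"] if k not in data]
            if missing:
                raise ValueError(f"missing required fields: {missing}")
            return data
    raise ValueError("model returned no tool call")
```

The tool is never executed; its arguments are the structured output.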

Question 5

A startup is prototyping a customer-support assistant. Traffic is unpredictable, will likely be low-volume for the first six months, and the team has no capacity for ongoing infrastructure work. Which inference mode is the best starting point?

A. Bedrock provisioned throughput with a one-month commitment.
B. Bedrock on-demand inference.
C. A self-managed model on a SageMaker real-time endpoint.
D. SageMaker serverless inference with a custom container.

Correct: B. On-demand pricing matches unpredictable, low-volume workloads with zero commitment and no capacity management. Provisioned throughput requires a commitment that does not match the volume profile; SageMaker options shift operational burden onto a team that explicitly cannot absorb it.

End of Chapter 1

02 Part I · Chapter 2 · Task 1.2

Select & Configure Foundation Models

Picking a model is not a vibe check. It is a constrained optimization across capability, latency, cost, context window, modality, and deployment surface. This chapter teaches you to read a scenario, list the constraints, and converge on the one model the exam expects.

GenAI in Real Life — A password-reset assistant that takes 6 seconds to answer sees a 31% abandonment rate. Same product, same user, same prompt; the only thing that changed was which model answered.

2.1 · The Bedrock model landscape

Amazon Bedrock is a managed surface for foundation models from Anthropic, Meta, Amazon, Mistral, Cohere, Stability AI, AI21, DeepSeek, Writer, and Luma. You do not provision GPUs. You do not manage weights. You call an API. The exam expects you to know each family’s sweet spot well enough to pick from four choices under time pressure.

Bedrock model families — what each is good at
Family	Sweet spot	Watch out for
Anthropic Claude Sonnet · Haiku · Opus	Long-context reasoning, tool use, instruction following, code, structured output. Default choice for agents and complex RAG.	Higher per-token cost than Haiku-tier models; Opus tier is slow.
Amazon Nova Micro · Lite · Pro · Premier	AWS-native, lowest cost for tiered workloads, native multimodal (Lite/Pro), tight Bedrock integration.	Newer ecosystem; some advanced reasoning still trails Claude/GPT-class peers.
Amazon Titan Text · Embeddings · Image	Embeddings (`amazon.titan-embed-text-v2:0`) are a default for RAG on AWS. Image generation when staying in-house.	For new generation work, AWS is positioning the Nova family alongside Titan Text.
Meta Llama	Open-weight reasoning, customer wants portability or self-hosting on SageMaker.	Capabilities lag closed-weight peers at the same parameter count.
Cohere Command / Embed	Multilingual embeddings, retrieval, classification, RAG with strong non-English support.	Smaller community for tool-use patterns.
Mistral	Cost-effective European hosting, fast small-model inference, function calling.	Smaller context windows on lower tiers.
Stability AI	Image generation (SD3, SDXL) when output must be stylistically tunable.	Not a chat model; do not pick it for text tasks.

2.2 · The selection trade-off — one mental model, four axes

Every model selection question collapses into four axes. The exam phrases scenarios so that one axis dominates — identify it, and the answer falls out.

Figure 2.1 · Mental Model

The four-axis model selection radar

When two axes pull in opposite directions, the dominant constraint — latency, cost, capability, or context — decides.

Figure 2.1 · Model tier trade-off radar. All four axes are upward goods: capability and context measure raw power, while speed and affordability invert latency and cost so that a wider polygon always means “better.” Small models dominate the speed / affordability side, frontier models reach further on capability / context, and the mid tier sits in the middle, which is usually the right starting point.
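
The radar can be read as a small lookup: score each tier on the four axes, then pick the tier that wins on whichever axis the scenario makes dominant. The sketch below assumes illustrative 1 to 3 scores that only mirror the figure's shape (they are not benchmarks); the `Tier` class and `pick_tier` function are hypothetical names.

```python
from dataclasses import dataclass

AXES = ("capability", "context", "speed", "affordability")


@dataclass(frozen=True)
class Tier:
    name: str
    capability: int
    context: int
    speed: int
    affordability: int


# Illustrative 1-3 scores that mirror the radar's shape; they are not benchmarks.
TIERS = (
    Tier("small", capability=1, context=1, speed=3, affordability=3),
    Tier("mid", capability=2, context=2, speed=2, affordability=2),
    Tier("frontier", capability=3, context=3, speed=1, affordability=1),
)


def pick_tier(dominant_axis=None):
    """Return the tier that scores highest on the dominant axis.

    With no dominant axis, fall back to the mid tier as the starting point.
    """
    if dominant_axis is None:
        return TIERS[1]
    if dominant_axis not in AXES:
        raise ValueError(f"unknown axis: {dominant_axis!r}")
    return max(TIERS, key=lambda tier: getattr(tier, dominant_axis))


print(pick_tier("speed").name)       # latency-dominated scenario
print(pick_tier("capability").name)  # reasoning-dominated scenario
print(pick_tier().name)              # nothing dominates
```

A latency constraint maps to the `speed` axis and a cost constraint to `affordability`, matching the inversion described in the caption.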

2.3 · Inference parameters that actually matter

Bedrock’s Converse API exposes the same handful of inference parameters across every provider. Most candidates can name them. Few can predict what changes when you turn each one. The exam loves this gap.

Inference parameters — what they do, when to change them
Parameter	Effect	When to change it
`temperature`	Scales the logit distribution. Low (0–0.3) → deterministic, repeatable. High (0.7–1.0) → creative, variable.	Use 0–0.2 for extraction, classification, function-calling. Use 0.7+ for brainstorming, marketing copy, creative drafts.
`top_p`	Nucleus sampling: keep the smallest set of most-likely tokens whose cumulative probability reaches `p`. Low `p` = narrower vocabulary.	Lower it together with a low `temperature` only when temperature alone is not tight enough. Otherwise pick one knob rather than tuning both.
`top_k`	Keep only the top `k` next-token candidates. Hard cap on diversity. In Converse it sits outside `inferenceConfig` and is passed per model through `additionalModelRequestFields`.	Useful for very narrow domains (SQL, JSON only). Most providers default to a sensible value, so leave it alone unless you have a measured reason.
`max_tokens`	Hard cap on response length. Generation stops at `max_tokens` or stop sequence, whichever first.	Always set it. It caps cost and stops runaway loops. Tune to the realistic 95th percentile of your task.
`stopSequences`	List of strings that, if emitted, terminate generation immediately.	Use for structured output (for example `"\n\n"` or a closing delimiter) and when chaining prompts.
`system`	Persistent role / persona / rules above the conversation.	Always set it. The system prompt is where guardrails, tone, and output schema live.
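
To see what the first three knobs do, it helps to run them on a toy next-token distribution. The sketch below is a plain-Python illustration of temperature scaling, top-k, and top-p filtering. The logits are made up, the function names are hypothetical, and real providers may apply these steps in a different order or with different tie-breaking.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities. Temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for temperature in (0.2, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(x, 3) for x in probs])
```

Lower temperature concentrates almost all probability on the top token, which is why low settings behave close to greedy decoding. A temperature of exactly 0 cannot go through this formula and is conventionally treated as picking the most likely token.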

Configuring inference with the Converse API

Deterministic extraction · Low temp · structured

import boto3, json

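bedrock = boto3.client("bedrock-runtime")

# Placeholder input so the snippet runs; in production this is the raw ticket body.
ticket_text = "Order 1182 arrived damaged. Please refund my card."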

resp = bedrock.converse(
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
    system=[{"text":
        "Extract entities as JSON. "
        "Output ONLY valid JSON, no prose."}],
    messages=[{
        "role": "user",
        "content": [{"text": ticket_text}],
    }],
    inferenceConfig={
        "temperature": 0.0,    # tight
        "topP": 0.1,          # narrow vocab
        "maxTokens": 512,     # hard ceiling
        "stopSequences": ["\n\n"],
    },
)
data = json.loads(resp["output"]["message"]
                  ["content"][0]["text"])

Use for entity extraction, classification, JSON output, function-calling, anything that must round-trip cleanly into downstream code.

Creative drafting · Higher temp · open-ended

import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    system=[{"text":
        "You are a senior product marketer. "
        "Write in a direct, energetic voice."}],
    messages=[{
        "role": "user",
        "content": [{"text":
            "Draft 3 launch taglines for a "
            "developer-focused vector database."}],
    }],
    inferenceConfig={
        "temperature": 0.8,    # exploratory
        "topP": 0.95,
        "maxTokens": 800,
    },
)
print(resp["output"]["message"]
     ["content"][0]["text"])

Use for ideation, copywriting, multiple-candidate generation. Pair with sampling: call n times and rank with a separate evaluator.
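
The "call n times and rank" pattern can be sketched independently of any model. In the minimal sketch below, `generate` and `score` are injected callables: in practice `generate` would wrap the higher-temperature `converse` call above and `score` would be a separate evaluator. The canned generator and length-based scorer here are stand-ins so the example runs offline.

```python
import itertools


def best_of_n(generate, score, prompt, n=3):
    """Call generate(prompt) n times and return the highest-scoring candidate."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


# Stand-ins so the sketch runs offline: a canned generator and a toy scorer.
_canned = itertools.cycle([
    "Vectors, fast.",
    "Search that thinks in embeddings.",
    "Ship semantic search this sprint.",
])


def fake_generate(prompt):
    """Return the next canned tagline, ignoring the prompt."""
    return next(_canned)


def shortest_wins(candidate):
    """Toy evaluator: prefer shorter taglines."""
    return -len(candidate)


print(best_of_n(fake_generate, shortest_wins, "Draft a tagline", n=3))
```

Keeping the evaluator separate from the generator means the ranking rule can change (a rubric prompt, a classifier, a human vote) without touching the sampling code.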

2.4 · Inference modes — on-demand, provisioned, batch

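Once you have picked a model, you pick how Bedrock serves it. The three modes map cleanly to three traffic shapes, and the exam will give you the traffic shape and ask which mode fits. Cross-region inference, the last row of the table, is a routing option on top of on-demand rather than a fourth mode.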

Bedrock inference modes — matching workload to mode
Mode	Best for	Pricing	Watch out for
On-demand	Unpredictable, low-to-moderate volume; prototyping; bursty production traffic.	Pay per input + output token. No commitment.	Subject to account-level token-per-minute (TPM) and request-per-minute (RPM) quotas; throttling under bursts.
Provisioned throughput	Steady, high-volume production; latency-sensitive workloads needing capacity guarantees; custom (fine-tuned) models.	Hourly commitment per “model unit.” 1-month or 6-month terms.	You pay for the unit whether you use it or not. Wrong for spiky traffic.
Batch inference	Offline scoring, bulk summarization, embedding back-catalogs, eval datasets.	~50% discount on input + output tokens. Async, hours-scale latency.	Not for real-time. Inputs/outputs are S3 files, not API responses.
Cross-region inference	Production workloads that need higher effective throughput than a single region’s quota allows.	Same as on-demand — routing happens transparently.	Data may transit additional regions; check residency rules before enabling.
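
Because batch inputs are files rather than API calls, the first practical step is turning prompts into JSONL records. The sketch below assumes one record per line with a `recordId` and a model-native `modelInput`, here shaped like an Anthropic Messages request body. Other model families expect a different `modelInput`, and the `CALL` ID scheme and helper names are illustrative assumptions. Uploading the file to S3 and creating the job are not shown.

```python
import json


def to_batch_record(record_id, prompt, max_tokens=512):
    """Serialize one JSONL line: a record ID plus the model-native request body."""
    return json.dumps({
        "recordId": record_id,
        "modelInput": {
            # Assumed Anthropic Messages body; other families use other shapes.
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "temperature": 0,
            "messages": [
                {"role": "user", "content": [{"type": "text", "text": prompt}]}
            ],
        },
    })


def build_batch_file(prompts):
    """Join one record per prompt into JSONL text ready to write to a file."""
    return "\n".join(
        to_batch_record(f"CALL{index:07d}", prompt)
        for index, prompt in enumerate(prompts, start=1)
    )


print(build_batch_file(["Summarize document A.", "Summarize document B."]))
```

Stable record IDs matter because batch results come back as files too, and the ID is what ties each output line to its input.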

2.5 · Squeezing cost without losing capability

Once a model is in production, three levers move the needle far more than swapping providers: prompt caching, model cascading, and distillation.

Figure 2.2 · Mental Model

The cost-optimization escalation ladder

Climb only as high as your budget demands. Each rung up adds engineering effort and operational surface; each delivers larger compounding savings.

Figure 2.2 · Cost-optimization escalation ladder. Three levers, ordered by effort. Climb the ladder; do not skip rungs. Caching costs nothing in quality and is the right first move; cascading buys 50–80% on top; distillation is the most expensive intervention and the most permanent one.

Prompt caching

Bedrock can cache long, repeated prompt prefixes (system instructions, retrieved-context windows, few-shot examples). A cache hit bills those tokens at a steep discount. For most RAG workloads, that is 60–90% off input tokens at zero quality cost. Mark cache breakpoints explicitly via the cachePoint content block in Converse.
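
A cache breakpoint is just another block placed after the content to be reused. The sketch below builds a Converse request with a `cachePoint` after a long system prefix and computes a rough hit ratio from the response's `usage` fields. The helper names are hypothetical, and the ratio assumes `inputTokens` counts only uncached tokens. Caching also depends on the model supporting it and on the prefix meeting a minimum length.

```python
def build_cached_request(model_id, shared_prefix, question, max_tokens=512):
    """Build Converse keyword arguments with a cache breakpoint after the prefix."""
    return {
        "modelId": model_id,
        "system": [
            {"text": shared_prefix},
            # Everything before this block is eligible for caching.
            {"cachePoint": {"type": "default"}},
        ],
        "messages": [{"role": "user", "content": [{"text": question}]}],
        "inferenceConfig": {"temperature": 0.0, "maxTokens": max_tokens},
    }


def cache_hit_ratio(usage):
    """Share of input tokens served from cache, given a Converse usage dict.

    Assumes inputTokens excludes tokens read from or written to the cache.
    """
    read = usage.get("cacheReadInputTokens", 0)
    written = usage.get("cacheWriteInputTokens", 0)
    fresh = usage.get("inputTokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0


request = build_cached_request("example-model-id", "Long shared instructions", "Q1?")
print(request["system"][1])
print(cache_hit_ratio({"inputTokens": 100, "cacheReadInputTokens": 900}))
```

Tracking the hit ratio per route is a cheap way to confirm the breakpoint sits after the truly stable part of the prompt rather than after something that changes per request.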

Model cascading (router pattern)

Send every request first to the cheapest model that might succeed. If a confidence check fails (low logprob, schema-validation error, explicit “I am not sure”), retry on a larger model. A common pattern: Haiku → Sonnet → Opus, with ~80% of traffic terminating at Haiku.
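
The router logic is small enough to sketch in full. Below, each tier is a `(name, call)` pair ordered cheapest first, and the confidence check is schema validation reduced to "does the answer parse as a JSON object". The function names and the lambda stand-ins for model calls are hypothetical; a real check might also inspect explicit uncertainty phrases.

```python
import json


def is_valid_json_object(text):
    """Acceptance check: the answer must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except (TypeError, ValueError):
        return False


def cascade(prompt, tiers, accept=is_valid_json_object):
    """Try tiers in order (cheapest first) and stop at the first accepted answer.

    Returns (tier_name, answer). If no tier passes, the last tier's answer
    is returned so the caller can decide how to handle it.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = None, None
    for name, call in tiers:
        answer = call(prompt)
        if accept(answer):
            return name, answer
    return name, answer


# Stand-ins for model calls so the sketch runs offline.
tiers = [
    ("small", lambda prompt: "I am not sure"),
    ("mid", lambda prompt: '{"label": "refund"}'),
    ("large", lambda prompt: '{"label": "other"}'),
]
print(cascade("Classify this ticket", tiers))
```

Logging the tier name that terminated each request gives the traffic split directly, which is the number that determines how much the cascade actually saves.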

Distillation & fine-tuning for a smaller model

Frontier model meets quality, cost does not? Distill. Collect (input, frontier-output) pairs, then fine-tune a smaller model via Bedrock Custom Models or SageMaker. You trade one-time training cost for ongoing inference savings.
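
Collecting the (input, frontier-output) pairs is mostly bookkeeping. The sketch below writes one JSON object per line with `prompt` and `completion` fields. Those field names are an assumption for illustration, since the expected training format depends on the target model, and `teacher` is a hypothetical callable standing in for the frontier model.

```python
import json


def build_distillation_jsonl(inputs, teacher):
    """Pair each input with the teacher's output, one JSON object per line."""
    lines = []
    for text in inputs:
        output = teacher(text).strip()
        if not output:
            continue  # skip empty teacher outputs rather than train on them
        # Field names are an assumption; check the target model's format.
        lines.append(json.dumps({"prompt": text, "completion": output}))
    return "\n".join(lines)


def fake_teacher(text):
    """Stand-in for a frontier model call so the sketch runs offline."""
    return "" if not text.strip() else f"LABEL: {text.upper()}"


dataset = build_distillation_jsonl(
    ["refund request", "   ", "late delivery"], fake_teacher
)
print(dataset)
```

Filtering at collection time (empty outputs here, failed validations in practice) keeps the student from learning the teacher's mistakes.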

Chapter summary

Model selection on AWS is a two-axis choice: family from the workload, tier from the dominant constraint.

Two-axis selection — family from workload type; tier from dominant constraint (latency, cost, capability, or context).
Mid-tier default — Sonnet / Nova Pro / Llama 70B. Step down or up only when a measurable signal forces it.
Platform — Bedrock for serverless API access; SageMaker JumpStart when you need full control over the endpoint.
Inference parameters — set temperature, maxTokens, and system on every call. Tune top_p only when low temperature alone is not tight enough.
Throughput modes — on-demand for spiky; provisioned for steady-and-large; batch for offline. Cross-region profiles lift the ceiling without re-architecting.
Cost optimization order — prompt caching → model cascading → distillation. Stop at the first rung that meets your budget.

The exam rewards picking the smallest model that meets the bar; it punishes ‘always Opus’.

Review Questions

Five scenario MCQs. Reveal the explanation only after you commit to an answer — the cognitive cost of guessing-then-checking is what builds exam memory.

Question 1

A company needs a document-understanding system that processes invoices containing both text and tables. The invoices are scanned images in various formats. The system must extract structured data including vendor name, invoice number, line items, and totals. Which approach is most appropriate?

A. Use Amazon Textract to extract text and tables from the scanned documents, then use a Bedrock foundation model to structure the extracted data.
B. Use a multimodal foundation model through Bedrock to directly process the scanned images and extract structured data.
C. Train a custom model on SageMaker using labeled invoice samples.
D. Use Amazon Comprehend to extract entities from scanned invoices.

Correct: A. Textract is purpose-built for OCR + table extraction on scanned documents and outperforms general multimodal models on precise structured extraction. Pairing it with a Bedrock FM to format the structured output is the canonical pipeline. (B) works but is less reliable than a specialist OCR. (C) needs labeled data and training effort that is unjustified when Textract exists. (D) cannot process images directly.

Question 2

An application requires a foundation model that produces deterministic outputs for a classification task. The same input must produce the same label across multiple requests. Which configuration is most important?

A. Set temperature to 1.0 and Top-P to 1.0 for maximum consistency.
B. Set temperature to 0 and use a stop sequence after the classification label.
C. Use fine-tuning to ensure consistent outputs.
D. Set Top-K to 1 and maximum tokens to 1000.

Correct: B. Temperature 0 makes decoding greedy (always the most likely next token), which is as close to deterministic as the sampling settings allow. A stop sequence terminates generation right after the label so trailing tokens cannot reintroduce variability. (A) maximizes randomness, the opposite of the requirement. (C) may improve quality but does not guarantee determinism. (D) limits choices but does not address determinism end-to-end.

Question 3

A company wants to adapt a foundation model to generate customer-support responses in their brand voice using product-specific terminology. They have 500 examples of ideal responses and a limited compute budget. Which customization approach should they try first?

A. Continued pre-training on a large corpus of company documentation.
B. Fine-tuning using the 500 example responses through Amazon Bedrock.
C. Parameter-efficient fine-tuning (LoRA) on Amazon SageMaker.
D. Prompt engineering with carefully selected few-shot examples drawn from the 500 responses.

Correct: D, then B if insufficient. Start with prompt engineering — cheapest, fastest, often sufficient for capturing voice and terminology. If quality plateaus, escalate to Bedrock fine-tuning with the 500 examples. (A) requires far more data and compute than the budget permits. (C) saves compute over full fine-tuning, but adds SageMaker complexity. Do not pay that price until prompt engineering has visibly failed.

Question 4

A retail bank deploys an internal assistant that summarizes 200-page loan files for underwriters. Volume is steady at ~4,000 documents per day, and the same input must produce the same summary on re-run for audit. Cost is the second concern after auditability. Which configuration best fits?

A. Anthropic Claude Opus on on-demand inference with temperature=0.7, results stored in S3.
B. Anthropic Claude Sonnet via batch inference, temperature=0, prompt caching enabled, batch outputs versioned in S3 and (input-hash, output) cached in DynamoDB.
C. Amazon Nova Micro on provisioned throughput with a 6-month commitment.
D. A self-managed Llama 405B endpoint on SageMaker with autoscaling.

Correct: B. Steady, predictable, non-real-time volume is the textbook batch-inference shape — ~50% token discount with no operational change. temperature=0 plus a hash-keyed cache in DynamoDB gives replayable output for audit. Mid-tier Sonnet has the reasoning depth for 200-page documents at a fraction of Opus cost. (A) over-specs and uses creative temperature for an extraction task. (C) under-sizes capability. (D) imposes operational burden the scenario does not justify.
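
The hash-keyed cache in option B can be sketched with a plain dict standing in for the DynamoDB table. The key hashes the model ID, prompt, and inference settings together, so a change to any of them produces a new entry instead of silently reusing an old summary. The function names and the `fake_summarize` stand-in are hypothetical.

```python
import hashlib
import json


def request_key(model_id, prompt, inference_config):
    """Stable SHA-256 key over everything that can change the output."""
    payload = json.dumps(
        {"modelId": model_id, "prompt": prompt, "inferenceConfig": inference_config},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def summarize_once(store, summarize, model_id, prompt, inference_config):
    """Return the stored summary for this exact request, computing it only once."""
    key = request_key(model_id, prompt, inference_config)
    if key not in store:
        store[key] = summarize(prompt)
    return store[key]


calls = []


def fake_summarize(prompt):
    """Stand-in for the model call; records each invocation."""
    calls.append(prompt)
    return f"summary of {len(prompt)} chars"


store = {}  # a dict standing in for the DynamoDB table
config = {"temperature": 0, "maxTokens": 1024}
first = summarize_once(store, fake_summarize, "example-model-id", "loan file text", config)
second = summarize_once(store, fake_summarize, "example-model-id", "loan file text", config)
print(first == second, len(calls))
```

For audit, the stored entry is what guarantees a replay returns the same summary, independent of how repeatable the model itself is.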

Question 5

A team is building a real-time customer-facing chatbot. P95 latency must stay under 1.5 seconds for short conversational replies. Cost matters but is secondary. Which Bedrock configuration is the strongest first choice?

A. Anthropic Claude Opus on provisioned throughput.
B. Amazon Nova Pro via batch inference with prompt caching.
C. Anthropic Claude Haiku (or Nova Lite) on on-demand inference, with maxTokens capped and a tight system prompt.
D. A SageMaker JumpStart Llama 70B endpoint with auto-scaling.

Correct: C. Latency-dominated workloads point to small-tier models — Haiku, Nova Lite — on on-demand inference, which keeps the cold-path short. Capping maxTokens trims tail latency. (A) Opus on provisioned throughput is high-capacity but high-latency per token; over-spec’d for short replies. (B) batch is offline; it cannot meet a 1.5s SLA at all. (D) JumpStart adds endpoint operations the scenario does not justify, and 70B is overkill for short conversational replies.

End of Chapter 2

Demo edition ★

You’ve reached the demo’s edge.

The Field Manual doesn’t stop here — eighteen more chapters carry the same treatment across every AIP-C01 domain.

If the first two chapters earned the time you spent on them, the rest of the book is built the same way: decision-oriented, service-by-service, grounded in the exam’s twenty task statements. Below is what each format gets you and where to pick one up.

How the full edition continues

Part I — Foundation Models (Chapters 1–6): you’ve seen 2 of 6. The remaining four cover Data Pipelines, Vector Stores, Retrieval (RAG), and Prompt Engineering.
Part II — Implementation (Chapters 7–11): Bedrock vs SageMaker selection, Knowledge Bases, Agents, model evaluation, and deployment patterns.
Part III — Security & Governance (Chapters 12–15): IAM scoping, Guardrails, PII redaction, and compliance.
Part IV — Optimization (Chapters 16–18): cost, performance, monitoring.
Part V — Evaluation & Troubleshooting (Chapters 19–20): metrics, drift, incident response.
Back matter: 9-week study plan, glossary, exam-day cheat sheets.

Pick the format that fits

The full edition is available in four formats
Format	Best for	Delivery
Digital HTML	Reading on desktop or tablet; same experience as this demo, with all 20 chapters and back matter.	Single self-contained `.html` file. No DRM. Unlimited devices.
PDF	Offline study, print-friendly, annotation in any PDF reader.	Letter-size PDF with page numbers, running headers, and recto chapter starts.
Kindle	Reading on Kindle devices or the Kindle app; reflowable text.	EPUB delivered through Amazon Kindle Direct.
Paperback	Physical reference: 6×9″ trim, perfect-bound, ships globally via Amazon.	Printed via Amazon KDP; same content as the digital editions.

Was the demo useful?

If a service comparison felt thin, a decision table missed a corner case, or you spotted a fact that needs updating — write to press@minecloudcraftpress.com. Field reports from candidates studying for the exam are the only thing that keeps this guide honest. Replies usually land within 48 hours.

MineCloudCraft Press is the publishing arm of MineCloudCraft, an independent practice covering consultancy, training, and mentoring for teams building production AI on AWS.

End of Demo edition
