Exhaustive Guide for the AWS Certified Solutions Architect – Professional (SAP-C02) Exam
This document aims to provide a comprehensive and detailed guide for the AWS Certified Solutions Architect – Professional (SAP-C02) exam, covering all topics and subdomains as per the official AWS guide. The content will be structured to facilitate study, quick reference, and in-depth understanding of the concepts required for certification.
Exam Domains and Weighting
The SAP-C02 exam is divided into four main domains, each with a specific weighting in the assessed content:
- Domain 1: Design Solutions for Organizational Complexity (26% of content)
- Domain 2: Design for New Solutions (29% of content)
- Domain 3: Continuous Improvement for Existing Solutions (25% of content)
- Domain 4: Accelerate Workload Migration and Modernization (20% of content)
Domain 1: Design Solutions for Organizational Complexity
This domain focuses on the ability to design solutions that address the challenges of complex and large-scale environments, including network strategies, security, governance, and account management.
Topics and Subtopics:
1.1. Architect network connectivity strategies.
- Knowledge of:
- AWS Global Infrastructure
- AWS networking concepts (e.g., Amazon Virtual Private Cloud [Amazon VPC], AWS Direct Connect, AWS VPN, transit routing, AWS container services)
- Hybrid DNS concepts (e.g., Amazon Route 53 Resolver, on-premises DNS integration)
- Network segmentation (e.g., subnets, IP addressing, inter-VPC connectivity)
- Network traffic monitoring
1.2. Architect security strategies for complex environments.
- Knowledge of:
- AWS Organizations and Organizational Units (OUs)
- Service Control Policies (SCPs)
- Advanced AWS Identity and Access Management (IAM) (e.g., permission boundaries, ABAC, cross-account roles)
- AWS Security Hub, Amazon GuardDuty, Amazon Detective, AWS Config, AWS Inspector
- Key management (e.g., AWS Key Management Service [AWS KMS] multi-Region)
- Data security (e.g., encryption in transit and at rest, tokenization, data masking)
- Secrets management (e.g., AWS Secrets Manager)
- Auditing and logging (e.g., AWS CloudTrail, Amazon CloudWatch Logs)
- Principle of least privilege
1.3. Architect governance and compliance strategies.
- Knowledge of:
- AWS Control Tower
- AWS Config and Conformance Packs
- AWS Audit Manager
- AWS License Manager
- AWS Systems Manager
- AWS Resource Access Manager (AWS RAM)
- AWS Service Catalog
- AWS Organizations and account management
- Tags and tagging policies
1.4. Architect account and multi-account management strategies.
- Knowledge of:
- AWS Organizations and OUs
- AWS Control Tower
- AWS Single Sign-On (AWS SSO)
- AWS IAM Identity Center
- Landing Zones
- AWS Resource Access Manager (AWS RAM)
- Resource sharing across accounts
- Consolidated billing
Domain 2: Design for New Solutions
This domain covers the ability to design new solutions on AWS, focusing on service selection, architectural patterns, resilience, scalability, and performance optimization.
Topics and Subtopics:
2.1. Architect compute solutions.
- Knowledge of:
- Amazon EC2 (e.g., Spot/mixed fleets, Auto Scaling)
- AWS Lambda
- Containers (e.g., Amazon Elastic Container Service [Amazon ECS], Amazon Elastic Kubernetes Service [Amazon EKS], AWS Fargate)
- AWS Outposts
- AWS Wavelength
- AWS Local Zones
- AWS Batch
- AWS ParallelCluster
- AWS Serverless Application Model (AWS SAM)
- AWS Amplify
2.2. Architect storage solutions.
- Knowledge of:
- Amazon S3 (e.g., storage classes, access points, lifecycle policies)
- Amazon Elastic File System (Amazon EFS)
- Amazon FSx (e.g., FSx for Lustre, FSx for Windows File Server, FSx for NetApp ONTAP)
- Amazon Elastic Block Store (Amazon EBS)
- AWS Backup
- Amazon Glacier
- AWS Storage Gateway
- AWS Snow Family
2.3. Architect database solutions.
- Knowledge of:
- Amazon Relational Database Service (Amazon RDS)
- Amazon Aurora
- Amazon DynamoDB
- Amazon DocumentDB (with MongoDB compatibility)
- Amazon Neptune
- Amazon Redshift
- Amazon ElastiCache
- Amazon Quantum Ledger Database (Amazon QLDB)
- Amazon Managed Blockchain
- Amazon Keyspaces (for Apache Cassandra)
- Amazon Timestream
- Amazon Kinesis
- AWS Database Migration Service (AWS DMS)
- AWS Schema Conversion Tool (AWS SCT)
2.4. Architect networking solutions.
- Knowledge of:
- Amazon VPC (e.g., subnets, route tables, internet gateways, NAT gateways, VPC endpoints)
- AWS Transit Gateway
- AWS PrivateLink
- VPC Peering
- AWS Direct Connect
- AWS VPN
- Amazon Route 53 (e.g., multi-Region DNS, routing policies)
- Amazon CloudFront (e.g., signed URLs, signed cookies, geo-restrictions)
- AWS Global Accelerator
- AWS App Mesh
- AWS Cloud Map
- VPC Lattice
2.5. Architect messaging and integration solutions.
- Knowledge of:
- Amazon Simple Notification Service (Amazon SNS)
- Amazon Simple Queue Service (Amazon SQS)
- Amazon EventBridge
- AWS Step Functions
- Amazon Kinesis (e.g., Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics)
- Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- AWS AppSync
- Amazon API Gateway
- AWS IoT Core
Domain 3: Continuous Improvement for Existing Solutions
This domain focuses on optimizing and enhancing existing architectures to improve performance, resilience, security, and operational efficiency.
Topics and Subtopics:
3.1. Architect solutions for high availability and resilience.
- Knowledge of:
- Disaster recovery strategies (e.g., RTO, RPO, DR patterns: backup and restore, pilot light, warm standby, multi-site active-active)
- AWS Well-Architected Framework (Reliability Pillar)
- Amazon Route 53 (e.g., failover, latency-based routing)
- AWS Global Accelerator
- Elastic Load Balancing (ELB)
- Amazon EC2 Auto Scaling
- Amazon CloudFront
- AWS Shield, AWS WAF
- Circuit Breaker pattern
- Sagas pattern
3.2. Architect solutions for performance optimization.
- Knowledge of:
- AWS Well-Architected Framework (Performance Efficiency Pillar)
- Compute optimization (e.g., instance selection, container optimization, Lambda provisioned concurrency)
- Storage optimization (e.g., S3 class selection, EBS, EFS, FSx optimization)
- Database optimization (e.g., DB type selection, query optimization, caching with ElastiCache)
- Network optimization (e.g., Direct Connect, Global Accelerator, CloudFront)
- Caching (e.g., Amazon CloudFront, Amazon ElastiCache, AWS Global Accelerator)
- Content Delivery Networks (CDNs)
3.3. Architect solutions for monitoring and observability.
- Knowledge of:
- Amazon CloudWatch (e.g., metrics, logs, alarms, dashboards)
- AWS X-Ray
- AWS CloudTrail
- AWS Config
- Amazon Kinesis (for log and metric ingestion)
- AWS Systems Manager (e.g., OpsCenter, Explorer)
- AWS Health Dashboard
- AWS Trusted Advisor
- Feature flags with CloudWatch Evidently
3.4. Architect solutions for continuous security.
- Knowledge of:
- AWS Security Hub, Amazon GuardDuty, Amazon Detective, AWS Inspector
- AWS WAF, AWS Shield
- AWS Firewall Manager
- AWS Certificate Manager (ACM)
- AWS Secrets Manager
- AWS Systems Manager Parameter Store
- AWS Single Sign-On (AWS SSO) / AWS IAM Identity Center
- Principle of least privilege
- Security automation (e.g., AWS Security Hub, AWS Config Rules)
Domain 4: Accelerate Workload Migration and Modernization
This domain covers strategies and tools for migrating and modernizing workloads to AWS, including assessment, planning, and execution of complex migrations.
Topics and Subtopics:
4.1. Architect migration strategies.
- Knowledge of:
- Migration strategies (e.g., 7 Rs: Rehost, Replatform, Refactor/Re-architect, Repurchase, Retain, Retire, Relocate)
- AWS Migration Acceleration Program (MAP)
- AWS Application Migration Service (AWS MGN)
- AWS Database Migration Service (AWS DMS)
- AWS Schema Conversion Tool (AWS SCT)
- AWS DataSync
- AWS Snow Family
- AWS Transfer Family
- AWS Direct Connect
- AWS VPN
- AWS CloudEndure Migration
4.2. Architect modernization strategies.
- Knowledge of:
- Application modernization (e.g., microservices, serverless, containers)
- Code refactoring
- Use of managed services (e.g., Amazon RDS, Amazon DynamoDB, AWS Lambda)
- Service integration (e.g., Amazon EventBridge, AWS Step Functions)
- Modernization patterns (e.g., Strangler Fig, Branch by Abstraction)
- Modernization tools (e.g., AWS Migration Hub Refactor Spaces, AWS App2Container)
4.3. Architect cost optimization strategies.
- Knowledge of:
- AWS Well-Architected Framework (Cost Optimization Pillar)
- AWS pricing models (e.g., On-Demand, Reserved Instances [RIs], Savings Plans, Spot Instances)
- Resource optimization (e.g., right-sizing, shutting down idle resources)
- AWS Cost Explorer, AWS Budgets, AWS Cost Anomaly Detection
- AWS Compute Optimizer
- AWS Trusted Advisor (Cost Optimization)
- Amazon S3 lifecycle policies
- Network optimization (e.g., CloudFront, Direct Connect usage)
Relevant New Features (Examples)
- IAM Roles Anywhere
- VPC Lattice
- CloudFront Signed URLs
- Feature flags with CloudWatch Evidently
Comparative Tables and Practical Scenarios
This section will be filled with detailed comparative tables, practical scenario examples that appear in the exam, architectural decisions, and common pitfalls, consolidating learning and facilitating quick reference.
This document will be continuously expanded and detailed in each of the listed topics and subtopics, providing a complete guide for SAP-C02 exam preparation.
1.1. Architect network connectivity strategies.
Fundamental and Advanced Concepts
Network connectivity in a complex AWS environment is the backbone of any successful architecture. It ranges from basic connectivity within a VPC to hybrid and global network topologies. The SAP-C02 exam expects candidates to have a deep understanding of how to design, secure, and optimize these connections.
AWS Global Infrastructure: AWS has a global infrastructure composed of Regions, Availability Zones (AZs), Local Zones, Wavelength Zones, and Points of Presence (PoPs). Understanding how these components interlink is fundamental. Regions are geographically isolated, while AZs within a Region are connected with low latency. Local Zones and Wavelength Zones extend AWS services closer to end-users or 5G devices, respectively, for ultra-low latency applications.
Amazon VPC: The VPC is the fundamental building block for networking on AWS. It allows you to provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including selecting your own IP address range, creating subnets, and configuring route tables and network gateways.
Hybrid Connectivity: Hybrid connectivity refers to connecting your on-premises network to your VPC on AWS. The two main options for this are:
- AWS Direct Connect (DX): Provides a dedicated, private network connection from your data center to AWS. Direct Connect can reduce network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.
- AWS Site-to-Site VPN: Creates a secure connection between your on-premises network and your VPCs. VPN connections are established over the public Internet and are encrypted with IPsec.
Transit Routing: In a complex network topology with multiple VPCs and on-premises connections, transit routing becomes essential. AWS Transit Gateway (TGW) is a managed service that acts as a central hub to connect your VPCs and on-premises networks. It simplifies network topology, eliminating the need for complex and full-mesh VPC Peering connections.
Hybrid DNS: Amazon Route 53 Resolver enables bidirectional DNS resolution between your on-premises networks and AWS. You can create Route 53 Resolver endpoints in your VPCs that forward DNS queries from your on-premises systems to Route 53 and vice versa. This is crucial for applications in both environments to communicate using familiar domain names.
Network Segmentation: Network segmentation is a fundamental security practice that involves dividing a network into smaller, isolated subnets. On AWS, this is achieved through the use of VPCs, subnets, Network Access Control Lists (NACLs), and Security Groups. Segmentation helps contain the impact of a security breach and enforce granular security policies.
Network Traffic Monitoring: To ensure the security and performance of your network, it is essential to monitor traffic. AWS offers several tools for this, including:
- VPC Flow Logs: Captures information about the IP traffic going to and from network interfaces in your VPC.
- AWS CloudTrail: Records all API calls made in your account, including those related to network configuration changes.
- Amazon CloudWatch: Collects and monitors network metrics from various AWS services, allowing you to create alarms and dashboards.
Comparisons Between Services and When to Use Each One
AWS Transit Gateway vs. VPC Peering:
- VPC Peering: Is a network connection between two VPCs that allows them to communicate as if they were in the same network. VPC Peering is ideal for a small number of VPCs. However, it does not support transit routing (i.e., if VPC A is peered with VPC B and VPC B is peered with VPC C, VPC A cannot communicate with VPC C through VPC B). This leads to a complex mesh topology as the number of VPCs increases.
- AWS Transit Gateway: Acts as a central hub that connects multiple VPCs and on-premises networks. It simplifies network management, as each VPC only needs to connect to the Transit Gateway. TGW supports transit routing and is the recommended solution for architectures with a large number of VPCs.
AWS Direct Connect vs. AWS Site-to-Site VPN:
- AWS Site-to-Site VPN: Is a quick and low-cost solution to establish a secure connection between your on-premises network and AWS. However, as the connection is made over the public Internet, performance can be inconsistent.
- AWS Direct Connect: Offers a private, dedicated connection with high bandwidth and consistent performance. It is the ideal choice for workloads requiring high performance, low latency, and enhanced security. The initial cost of Direct Connect is higher than VPN.
Configurations, Limitations, Best Practices, and Use Cases
AWS Transit Gateway:
- Configuration: Create a Transit Gateway in your account and attach your VPCs and VPN/Direct Connect connections to it. Configure Transit Gateway route tables to control traffic flow.
- Limitations: There are limits to the number of attachments, route tables, and routes per route table. Refer to AWS documentation for the latest limits.
- Best Practices: Use a dedicated network account to host the Transit Gateway. Use separate route tables to isolate different environments (e.g., production, development, test). Monitor Transit Gateway traffic using CloudWatch.
- Use Cases: Enterprise networks with multiple VPCs, hybrid connectivity at scale, sharing centralized services.
AWS Direct Connect:
- Configuration: Request a Direct Connect connection through the AWS Console. Work with an AWS partner to establish the physical connection between your data center and a Direct Connect location. Configure Virtual Interfaces (VIFs) to connect to your VPCs.
- Limitations: Bandwidth is limited by the port you provision (from 50 Mbps to 100 Gbps). The physical location of your data center can be a limiting factor.
- Best Practices: Use redundant Direct Connect connections for high availability. Use Link Aggregation Group (LAG) to combine multiple connections into a single, higher-bandwidth logical connection. Monitor connection utilization.
- Use Cases: Migration of large data volumes, low-latency workloads, applications requiring high bandwidth.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
Hub-and-Spoke Network Architecture with Transit Gateway: This is the most common architectural model for complex networks on AWS. The Transit Gateway acts as the hub, and VPCs and on-premises connections are the spokes. This simplifies network management and improves scalability.
Security: Use NACLs and Security Groups to control traffic at the subnet and instance level, respectively. Use AWS Network Firewall to inspect and filter traffic in your VPC. Use AWS Shield for DDoS protection.
Resilience: Design your network to be highly available, using multiple AZs and Regions. Use redundant VPN and Direct Connect connections. Use Route 53 for DNS failover.
Governance: Use AWS Organizations to centrally manage your accounts and network policies. Use AWS Control Tower to provision a secure and well-architected multi-account environment. Use AWS Resource Access Manager (RAM) to share network resources, such as the Transit Gateway, across accounts.
Migration and Modernization Strategies
When migrating to AWS, start with a solid network foundation. Use Transit Gateway to connect your VPCs and your on-premises network. As you modernize your applications, use services like AWS PrivateLink to securely expose services between VPCs without exposing them to the public Internet.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A company has 50 VPCs in a single Region and needs all of them to communicate with each other. What is the most scalable and manageable solution?
- Architectural Decision: Use AWS Transit Gateway. VPC Peering would require a mesh of almost 1225 connections, which would be unmanageable.
Common Pitfall: Not understanding the cost implications of data transfer between AZs. Data transfer between AZs incurs costs, which can be significant in high-traffic architectures. Optimize the location of your resources to minimize inter-AZ traffic.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
Hub-and-Spoke Network Architecture with Transit Gateway: This is the most common architectural model for complex networks on AWS. The Transit Gateway acts as the hub, and VPCs and on-premises connections are the spokes. This simplifies network management and improves scalability.
Security: Use NACLs and Security Groups to control traffic at the subnet and instance level, respectively. Use AWS Network Firewall to inspect and filter traffic in your VPC. Use AWS Shield for DDoS protection.
Resilience: Design your network to be highly available, using multiple AZs and Regions. Use redundant VPN and Direct Connect connections. Use Route 53 for DNS failover.
Governance: Use AWS Organizations to centrally manage your accounts and network policies. Use AWS Control Tower to provision a secure and well-architected multi-account environment. Use AWS Resource Access Manager (RAM) to share network resources, such as the Transit Gateway, across accounts.
Migration and Modernization Strategies
When migrating to AWS, start with a solid network foundation. Use Transit Gateway to connect your VPCs and your on-premises network. As you modernize your applications, use services like AWS PrivateLink to securely expose services between VPCs without exposing them to the public Internet.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A company has 50 VPCs in a single Region and needs all of them to communicate with each other. What is the most scalable and manageable solution?
- Architectural Decision: Use AWS Transit Gateway. VPC Peering would require a mesh of almost 1225 connections, which would be unmanageable.
Common Pitfall: Not understanding the cost implications of data transfer between AZs. Data transfer between AZs incurs costs, which can be significant in high-traffic architectures. Optimize the location of your resources to minimize inter-AZ traffic.
Details on IAM policies, SCPs, multi-VPC and multi-Region networks, landing zones, observability, costs, and optimization
Multi-Account Strategies and Landing Zones:
A multi-account strategy is an AWS best practice for isolating workloads, managing costs, applying security, and granular governance. AWS Organizations allows you to centrally manage multiple AWS accounts that you own. A Landing Zone is a well-architected, multi-account AWS environment that is scalable and secure, serving as a starting point for workload deployment. AWS Control Tower is a service that automates the setup of a landing zone, applying AWS best practices for identity, federated access, and account structure. It creates an Organizational Unit (OU) structure and pre-configured accounts (such as Log Archive and Audit), as well as configuring guardrails (preventive and detective policies).
Service Control Policies (SCPs):
SCPs are a type of policy that you can use to manage permissions in your organization in AWS Organizations. SCPs offer centralized control over the maximum permissions available to all accounts in your organization or in specific OUs. They do not grant permissions; instead, they define the boundaries for the permissions that IAM users and roles can have. For example, you can use an SCP to prevent any account in a specific OU from launching EC2 instances in an unapproved region. It is good practice to test SCPs in test OUs before applying them in production to avoid accidental lockouts.
Advanced IAM:
- Permission Boundaries: A permission boundary is an IAM managed policy that defines the maximum permissions that an identity-based policy can grant to an IAM entity (user or role). It does not grant permissions by itself but acts as a filter for effective permissions. It is useful for delegating the creation of roles and users without granting excessive permissions.
- Attribute-Based Access Control (ABAC): ABAC is an authorization strategy that defines permissions based on attributes. In AWS, these attributes are called tags. You can design permissions that allow or deny access to resources based on tags attached to those resources, or tags attached to the principal making the request. This simplifies permission management in dynamic and large-scale environments.
- Cross-Account Roles: IAM roles can be used to grant secure access between AWS accounts. Instead of sharing long-term access credentials, an IAM role allows one account to trust another account to assume that role and access its resources. This is fundamental for scenarios such as CI/CD deployment, audit access, or resource sharing.
Multi-VPC and Multi-Region Networks:
- AWS Transit Gateway: As mentioned, TGW is the preferred solution for connecting multiple VPCs and on-premises networks, simplifying routing and connectivity in complex architectures. It supports inter-Region connectivity, allowing you to create a global network.
- AWS PrivateLink: Allows you to expose services in your VPC to other VPCs (yours or other customers’) or to on-premises networks, without exposing traffic to the public Internet. This is ideal for creating private and secure connectivity for SaaS services or shared internal services.
- VPC Peering: While useful for a limited number of point-to-point connections, it is not scalable for complex networks due to its non-transitive nature.
- Multi-Region DNS with Amazon Route 53: Route 53 offers advanced DNS routing features, such as latency-based routing, failover routing, and geolocation routing, which are crucial for multi-Region architectures, ensuring high availability and low latency for users.
Observability:
In a multi-account and complex environment, observability is vital. Tools like Amazon CloudWatch (for centralized metrics and logs), AWS X-Ray (for distributed request tracing), AWS CloudTrail (for API call auditing), and AWS Config (for resource configuration compliance monitoring) are essential for a unified view of your applications’ and infrastructure’s behavior and health.
Costs and Optimization:
Managing costs in a multi-account environment requires visibility and control. AWS Cost Explorer and AWS Budgets help monitor and control spending. Applying consistent tags across all resources and accounts is fundamental for allocating costs and identifying optimization opportunities. AWS Compute Optimizer provides recommendations for optimizing compute resource usage, while AWS Trusted Advisor offers insights into cost optimization, performance, security, and fault tolerance.
Domain 2: Design for New Solutions
This domain covers the ability to design new solutions on AWS, focusing on service selection, architectural patterns, resilience, scalability, and performance optimization. It is crucial to understand the characteristics of each service to make informed architectural decisions that meet business and technical requirements.
Topics and Subtopics:
2.1. Architect compute solutions.
Fundamental and Advanced Concepts
AWS offers a wide range of compute services, each optimized for different types of workloads. Choosing the right compute service is a critical architectural decision that directly impacts the scalability, performance, cost, and management of the solution.
- Amazon EC2 (Elastic Compute Cloud): Provides secure, resizable compute capacity in the cloud. It is the foundation for many architectures and offers granular control over instances (type, operating system, storage, network). EC2 is ideal for workloads that require full control over the server environment, such as legacy applications, specific databases, or workloads that need GPUs.
- Spot Instances/Mixed Fleets: Spot Instances allow you to request unused EC2 capacity from AWS at steep discounts. They are ideal for fault-tolerant workloads, such as batch processing, rendering, or testing. Mixed Fleets combine On-Demand, Reserved Instances, and Spot Instances to optimize costs and availability.
- Auto Scaling: Automatically scales EC2 capacity up or down according to demand. Ensures high availability and cost optimization by automatically launching or terminating instances.
- AWS Lambda: A serverless compute service that lets you run code without provisioning or managing servers. You pay only for the compute time consumed. Ideal for short-lived, event-driven functions, such as real-time data processing, API backends, or task automation. Supports various programming languages.
- Containers (Amazon ECS, Amazon EKS, AWS Fargate): AWS offers several options for running containerized workloads:
- Amazon Elastic Container Service (ECS): A fully managed container orchestration service that makes it easy to run, stop, and manage Docker containers on a cluster. It is well integrated with other AWS services.
- Amazon Elastic Kubernetes Service (EKS): A managed Kubernetes service that makes it easy to run Kubernetes applications on AWS without the need to install, operate, and maintain your own Kubernetes control plane. Ideal for those already using Kubernetes or seeking portability.
- AWS Fargate: A serverless compute engine for containers that works with ECS and EKS. With Fargate, you don’t need to provision, configure, or scale virtual machine clusters. You pay only for the compute resources your containers use. Ideal for simplifying container operations.
- AWS Outposts, Wavelength, Local Zones: Extend AWS infrastructure and services to on-premises locations (Outposts), to the 5G network edge (Wavelength), or to metropolitan areas (Local Zones), allowing customers to run applications with low latency for end-users or local equipment.
- AWS Batch: Allows you to run batch computing workloads at scale. It dynamically provisions the optimal amount of compute resources (EC2, Spot) based on workload requirements.
- AWS ParallelCluster: An open-source tool that makes it easy to deploy and manage high-performance computing (HPC) clusters on AWS.
- AWS Serverless Application Model (AWS SAM) and AWS Amplify: Tools to simplify the development and deployment of serverless and web/mobile applications, respectively.
Comparisons Between Services and When to Use Each One
EC2 vs. Lambda vs. Fargate:
Characteristic | Amazon EC2 | AWS Lambda | AWS Fargate |
---|---|---|---|
Abstraction Level | Virtual Servers (IaaS) | Functions (Serverless, FaaS) | Containers (Serverless, CaaS) |
Management | High (OS, patches, scalability) | Low (code only) | Medium (containers only, no infra) |
Scalability | Auto Scaling Groups (minutes) | Automatic (milliseconds) | Automatic (seconds) |
Cost | Per hour/second (instance) | Per execution and compute time | Per vCPU and GB of memory per second |
Use Cases | Legacy applications, DBs, full control | Event-driven functions, stateless APIs | Microservices, containerized applications |
Execution Time | Continuous | Short duration (up to 15 min) | Long duration |
- Use EC2 when: You need full control over the operating system, have legacy applications that cannot be easily containerized or refactored, or need specific instance types (GPUs, HPC).
- Use Lambda when: You have event-driven workloads, short-lived functions, need instant scalability, and don’t want to manage servers. Ideal for API backends, file processing, automation.
- Use Fargate when: You want to run containers without managing the underlying infrastructure. Ideal for microservices, containerized web applications, where container portability is important, but VM management is not desired.
Configurations, Limitations, Best Practices, and Use Cases
Amazon EC2:
- Configurations: Choose instance type (family, size), AMI (Amazon Machine Image), storage (EBS), network (VPC, subnet, Security Groups). Configure Auto Scaling Groups with scaling policies.
- Limitations: Service limits for the number of instances, vCPUs, EBS volumes per region. Costs can scale rapidly if not optimized.
- Best Practices: Use Spot Instances for flexible workloads. Use Auto Scaling for resilience and cost optimization. Use User Data for bootstrap automation. Monitor with CloudWatch.
- Use Cases: Web servers, application servers, relational databases (with RDS or self-managed), Big Data clusters.
AWS Lambda:
- Configurations: Define function (code), runtime, memory, timeout, environment variables, triggers (events). Configure VPC for private resource access.
- Limitation: Execution timeout (15 minutes), deployment package size, maximum memory. Cold starts can impact latency.
- Best Practices: Keep functions small and with a single responsibility. Use Provisioned Concurrency to reduce cold starts in latency-sensitive workloads. Monitor with CloudWatch Logs and X-Ray.
- Use Cases: Event processing (S3, DynamoDB Streams), API backends (with API Gateway), chatbots, infrastructure automation.
AWS Fargate (with ECS or EKS):
- Configurations: Define task definition (container image, vCPU, memory), scaling policies. Integrate with Load Balancers and Service Discovery.
- Limitations: No access to the underlying operating system. Costs can be higher than EC2 for long-running, high-utilization workloads.
- Best Practices: Use optimized and small container images. Implement health checks to ensure availability. Monitor with CloudWatch Container Insights.
- Use Cases: Microservices, containerized APIs, web applications, containerized data processing.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Microservices Architecture: Often implemented with containers (ECS/EKS/Fargate) and Lambda for serverless components. Each service is independent, scalable, and resilient.
- Security Patterns: Use IAM Roles to grant minimum permissions to compute services. Isolate resources in private subnets. Use Security Groups to control network traffic. Encrypt data in transit and at rest.
- Resilience: Distribute workloads across multiple AZs. Use Auto Scaling to handle traffic spikes and instance failures. Implement health checks and circuit breakers.
- Governance: Use AWS Config to monitor configuration compliance. Implement tagging policies for cost and resource tracking.
Migration and Modernization Strategies
- Rehost (Lift-and-Shift): Migrate on-premises VMs to EC2. It is the fastest strategy but does not fully leverage cloud benefits.
- Replatform: Migrate applications to EC2 with some optimizations, such as using RDS for databases or Auto Scaling. May involve containerizing applications for ECS/EKS.
- Refactor/Re-architect: Refactor applications to use serverless services (Lambda, Fargate) and microservices. This strategy leverages the most cloud benefits but is also the most complex and time-consuming.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A web application has unpredictable traffic spikes and needs rapid scalability and optimized costs. The application consists of a static frontend and a RESTful API backend.
- Architectural Decision: Host the static frontend on Amazon S3 and use Amazon CloudFront for content delivery. For the backend, use AWS Lambda with Amazon API Gateway. This provides automatic scalability, high availability, and a pay-per-use cost model.
Common Pitfall: Using EC2 for workloads that would be better suited for Lambda or Fargate, resulting in higher costs and greater management overhead. For example, running a 24/7 web server on EC2 for an application that only receives traffic a few hours a day.
2.2. Architect storage solutions.
Fundamental and Advanced Concepts
AWS offers a wide range of storage services, each designed for different use cases, performance requirements, durability, and cost. Choosing the right storage service is crucial for the performance and economics of a solution.
- Amazon S3 (Simple Storage Service): Highly durable, available, and scalable object storage. Ideal for storing any type of data (images, videos, backups, logs, static website files). Offers different storage classes for cost optimization based on access frequency.
- Storage Classes: S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA (Infrequent Access), S3 One Zone-IA, S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive. Each class has a different cost and retrieval time.
- Access Points: Simplify managing data access in S3 at scale, allowing you to create hundreds of access points with distinct access policies for different applications or users.
- Amazon Elastic File System (EFS): A scalable and elastic network file system (NFS) for EC2 instances and on-premises (via Direct Connect or VPN). Ideal for workloads that need shared file access, such as web applications, content management systems, and development environments.
- Amazon FSx: Offers fully managed file systems for specific workloads:
- FSx for Lustre: For high-performance computing (HPC) workloads that require fast file storage.
- FSx for Windows File Server: For workloads that require Windows-based file shares with SMB (Server Message Block) support.
- FSx for NetApp ONTAP: Offers ONTAP data management features with the agility and scalability of AWS.
- Amazon Elastic Block Store (EBS): Provides persistent block storage volumes for use with EC2 instances. It’s like a virtual hard drive that you can attach to an instance. Ideal for databases, file systems, and any application that needs low-latency, high-performance storage.
- AWS Backup: A centralized, managed backup service that makes it easy to back up data from various AWS services (EBS, RDS, DynamoDB, EFS, FSx, EC2, Storage Gateway) and on-premises.
- Amazon Glacier: An S3 storage class for long-term, low-cost data archiving. Ideal for data that is rarely accessed but needs to be retained for years or decades.
- AWS Storage Gateway: Connects on-premises environments with AWS cloud storage, offering file, volume, and tape gateways for different hybrid storage use cases.
- AWS Snow Family: Physical devices (Snowcone, Snowball, Snowmobile) for transferring large volumes of data into and out of AWS, or for running compute at the edge in disconnected or remote environments.
Comparisons Between Services and When to Use Each One
Characteristic | Amazon S3 | Amazon EFS | Amazon FSx | Amazon EBS |
---|---|---|---|---|
Storage Type | Object | File (NFS) | File (Lustre, SMB, ONTAP) | Block |
Use Cases | Unstructured data, backups, static files | File sharing, CMS, dev/test | HPC, Windows file shares, ONTAP features | Databases, EC2 OS, application logs |
Access | HTTP/HTTPS, SDKs, APIs | NFSv4.1 | Lustre, SMB, ONTAP | Attached to a single EC2 instance |
Scalability | Virtually unlimited | Petabytes, elastic | Petabytes, scalable | Limited to volume size |
Performance | Variable (depends on class) | Variable (depends on mode) | High (Lustre), Medium (Windows) | High (depends on volume type) |
Durability | 11 nines (99.999999999%) | 11 nines (99.999999999%) | High (depends on type) | High (depends on volume type) |
Availability | High (99.9% to 99.99%) | Multi-AZ | Multi-AZ (Windows, ONTAP), Single-AZ (Lustre) | Single-AZ (replicated within AZ) |
- S3 vs. EFS vs. FSx: The main difference is the access model. S3 is for objects (API access), EFS and FSx are for files (file system access). The choice between EFS and FSx depends on the file system type and specific features needed (generic NFS vs. Lustre/SMB/ONTAP).
- S3 vs. Glacier: Glacier is an S3 storage class, optimized for long-term archiving at very low costs, but with longer retrieval times. S3 Standard is for frequent access, while IA classes are for infrequent access.
Configurations, Limitations, Best Practices, and Use Cases
Amazon S3:
- Configurations: Create buckets, define bucket policies, configure lifecycle policies for transitioning between storage classes, configure versioning, encryption (SSE-S3, SSE-KMS, SSE-C).
- Limitations: Maximum object size (5 TB), request rate limits per prefix.
- Best Practices: Use lifecycle policies to optimize costs. Enable versioning for protection against accidental deletion. Use Access Points to manage access at scale. Use encryption by default.
- Use Cases: Data lakes, backups, static website hosting, log storage, content distribution.
Amazon EFS:
- Configurations: Create an EFS file system, define performance modes (General Purpose, Max I/O), throughput modes (Bursting, Provisioned), create mount targets in your VPCs.
- Limitations: Slightly higher latency than EBS for block access. Cost can be higher for workloads that don’t need sharing.
- Best Practices: Use General Purpose mode for most workloads. Use Max I/O mode for high-concurrency workloads. Monitor throughput and burst credits usage.
- Use Cases: User home directories, shared development environments, content management systems (CMS), Big Data workloads.
Amazon EBS:
- Configurations: Choose volume type (gp2/gp3, io1/io2, st1, sc1), size, IOPS/throughput. Create snapshots for backup. Attach to EC2 instances.
- Limitations: Attached to a single EC2 instance at a time. Performance varies with volume type.
- Best Practices: Use gp3 for most workloads. Use io2 Block Express for high-performance workloads. Create snapshots regularly for disaster recovery. Encrypt EBS volumes by default.
- Use Cases: EC2 instance boot volumes, databases, application logs, data volumes for applications requiring persistent, low-latency storage.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Data Lake on S3: Store raw and processed data on S3, using different storage classes for cost optimization. Integrate with services like AWS Glue, Amazon Athena, and Amazon Redshift Spectrum for analysis.
- Shared Storage for Web Applications: Use EFS to share content files between multiple EC2 web servers in an Auto Scaling Group.
- Security Patterns: Use S3 bucket policies, ACLs, and access point policies to control access. Use IAM Roles for access to storage services. Encrypt data at rest (SSE-S3, SSE-KMS) and in transit (SSL/TLS). Use AWS PrivateLink for private access to S3 from within the VPC.
- Resilience: S3, EFS, and FSx for Windows/ONTAP are designed for high durability and multi-AZ availability. For EBS, use snapshots for backup and replication across AZs for disaster recovery.
- Governance: Use AWS Config to monitor storage configuration compliance. Implement lifecycle policies to manage data lifecycle and optimize costs.
Migration and Modernization Strategies
- Data Migration: Use AWS DataSync to transfer large volumes of data from on-premises to S3, EFS, or FSx. Use the AWS Snow Family for offline petabyte-scale data migrations.
- Modernization: Replace on-premises file servers with EFS or FSx. Migrate databases to RDS or DynamoDB, which use EBS and S3 internally. Refactor applications to use S3 for object storage instead of local file systems.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A Big Data analytics application generates large volumes of data that need to be stored cost-effectively and accessed occasionally for historical reports.
- Architectural Decision: Store data in Amazon S3 with lifecycle policies to transition older data to S3 Intelligent-Tiering and eventually to S3 Glacier Deep Archive. This optimizes storage costs while keeping data accessible when needed.
Common Pitfall: Using S3 Standard for data that is rarely accessed, resulting in unnecessarily high storage costs. Or, conversely, using Glacier for data that needs quick retrieval, incurring high retrieval costs and delays.
2.3. Architect database solutions.
Fundamental and Advanced Concepts
AWS offers a wide variety of database services, covering relational, NoSQL, data warehousing, graph, ledger, time-series, and blockchain models. Choosing the right database is fundamental for the performance, scalability, resilience, and cost of an application.
- Amazon Relational Database Service (RDS): A managed relational database service that makes it easy to set up, operate, and scale relational databases in the cloud. Supports various database engines (MySQL, PostgreSQL, Oracle, SQL Server, MariaDB).
- Amazon Aurora: A MySQL and PostgreSQL-compatible relational database built for the cloud. It combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Offers up to 5x the throughput of standard MySQL and 3x the throughput of standard PostgreSQL.
- Amazon DynamoDB: A fully managed, key-value and document NoSQL database service that delivers single-digit millisecond performance at any scale. Ideal for applications requiring low latency and high throughput, such as gaming, ad tech, and IoT.
- Amazon DocumentDB (with MongoDB compatibility): A fully managed document database service that supports MongoDB workloads. Compatible with MongoDB APIs, allowing you to use existing MongoDB code, drivers, and tools.
- Amazon Neptune: A fast, reliable, fully managed graph database service. Ideal for building and running applications that work with highly connected datasets, such as social networks, recommendation engines, and fraud detection.
- Amazon Redshift: A fast, fully managed, petabyte-scale data warehouse service. Ideal for analyzing large volumes of structured and semi-structured data, enabling complex queries and reporting.
- Amazon ElastiCache: A fully managed in-memory caching service that supports Redis and Memcached. Used to accelerate the performance of web and mobile applications by retrieving information from high-speed in-memory caches instead of relying on slower disk-based databases.
- Amazon Quantum Ledger Database (QLDB): A fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log. Ideal for systems that need a complete and verifiable record of all data changes, such as financial records or supply chain.
- Amazon Managed Blockchain: A fully managed service for creating and managing scalable blockchain networks using popular frameworks like Hyperledger Fabric and Ethereum.
- Amazon Keyspaces (for Apache Cassandra): A serverless, wide-column database service compatible with Apache Cassandra. Ideal for Cassandra workloads that need simplified scalability and management.
- Amazon Timestream: A fast, scalable, and serverless time-series database service. Ideal for IoT and operational applications that collect high-speed time-series data.
- AWS Database Migration Service (DMS): Helps migrate databases to AWS quickly and securely. Supports homogeneous (e.g., Oracle to Oracle) and heterogeneous (e.g., Oracle to PostgreSQL) migrations.
- AWS Schema Conversion Tool (SCT): Helps convert the database schema and application code from a source database to a different target database.
Comparisons Between Services and When to Use Each One
Amazon RDS vs. Amazon Aurora vs. Amazon DynamoDB:
Characteristic | Amazon RDS | Amazon Aurora | Amazon DynamoDB |
---|---|---|---|
Database Type | Relational (SQL) | Relational (SQL) - MySQL/PostgreSQL compatible | NoSQL (Key-Value, Document) |
Scalability | Vertical (larger instance), Read (replicas) | Read (up to 15 replicas), Automatic storage | Horizontal (virtually unlimited) |
Performance | Good, depends on instance type | High (up to 5x MySQL, 3x PostgreSQL) | Single-digit millisecond at any scale |
Cost | Per hour/second (instance, storage) | Per hour/second (instance, storage) | Per provisioned/on-demand throughput, storage |
Use Cases | Traditional web applications, ERPs, CRMs | High-performance applications, OLTP | High-throughput applications, IoT, gaming, microservices |
Management | Managed (OS, patches, backups) | Fully managed, self-healing | Fully managed, serverless |
- Use RDS when: You need a traditional relational database and want the convenience of a managed service without worrying about the underlying infrastructure. Ideal for most web and enterprise applications that benefit from a fixed schema and ACID transactions.
- Use Aurora when: You need a relational database with high performance, scalability, and availability, surpassing standard RDS. It’s a good choice for demanding OLTP (Online Transaction Processing) workloads.
- Use DynamoDB when: You need a NoSQL database with single-digit millisecond performance at any scale, without worrying about server provisioning. Ideal for applications with large data volumes and flexible access patterns, such as session data, user profiles, IoT.
Other Comparisons:
- DocumentDB vs. DynamoDB: Both are NoSQL. DocumentDB is for MongoDB-compatible document workloads, while DynamoDB is more flexible for key-value and documents, with guaranteed performance at scale.
- Redshift vs. RDS/Aurora: Redshift is a data warehouse for analytics (OLAP), while RDS/Aurora are transactional databases (OLTP). They are not substitutes but complementary.
- ElastiCache vs. Databases: ElastiCache is an in-memory cache to accelerate access to frequently accessed data, reducing the load on the primary database.
Configurations, Limitations, Best Practices, and Use Cases
Amazon RDS:
- Configurations: Choose engine, instance type, storage, Multi-AZ, read replicas, parameter and option groups.
- Limitations: Vertical scalability limited by instance type. Read replicas are asynchronous.
- Best Practices: Use Multi-AZ for high availability. Use read replicas to scale read workloads. Monitor performance metrics and optimize queries.
- Use Cases: Web and mobile applications, content management systems, CRMs, ERPs.
Amazon Aurora:
- Configurations: Choose engine (MySQL/PostgreSQL), instance type, configure read replicas, Aurora Serverless for on-demand scalability.
- Limitations: MySQL/PostgreSQL compatible, but not 100% identical. Costs can be higher than standard RDS for low-utilization workloads.
- Best Practices: Use the reader endpoint to distribute read traffic among replicas. Use Aurora Serverless for intermittent or unpredictable workloads. Monitor with CloudWatch.
- Use Cases: High-performance applications, gaming, SaaS, e-commerce.
Amazon DynamoDB:
- Configurations: Define tables, primary keys (partition and sort key), attributes, capacity modes (provisioned or on-demand), global/local secondary indexes.
- Limitations: Not a relational database. Does not natively support complex joins or multi-table transactions (although there are transactions for multiple items/tables in the same account/region).
- Best Practices: Design primary keys well to distribute workload. Use on-demand mode for unpredictable workloads. Use DynamoDB Accelerator (DAX) for in-memory cache. Enable DynamoDB Streams for event processing.
- Use Cases: User profiles, session data, shopping carts, gaming, IoT, stateless microservices.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Data-Driven Architecture: Choose the right database for each workload, combining different types (relational for transactions, NoSQL for flexible data, data warehouse for analytics).
- Security Patterns: Use IAM Roles for database access. Encrypt data at rest (KMS) and in transit (SSL/TLS). Use Security Groups to control network access. Implement the principle of least privilege.
- Resilience: Use Multi-AZ for RDS and Aurora for high availability. DynamoDB is inherently distributed and resilient. Use regular backups and snapshots.
- Governance: Use AWS Config to monitor database configuration compliance. Implement tagging policies for cost and resource tracking.
Migration and Modernization Strategies
- Database Migration: Use AWS DMS to migrate existing databases to AWS, with support for homogeneous and heterogeneous migrations. Use AWS SCT to convert schemas.
- Modernization: Refactor applications to use serverless databases like DynamoDB or Aurora Serverless. Migrate from self-managed databases to managed services like RDS or Aurora to reduce operational overhead.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: An e-commerce application needs a database to store product and order information. The application expects a high volume of transactions and needs high availability and scalability.
- Architectural Decision: For product and order information requiring a relational schema and ACID transactions, Amazon Aurora (MySQL or PostgreSQL compatible) with Multi-AZ and read replicas would be an excellent choice due to its high performance and scalability. For user session data or shopping carts (which can be more flexible and high-throughput), Amazon DynamoDB would be a good option.
Common Pitfall: Trying to force a relational database for a NoSQL use case (or vice versa). For example, using RDS to store high-ingestion IoT data, which can lead to performance and scalability issues. Or using DynamoDB for data that requires complex joins and multi-table transactions, which can complicate development and management.
2.4. Architect networking solutions.
Fundamental and Advanced Concepts
Networking solutions on AWS are the foundation for connecting all applications and services. Domain 2 deepens networking concepts, including VPCs, global connectivity, and content delivery services.
- Amazon VPC (Virtual Private Cloud): The fundamental service for creating your isolated virtual network on AWS. Allows you to define subnets (public and private), route tables, internet gateways, NAT gateways, and VPC endpoints. It is the foundation for network security and isolation.
- VPC Endpoints (Interface and Gateway): Allow you to connect to AWS services (like S3 and DynamoDB) and VPC endpoint services powered by PrivateLink without using an internet gateway, a NAT device, or a VPN connection. This keeps traffic within the AWS network, increasing security and reducing latency.
- AWS Transit Gateway: (Already covered in Domain 1) Acts as a central hub to connect VPCs and on-premises networks, simplifying routing and connectivity in complex and multi-account architectures.
- AWS PrivateLink: (Already covered in Domain 1) Allows you to expose services in your VPC to other VPCs or to on-premises networks privately and securely, without exposing traffic to the public Internet.
- VPC Peering: (Already covered in Domain 1) Connects two VPCs directly, allowing them to communicate as if they were in the same network. Ideal for a limited number of point-to-point connections.
- AWS Direct Connect and AWS VPN: (Already covered in Domain 1) Solutions for hybrid connectivity between on-premises and AWS, offering private and secure connections, respectively.
- Amazon Route 53: A scalable and highly available web DNS (Domain Name System) service. In addition to name resolution, it offers advanced routing features:
- Multi-Region DNS: Allows routing traffic to resources in different regions based on latency, geolocation, weighting, or resource health, crucial for high availability and disaster recovery.
- Routing Policies: Include simple routing, weighted, latency-based, failover, geolocation, multi-value answer, and proximity-based (with Global Accelerator).
- Amazon CloudFront: A fast content delivery network (CDN) service that securely delivers data, videos, applications, and APIs globally, with low latency and high transfer speeds. It caches content at edge locations (Points of Presence) close to users.
- Signed URLs and Signed Cookies: Allow you to control access to CloudFront content, granting temporary access to specific users. Useful for premium or private content.
- Geo-restrictions: Allow you to control which countries can access your CloudFront content.
- AWS Global Accelerator: A networking service that improves the availability and performance of your applications for global users. It routes traffic to the healthiest and closest application endpoint using the AWS global network, optimizing performance compared to the public Internet.
- AWS App Mesh: A service mesh that provides network visibility and control for your microservices-based applications. It standardizes how microservices communicate, providing features like traffic routing, retries, circuit breaking, and metrics.
- AWS Cloud Map: A resource discovery service that allows your applications to discover and connect to cloud services dynamically. It maintains a registry of all your application resources and their locations, and automatically updates them.
- VPC Lattice: A new service that simplifies connectivity, security, and observability between services in a microservices architecture. It allows you to define access and routing policies at a service level, regardless of the underlying network infrastructure. Facilitates communication between services in different VPCs and accounts.
Comparisons Between Services and When to Use Each One
AWS Transit Gateway vs. AWS PrivateLink vs. VPC Peering:
Characteristic | VPC Peering | AWS Transit Gateway | AWS PrivateLink |
---|---|---|---|
Connectivity | Point-to-point (VPC to VPC) | Hub-and-spoke (VPC to VPC, on-premises) | Private endpoint for services |
Transit Routing | Not supported | Supported | Not applicable (service access, not network) |
Scalability | Low (complex mesh) | High (central hub) | High (for service access) |
Complexity | Low for few VPCs | Medium for many VPCs | Medium (endpoint and service configuration) |
Cost | Low for few VPCs | Based on attachments and traffic | Based on endpoints and traffic |
Use Cases | Connect 2-3 VPCs | Enterprise networks, multi-account, hybrid | Private access to SaaS, internal services |
- CloudFront vs. Global Accelerator: Both improve performance and availability, but at different layers. CloudFront is a CDN for web content (Layer 7 - HTTP/HTTPS), optimizing static and dynamic content delivery. Global Accelerator operates at the network layer (Layer 4 - TCP/UDP), optimizing traffic routing for applications, regardless of the protocol. Use CloudFront for web content and Global Accelerator for applications that are not HTTP/HTTPS based or require static IP addresses.
Configurations, Limitations, Best Practices, and Use Cases
Amazon VPC:
- Configurations: Define CIDR blocks, create subnets (public/private), configure route tables, internet gateways, NAT Gateways, VPC Endpoints. Use Security Groups and NACLs for traffic control.
- Limitations: Limits on CIDR blocks, number of VPCs, subnets, Security Groups per region/account.
- Best Practices: Use a VPC model with public and private subnets. Use NAT Gateways for outbound access from private subnets. Use VPC Endpoints for secure access to AWS services. Monitor Flow Logs for traffic visibility.
- Use Cases: Environment isolation (dev, test, prod), multi-tier application hosting, hybrid networks.
Amazon Route 53:
- Configurations: Create hosted zones, DNS records (A, CNAME, MX, TXT, etc.), configure routing policies (simple, weighted, latency, failover, geolocation).
- Limitations: Limits on hosted zones and records per zone.
- Best Practices: Use failover routing policies for high availability. Use latency-based routing to optimize global performance. Use health checks to monitor endpoint health.
- Use Cases: DNS resolution for domains, global traffic routing, application failover.
Amazon CloudFront:
- Configurations: Create distributions, define origins (S3, EC2, Load Balancers), configure cache behaviors, SSL/TLS, signed URLs/Cookies, geo-restrictions.
- Limitations: Data transfer costs can be significant. Cache invalidations can take time.
- Best Practices: Use to deliver static and dynamic content. Optimize cache settings to improve performance and reduce costs. Use signed URLs/Cookies for restricted content.
- Use Cases: Web and application acceleration, video streaming, API distribution, software content delivery.
AWS Global Accelerator:
- Configurations: Create an accelerator, add listeners and endpoint groups (with EC2 instances, Load Balancers, Elastic IPs). Global Accelerator provides two static IP addresses.
- Limitations: Not a CDN. Does not cache content.
- Best Practices: Use for applications requiring high availability and global performance, especially for non-HTTP/HTTPS traffic. Use with health checks to route to healthy endpoints.
- Use Cases: Online gaming, VoIP, financial applications, applications requiring global static IP addresses.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Hub-and-Spoke Network Architecture: With Transit Gateway as the central hub, connecting VPCs from different environments (production, development, security) and on-premises networks.
- Security Patterns: Implement the principle of least privilege for network access. Use Security Groups and NACLs. Use AWS WAF and Shield for web application protection. Monitor traffic with VPC Flow Logs and CloudWatch.
- Resilience: Design multi-AZ and multi-Region networks. Use Load Balancers to distribute traffic. Implement DNS failover with Route 53.
- Governance: Use AWS Organizations to manage network accounts. Use AWS Config to monitor network configuration compliance. Use AWS Network Manager to manage and monitor your global network.
Migration and Modernization Strategies
- Network Migration: Migrate on-premises networks to VPCs using Direct Connect or VPN. Consolidate complex networks with VPC Peering into a Transit Gateway architecture.
- Modernization: Adopt VPC Lattice to simplify microservices connectivity and security. Use CloudFront and Global Accelerator to optimize global application delivery. Refactor applications to use VPC Endpoints for private access to AWS services.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A company has a global web application that needs to deliver static and dynamic content with low latency to users worldwide, plus a backend API that needs high performance and availability.
- Architectural Decision: Use Amazon CloudFront to deliver static (images, CSS, JS) and dynamic (HTML) content. Configure CloudFront with multiple origins in different regions. Use AWS Global Accelerator to route traffic to the closest and healthiest API endpoint, ensuring low latency and high availability. CloudFront can be configured to use Global Accelerator as an origin for the API.
Common Pitfall: Not optimizing DNS routing or content delivery for global users, resulting in high latency and poor user experience. Confusing the purpose of CloudFront and Global Accelerator and using one where the other would be more appropriate.
2.5. Architect messaging and integration solutions.
Fundamental and Advanced Concepts
Messaging and integration services are crucial for building distributed, decoupled, and scalable applications. They allow different components of an application to communicate asynchronously, increasing resilience and flexibility.
- Amazon Simple Notification Service (SNS): A fully managed pub/sub (publish/subscribe) messaging service. Allows sending messages to a large number of subscribers (applications, users, other AWS services) asynchronously. Ideal for notifications, event fanout, and microservices communication.
- Amazon Simple Queue Service (SQS): A fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. Messages are stored in a queue until a consumer processes them. Supports standard queues (high throughput, best-effort ordering) and FIFO (First-In, First-Out) queues (guarantees ordering and exactly-once processing).
- Amazon EventBridge: A serverless event bus that makes it easy to connect applications using data from your own applications, software as a service (SaaS), and AWS services. It allows routing events from various sources to targets such as AWS Lambda, Amazon SQS, Amazon SNS, and others. Ideal for event-driven architectures and integration between systems.
- AWS Step Functions: A serverless workflow service that lets you orchestrate Lambda functions and other AWS services to build complex applications. It manages state, error handling, and retries, making it easier to build distributed workflows.
- Amazon Kinesis: A platform for real-time data streaming. Includes:
- Kinesis Data Streams: For collecting and processing large streams of data in real time.
- Kinesis Data Firehose: For delivering data streams to destinations such as S3, Redshift, Splunk, and others.
- Kinesis Data Analytics: For processing and analyzing streaming data in real time using SQL or Apache Flink.
- Amazon Managed Streaming for Apache Kafka (MSK): A fully managed service for Apache Kafka, making it easy to build and run applications that use Apache Kafka for real-time data processing.
- AWS AppSync: A fully managed GraphQL service that makes it easy to develop applications with real-time and offline data. It handles secure connection to data sources, such as DynamoDB, Lambda, and relational databases.
- Amazon API Gateway: A fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. Acts as a front-door for applications, allowing clients to access data, business logic, or functionality from your backend services.
- AWS IoT Core: A managed service that enables connected devices to securely and easily interact with cloud applications and other devices. Ideal for Internet of Things solutions.
Comparisons Between Services and When to Use Each One
SNS vs. SQS vs. EventBridge:
Characteristic | Amazon SNS | Amazon SQS | Amazon EventBridge |
---|---|---|---|
Model | Publish/Subscribe (Pub/Sub) | Message Queue | Event Bus |
Delivery | Push (to multiple subscribers) | Pull (by one consumer) | Event routing |
Use Cases | Notifications, fanout, microservices | Decoupling, asynchronous processing | Event-driven architectures, SaaS integration |
Durability | Messages delivered to active subscribers | Messages persistent in queue | Events routed to targets |
Ordering | Not guaranteed (standard) | Not guaranteed (standard), FIFO (guaranteed) | Not guaranteed |
Retries | Retry policies for HTTP/S endpoints | Configured on consumer | Configured in target rules |
- Use SNS when: You need to send the same message to multiple subscribers asynchronously. Ideal for notifications (SMS, email, push), event fanout to multiple downstream systems, or to trigger multiple Lambda functions in parallel.
- Use SQS when: You need to decouple application components, process messages asynchronously, and ensure each message is processed once (with FIFO queues). Ideal for work queues, batch processing, or handling traffic spikes.
- Use EventBridge when: You need to build event-driven architectures, integrate applications with AWS services or SaaS, or intelligently route events based on rules. It is more powerful than SNS for complex event routing and filtering.
Kinesis Data Streams vs. SQS:
- Kinesis Data Streams: Ideal for real-time data processing where record order is important and you need multiple consumers to process the same data stream. E.g., clickstreams, IoT data.
- SQS: Ideal for decoupling components and processing messages asynchronously, where each message is processed by a single consumer (or a group of consumers for standard queues). E.g., task queues, order processing.
Configurations, Limitations, Best Practices, and Use Cases
Amazon SNS:
- Configurations: Create topics, add subscribers (Lambda, SQS, HTTP/S, email, SMS), define access policies. Configure dead-letter queues (DLQ) for undelivered messages.
- Limitations: Maximum message size (256 KB). Does not guarantee delivery order for standard topics.
- Best Practices: Use for event fanout. Integrate with SQS for durability and asynchronous processing. Use DLQs to handle delivery failures.
- Use Cases: System notifications, S3 event fanout, triggering multiple microservices, alerts.
Amazon SQS:
- Configurations: Create queues (standard or FIFO), configure message visibility, delay, retention, access policies. Configure DLQs.
- Limitations: Maximum message size (256 KB). FIFO queues have lower throughput than standard queues.
- Best Practices: Use FIFO queues for workloads requiring ordering and exactly-once processing. Use standard queues for high throughput. Monitor the number of visible and invisible messages.
- Use Cases: Work queues, order processing, microservices decoupling, batch data processing.
Amazon EventBridge:
- Configurations: Create event buses, define rules with event patterns and targets. Use Schema Registry to manage event schemas.
- Limitations: Limits on the number of rules and targets per rule.
- Best Practices: Use to build event-driven architectures. Use rules to filter and route events to specific targets. Use Schema Registry to ensure event compatibility.
- Use Cases: Integration between AWS services, SaaS integration, event auditing, workflow automation.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Event-Driven Architecture: Use EventBridge as the central hub for events, with producers publishing events and consumers reacting to them. This promotes decoupling and scalability.
- Security Patterns: Use IAM Roles to grant minimum permissions to message producers and consumers. Encrypt messages in transit and at rest. Use access policies to control who can publish or consume messages.
- Resilience: Use message queues (SQS) to absorb traffic spikes and ensure message delivery. Use DLQs to handle messages that cannot be processed. Implement retries and exponential backoff.
- Governance: Use AWS CloudTrail to audit API operations on messaging services. Use AWS Config to monitor configuration compliance.
Migration and Modernization Strategies
- Migration: Replace on-premises messaging systems with SQS and SNS. Migrate legacy integration systems to EventBridge for an event-driven approach.
- Modernization: Refactor monolithic applications into microservices that communicate via events and queues. Adopt Step Functions to orchestrate complex and long-running workflows.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: An e-commerce application needs to process orders asynchronously. After an order is placed, several actions need to be executed: update inventory, send confirmation email, notify the logistics system, and update the CRM.
- Architectural Decision: The order service publishes a message to an SQS queue. A consumer (e.g., a Lambda function) reads the message from the queue and orchestrates subsequent actions using AWS Step Functions. Step Functions can call Lambda functions to update inventory and CRM, and publish to an SNS topic to send the confirmation email and notify the logistics system. This ensures robust, scalable, and fault-tolerant order processing.
Common Pitfall: Using SNS to decouple components when SQS would be more appropriate, resulting in lost messages if subscribers are unavailable. Or, conversely, using SQS for event fanout, which would require each consumer to poll the queue, instead of receiving the message via push.
Domain 3: Continuous Improvement for Existing Solutions
This domain focuses on optimizing and enhancing existing architectures to improve performance, resilience, security, and operational efficiency. The focus is on ensuring that AWS solutions continue to meet evolving business requirements and industry best practices.
Topics and Subtopics:
3.1. Architect solutions for high availability and resilience.
Fundamental and Advanced Concepts
High availability and resilience are critical pillars of any robust cloud architecture. AWS offers a range of services and patterns to ensure applications remain operational even in the face of failures.
- Disaster Recovery (DR) Strategies: Disaster recovery involves preparing for the recovery from an event that disrupts business operations. DR strategies are classified based on two main objectives:
- Recovery Time Objective (RTO): The maximum acceptable time to restore service after a disaster. A low RTO means the service must be restored quickly.
- Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. A low RPO means little to no data loss is tolerated.
DR strategies on AWS include:
- Backup and Restore: The simplest and lowest-cost method. Data is copied to another region or account and restored in case of disaster. RTO and RPO are higher.
- Pilot Light: A minimal version of the application is kept running in a DR region. In case of disaster, resources are scaled up to restore full functionality. RTO and RPO are lower than backup and restore.
- Warm Standby: A scaled-down version of the application is always running in the DR region, ready to take over the load. RTO and RPO are even lower.
- Multi-Site Active-Active (Multi-Region Active-Active): The application runs simultaneously in multiple regions, with traffic routed to both. Offers the lowest RTO and RPO (near zero), but is the most expensive and complex strategy.
- AWS Well-Architected Framework (Reliability Pillar): This pillar provides guidelines for designing systems that can recover from failures and continue to function as expected. It covers topics such as disaster recovery, design for failure, change management, and monitoring.
- Amazon Route 53: In addition to being a DNS service, Route 53 is fundamental for resilience, offering:
- Failover Routing: Routes traffic to an alternative resource if the primary resource becomes unhealthy.
- Latency-Based Routing: Routes traffic to the AWS region that offers the lowest latency for the user.
- AWS Global Accelerator: Improves the availability and performance of applications for global users, routing traffic to the healthiest and closest application endpoint using the AWS global network.
- Elastic Load Balancing (ELB): Distributes incoming traffic across multiple target instances in multiple Availability Zones, increasing availability and fault tolerance.
- Amazon EC2 Auto Scaling: Automatically adjusts the number of EC2 instances in your Auto Scaling group in response to defined conditions, ensuring the application has the necessary capacity to handle demand and recovers from instance failures.
- Amazon CloudFront: Acts as a CDN, caching content at edge locations close to users, which improves performance and resilience by reducing the load on origin servers.
- AWS Shield and AWS WAF: Security services that protect against DDoS attacks (Shield) and provide application-level protection against common web exploits (WAF).
- Resilience Patterns:
- Circuit Breaker: Prevents an application from repeatedly trying an operation that is likely to fail, protecting the system from overload and allowing it to recover.
- Sagas: A pattern for managing distributed transactions, ensuring data consistency in a microservices system, even when operations fail.
Comparisons Between Services and When to Use Each One
DR strategies (Backup and Restore, Pilot Light, Warm Standby, Multi-Site Active-Active) are choices that directly depend on RTO and RPO requirements. The lower the desired RTO and RPO, the more complex and expensive the strategy will be.
Configurations, Limitations, Best Practices, and Use Cases
DR Strategies:
- Configurations: Involve data replication (via S3 Cross-Region Replication, DynamoDB Global Tables, RDS Read Replicas, or DMS), infrastructure provisioning in the DR region, and DNS routing configuration (Route 53) for failover.
- Limitations: Cost and complexity increase significantly as RTO and RPO decrease.
- Best Practices: Define clear RTO and RPO for each application. Regularly test your DR plan. Automate the failover and recovery process whenever possible.
- Use Cases:
- Backup and Restore: File data, non-critical database backups.
- Pilot Light/Warm Standby: Critical business applications that can tolerate some minutes or hours of downtime.
- Multi-Site Active-Active: Mission-critical applications requiring continuous high availability (e.g., financial services, high-volume e-commerce).
Amazon EC2 Auto Scaling:
- Configurations: Define an Auto Scaling group, specify minimum, desired, and maximum size, configure scaling policies (simple, step-based, target tracking), and health checks.
- Limitations: Scalability takes time for new instances to be provisioned and initialized.
- Best Practices: Use CPU utilization or Load Balancer request metrics to scale. Use warm pools to reduce the startup time of new instances. Monitor the Auto Scaling group with CloudWatch.
- Use Cases: Web applications, microservices, any workload that needs elastic scalability and resilience to instance failures.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Multi-AZ Architecture: Deploy resources in at least two Availability Zones within a region to protect against AZ failures. Use Load Balancers and Auto Scaling to distribute traffic.
- Multi-Region Architecture: For the most stringent RTO/RPO requirements, deploy the application in multiple regions. Use services like Route 53 and Global Accelerator for traffic routing and failover.
- Security Patterns: Implement the principle of least privilege. Use Security Groups and NACLs. Encrypt data at rest and in transit. Use AWS Shield and WAF for edge protection.
- Resilience: Design for failures. Implement timeouts, retries, and circuit breakers. Use queues and messaging to decouple components. Regularly test resilience with Chaos Engineering.
- Governance: Use AWS Config to monitor compliance of resilience configurations. Use AWS Resilience Hub to assess and manage the resilience of your applications.
Migration and Modernization Strategies
- Modernization for Resilience: When modernizing, refactor applications to be stateless and use managed services that are inherently resilient (e.g., DynamoDB Global Tables, Aurora Multi-Master). Adopt microservices patterns to isolate failures.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A mission-critical video streaming application needs an RTO of seconds and an RPO of zero for its user session data.
- Architectural Decision: Implement a Multi-Site Active-Active DR strategy using DynamoDB Global Tables for session data (which offers active-active multi-Region replication with near-zero RPO) and Amazon Route 53 with failover routing to direct traffic to the healthy region.
Common Pitfall: Not regularly testing the disaster recovery plan. An untested DR plan is a DR plan that is likely to fail when most needed. Another pitfall is overestimating the resilience of a service without understanding its limitations (e.g., believing that Multi-AZ for RDS protects against entire region failures).
3.2. Architect solutions for performance optimization.
Fundamental and Advanced Concepts
Performance optimization on AWS involves selecting and configuring resources to ensure applications respond quickly and process data efficiently, while balancing with costs. The Performance Efficiency Pillar of the AWS Well-Architected Framework provides guidelines for this.
- AWS Well-Architected Framework (Performance Efficiency Pillar): This pillar focuses on the ability to use computing resources efficiently to meet system requirements and maintain that efficiency as demand changes. It covers topics such as resource selection, monitoring, network, and storage optimization.
- Compute Optimization:
- Instance Selection: Selecting the correct EC2 instance type (family, size) for the workload. Instances optimized for compute, memory, storage, or network.
- Container Optimization: Optimizing container images to be small and efficient. Using Fargate to abstract server management.
- Lambda Provisioned Concurrency: Pre-initializes Lambda function instances to reduce cold starts, improving latency for time-sensitive applications.
- Storage Optimization:
- S3 Class Selection: Using the appropriate S3 storage class (Standard, IA, Glacier) based on access frequency to optimize costs and retrieval performance.
- EBS Optimization: Selecting the correct EBS volume type (gp3, io2 Block Express) and provisioning adequate IOPS/throughput for the workload.
- EFS/FSx Optimization: Choosing the correct performance and throughput modes for EFS, and the appropriate FSx type for the use case (Lustre for HPC, Windows for SMB).
- Database Optimization:
- DB Type Selection: Selecting the database type (relational, NoSQL, etc.) that best suits the application’s access patterns and scalability requirements.
- Query Optimization: Optimizing SQL and NoSQL queries to reduce execution time and resource consumption.
- Caching with ElastiCache: Using Amazon ElastiCache (Redis or Memcached) to cache frequently accessed data, reducing the load on the database and improving latency.
- Network Optimization:
- Direct Connect: For high-bandwidth, low-latency on-premises connections.
- Global Accelerator: For optimized global traffic routing for non-HTTP/HTTPS applications.
- CloudFront: For low-latency, high-throughput content delivery via CDN.
- Caching: Implementing caching strategies at different layers of the architecture (CDN with CloudFront, application cache with ElastiCache, object cache in S3 for static data).
- Content Delivery Networks (CDNs): Geographically distributed networks of servers that deliver web and multimedia content to users based on their location, improving loading speed and reducing the load on origin servers.
Comparisons Between Services and When to Use Each One
- ElastiCache (Redis vs. Memcached):
- Redis: More feature-rich, supports complex data structures (lists, sets, hashes), data persistence, replication, and clustering. Ideal for caching, leaderboards, counters, message queues.
- Memcached: Simpler, focused on object caching. Ideal for simple data caching and horizontal scalability.
Configurations, Limitations, Best Practices, and Use Cases
Compute Optimization (EC2):
- Configurations: Monitor CPU, memory, and network utilization. Use AWS Compute Optimizer for right-sizing recommendations.
- Limitations: Over-provisioning leads to unnecessary costs; under-provisioning leads to performance issues.
- Best Practices: Use the most suitable instance type for the workload. Use Auto Scaling to dynamically adjust capacity. Consider instances with optimized processors (Graviton2/3) for better performance/cost.
- Use Cases: High-traffic web applications, microservices, data processing.
Amazon ElastiCache:
- Configurations: Choose engine (Redis/Memcached), node type, number of nodes, Multi-AZ, security groups.
- Limitations: Cache is volatile (data can be lost). Requires application logic to manage the cache.
- Best Practices: Store frequently accessed data that does not change often. Implement cache-aside or write-through strategies. Monitor cache hit and miss rates.
- Use Cases: Caching database query results, user session data, game leaderboards, news feeds.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Layered Caching Architecture: Combine CDN (CloudFront) for edge caching, ElastiCache for application/database caching, and object caching in S3 for static data.
- Security Patterns: Use IAM Roles for access to caching services. Encrypt data in transit and at rest in the cache (Redis supports encryption).
- Resilience: Use ElastiCache with Multi-AZ for high availability. Implement fallback logic in the application if the cache is unavailable.
- Governance: Use AWS Config to monitor compliance of performance configurations. Use AWS Trusted Advisor for performance optimization recommendations.
Migration and Modernization Strategies
- Modernization for Performance: Refactor applications to be stateless and use caching extensively. Migrate to serverless and managed services that offer optimized scalability and performance. Adopt microservices to isolate and optimize components individually.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: An e-commerce application suffers from high latency in database queries during traffic spikes, impacting user experience.
- Architectural Decision: Implement Amazon ElastiCache (Redis) to cache frequently accessed query results (e.g., product details, categories). The application first checks the cache; if data is not there, it queries the database and stores the result in the cache for future requests.
Common Pitfall: Caching data that changes frequently, leading to stale data. Not invalidating the cache correctly when source data changes. Not monitoring the cache hit ratio, which can indicate that the cache is not effective.
3.3. Architect solutions for monitoring and observability.
Fundamental and Advanced Concepts
Monitoring and observability are essential for understanding application behavior, identifying problems, optimizing performance, and ensuring security. AWS offers a robust set of tools to collect, analyze, and visualize telemetry data.
- Amazon CloudWatch: A monitoring and management service that provides actionable data and insights for AWS, hybrid, and on-premises applications. It collects and tracks metrics, collects and monitors log files, and triggers alarms based on defined thresholds.
- Metrics: Data points that represent a resource’s performance over time (e.g., CPU utilization, disk I/O, Load Balancer requests).
- Logs: Records of events from applications and services (e.g., EC2 logs, Lambda, VPC Flow Logs).
- Alarms: Trigger actions (e.g., SNS notification, Auto Scaling) when a metric exceeds a threshold.
- Dashboards: Customizable visualizations of metrics and logs.
- AWS X-Ray: Helps developers analyze and debug distributed applications, such as those built using a microservices architecture. It provides an end-to-end view of requests as they travel through application components, showing application performance, bottlenecks, and errors.
- AWS CloudTrail: Records AWS API calls made in your account, including those made by users, roles, and AWS services. These event logs can be used for security auditing, resource change tracking, and operational troubleshooting.
- AWS Config: Allows you to assess, audit, and evaluate the configurations of your AWS resources. It continuously monitors resource configurations and compares them against best practices or defined policies, alerting on non-compliance.
- Conformance Packs: Collections of AWS Config rules and actions that can be deployed as a single entity to help audit and report on the compliance of a group of resources.
- Amazon Kinesis (for log and metric ingestion): Can be used to collect large volumes of logs and metrics in real time from various sources for downstream processing and analysis.
- AWS Systems Manager (OpsCenter, Explorer): Provides a centralized view of operational data from your AWS resources and allows you to automate the identification and remediation of issues.
- AWS Health Dashboard: Provides a personalized view of the health of AWS services and events that may affect your resources.
- AWS Trusted Advisor: A service that provides recommendations for optimizing your AWS resources in five categories: cost optimization, performance, security, fault tolerance, and service limits.
- Feature flags with CloudWatch Evidently: Allows developers to launch new features to a subset of users, monitor their performance and impact, and quickly enable or disable the feature based on real-time metrics. This is part of an observability-driven development strategy.
Comparisons Between Services and When to Use Each One
CloudWatch vs. X-Ray vs. CloudTrail:
Characteristic | Amazon CloudWatch | AWS X-Ray | AWS CloudTrail |
---|---|---|---|
Purpose | Performance monitoring and logs | Distributed request tracing | API call auditing and governance |
Data Collected | Metrics, logs, events | Request traces, segments, subsegments | API events (who, what, when, where) |
Use Cases | Alarms, dashboards, scalability | Microservices debugging, latency analysis | Security auditing, compliance, operational troubleshooting |
View | Resource health and performance | Request flow through services | Account activity history |
- Use CloudWatch when: You need to monitor the health and performance of your resources and applications, collect logs, and set up alarms. It is the central tool for monitoring on AWS.
- Use X-Ray when: You need to understand the flow of requests through a distributed application, identify performance bottlenecks, and debug errors in microservices. It complements CloudWatch for application observability.
- Use CloudTrail when: You need to audit activities in your AWS account for security, compliance, and troubleshooting. It records all actions performed in the account.
Configurations, Limitations, Best Practices, and Use Cases
Amazon CloudWatch:
- Configurations: Create alarms based on metrics, configure custom dashboards, send application logs to CloudWatch Logs, use CloudWatch Events for automation.
- Limitations: Log and metric retention can incur costs. Standard metric granularity (1 minute).
- Best Practices: Set alarms for critical metrics. Use structured logs (JSON) for easier analysis. Create dashboards for quick visibility. Use CloudWatch Agent to collect metrics and logs from EC2 instances and on-premises.
- Use Cases: Resource utilization monitoring, application error detection, security alerts, automatic scaling.
AWS X-Ray:
- Configurations: Enable X-Ray for your services (Lambda, EC2, ECS), configure the X-Ray SDK in your application, use the X-Ray Daemon to send trace data.
- Limitations: Minimal performance overhead on the application. Requires code instrumentation.
- Best Practices: Instrument all application components. Use annotations and metadata to add context to traces. Monitor the X-Ray service map to identify dependencies and bottlenecks.
- Use Cases: Distributed application debugging, end-to-end latency analysis, microservices performance optimization.
AWS Config:
- Configurations: Enable AWS Config for desired resource types, create managed or custom rules, configure conformance packs.
- Limitations: Costs based on the number of recorded configurations and evaluated rules.
- Best Practices: Use rules to enforce security and compliance standards. Use conformance packs for large-scale compliance audits. Integrate with AWS Security Hub for aggregation of security findings.
- Use Cases: Compliance auditing, configuration change monitoring, security deviation detection.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Centralized Observability Architecture: Aggregate logs and metrics from all accounts and regions into a centralized logging account. Use CloudWatch, X-Ray, and CloudTrail for a unified view.
- Security Patterns: Use CloudTrail to audit all actions in the account. Use AWS Config to monitor security compliance. Integrate with Security Hub for a consolidated view of the security posture.
- Resilience: Use CloudWatch metrics and alarms to detect problems and trigger automatic recovery actions (e.g., Auto Scaling, Lambda).
- Governance: Use AWS Config to ensure resources are configured according to organizational policies. Use CloudTrail to track who did what and when.
Migration and Modernization Strategies
- Modernization for Observability: When modernizing, adopt an observability-driven development approach. Instrument applications with X-Ray. Use structured logs. Implement feature flags with CloudWatch Evidently for controlled releases and impact monitoring.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A microservices application is experiencing high latency and intermittent errors, but it’s difficult to identify the root cause due to the complexity of the distributed architecture.
- Architectural Decision: Implement AWS X-Ray to trace requests across all microservices. X-Ray will provide a visual service map, showing dependencies, latencies at each step, and any errors, allowing quick identification of the service or component causing the problem.
Common Pitfall: Collecting too much monitoring data without a plan to analyze or act on it, resulting in noise and making it difficult to identify real problems. Another pitfall is not configuring adequate alarms, which means problems are only discovered when they are already impacting users.
3.4. Architect solutions for continuous security.
Fundamental and Advanced Concepts
Security on AWS is a shared responsibility between AWS and the customer. AWS is responsible for security of the cloud, while the customer is responsible for security in the cloud. This domain focuses on how to architect solutions to maintain a robust and continuous security posture, using AWS security services.
- AWS Security Hub: Provides a comprehensive, high-level view of your security posture in AWS. It aggregates, organizes, and prioritizes your security alerts (or findings) from various AWS services (such as GuardDuty, Inspector, Macie) and AWS partner products. Allows you to run automated security checks against industry standards and best practices.
- Amazon GuardDuty: A threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts and workloads. It uses machine learning, anomaly detection, and threat intelligence to identify threats such as cryptocurrency mining, unauthorized data access, and denial-of-service attacks.
- Amazon Detective: A service that makes it easy to analyze, investigate, and quickly identify the root cause of security findings or suspicious activities. It automatically collects log data from your AWS resources and uses machine learning, statistical analysis, and graph theory to build a unified model of your activities, making it easier to conduct security investigations.
- AWS Inspector: An automated vulnerability management service that helps improve the security and compliance of applications deployed on AWS. It automatically discovers and continuously assesses EC2 instances, container images in Amazon ECR, and Lambda functions for vulnerabilities and deviations from best practices.
- AWS WAF (Web Application Firewall): Helps protect your web applications or APIs against common web exploits that may affect availability, compromise security, or consume excessive resources. Allows you to create custom rules to block or allow traffic based on conditions you specify.
- AWS Shield: A managed Distributed Denial of of Service (DDoS) protection service that safeguards applications running on AWS. It offers two tiers: Standard (automatic protection for all AWS customers) and Advanced (enhanced protection for critical applications, with more sophisticated detection and mitigation).
- AWS Firewall Manager: A security management service that allows you to centrally configure and manage firewall rules across your accounts and applications in AWS Organizations. Simplifies the deployment and auditing of AWS WAF, AWS Shield Advanced, Security Groups, AWS Network Firewall, and Route 53 Resolver DNS Firewall rules.
- AWS Certificate Manager (ACM): Makes it easy to provision, manage, and deploy SSL/TLS certificates for use with AWS services (such as ELB, CloudFront, API Gateway). Helps secure network communication and establish the identity of websites and applications.
- AWS Secrets Manager: Helps protect access to your applications, services, and IT resources. Enables you to rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle. Reduces the need to hardcode secrets in plain text.
- AWS Systems Manager Parameter Store: Provides secure, hierarchical storage for configuration data and secrets management. Can be used to store passwords, database connection strings, license codes, and other values that need to be accessed by your applications.
- AWS Single Sign-On (AWS SSO) / AWS IAM Identity Center: A cloud service that makes it easy to centrally manage access to multiple AWS accounts and business applications. Allows users to sign in once to access all their assigned resources and applications.
- Principle of Least Privilege: A fundamental security practice that states that each user, program, or process should have only the minimum permissions necessary to perform its function. On AWS, this is implemented through granular IAM policies.
- Security Automation: Using services like AWS Security Hub and AWS Config Rules to automate the detection of security non-compliance and, in some cases, remediation.
Comparisons Between Services and When to Use Each One
AWS Security Hub vs. Amazon GuardDuty vs. Amazon Detective vs. AWS Inspector:
Service | Primary Purpose | Analysis Type | Use Cases |
---|---|---|---|
Security Hub | Consolidated view of security posture, compliance | Aggregation and prioritization of findings, standard checks | Centralized security management, compliance auditing, reporting |
GuardDuty | Threat detection and malicious activity | Machine learning, anomaly detection, threat intelligence | Account compromise identification, unauthorized access, cryptocurrency mining |
Detective | Investigation and root cause analysis of incidents | Graph analysis, machine learning, log data | Forensic investigation, suspicious activity identification, behavior analysis |
Inspector | Vulnerability assessment in workloads | Vulnerability analysis in EC2, containers, Lambda | CVE identification, security best practice deviations in applications |
- These services are complementary and should be used together for a comprehensive security posture. Security Hub acts as an orchestrator and aggregator, while GuardDuty, Detective, and Inspector provide specific insights into threats, investigations, and vulnerabilities, respectively.
AWS WAF vs. AWS Shield:
- AWS WAF: Protects web applications against application-level attacks (Layer 7), such as SQL injection, cross-site scripting (XSS). You define rules to allow or block traffic.
- AWS Shield: Protects against Distributed Denial of Service (DDoS) attacks at network (Layer 3) and transport (Layer 4) levels, and also at the application level (Layer 7) with Shield Advanced. It is broader protection against denial-of-service attacks.
Configurations, Limitations, Best Practices, and Use Cases
AWS Security Hub:
- Configurations: Enable Security Hub, integrate with other AWS services (GuardDuty, Config, Macie), configure security standards (AWS Foundational Security Best Practices, CIS AWS Foundations Benchmark).
- Limitations: Requires enabling other security services to collect findings. Not a direct remediation tool.
- Best Practices: Use Security Hub as your central security dashboard. Automate response to findings using EventBridge and Lambda. Regularly review findings and prioritize remediation.
- Use Cases: Centralized security management, compliance auditing, incident response.
Amazon GuardDuty:
- Configurations: Enable GuardDuty in all regions. Configure notifications for high-severity findings.
- Limitations: Does not block traffic. Only detects threats.
- Best Practices: Enable GuardDuty in all accounts and regions. Integrate findings with Security Hub and SIEM systems. Use to identify suspicious activity in real time.
- Use Cases: Credential compromise detection, unauthorized instance access, unusual port usage, cryptocurrency mining.
AWS KMS (Key Management Service):
- Configurations: Create KMS keys (customer managed keys - CMKs), define key policies, configure key rotation. Use multi-Region keys for key replication across regions.
- Limitations: Costs per key and per API request. Request rate limits.
- Best Practices: Use KMS to encrypt data at rest in AWS services (S3, EBS, RDS, Lambda). Use separate keys for different applications or environments. Implement the principle of least privilege for key access.
- Use Cases: Data encryption at rest, encryption key management, digital signatures.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Layered Security Architecture (Defense in Depth): Implement security at all layers of the architecture, from network (VPC, Security Groups, NACLs, WAF, Shield) to application (IAM, Secrets Manager, KMS) and data (encryption).
- Security Patterns: Implement the principle of least privilege in all IAM policies. Use cross-account roles for secure access between accounts. Automate the detection and response to security incidents.
- Resilience: Use AWS managed security services that are inherently resilient (e.g., Shield, WAF). Design so that security failures do not lead to a complete service interruption.
- Governance: Use AWS Organizations and SCPs to enforce security policies across the organization. Use AWS Config and Conformance Packs to monitor security compliance. Conduct regular security audits.
Migration and Modernization Strategies
- Modernization for Security: When modernizing, adopt a shift-left security approach, integrating security into the development lifecycle. Use automated security tools. Migrate from manual secret management to AWS Secrets Manager. Adopt ABAC for scalable permission management.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A company needs to ensure that no EC2 instance is launched without EBS volume encryption in its production account, and that all suspicious activities are detected and investigated.
- Architectural Decision: Use AWS Config to create a rule that detects EC2 instances with unencrypted EBS volumes and, optionally, a Lambda function to automatically remediate. Enable Amazon GuardDuty for threat detection and Amazon Detective for incident investigation. GuardDuty and Config findings would be aggregated in AWS Security Hub for a centralized view.
Common Pitfall: Granting excessive permissions (elevated privilege) to users or roles, violating the principle of least privilege. Not regularly reviewing IAM policies and security configurations. Relying solely on a single layer of security, instead of implementing defense in depth.
Domain 4: Cost and Performance Optimization
This domain focuses on the ability to design solutions that are cost and performance efficient. This involves selecting appropriate services and resources, implementing optimization strategies, and continuous monitoring to ensure spending is controlled and performance meets requirements.
Topics and Subtopics:
4.1. Architect cost optimization strategies.
Fundamental and Advanced Concepts
Cost optimization on AWS is a continuous process that aims to maximize the value of cloud investments. The Cost Optimization Pillar of the AWS Well-Architected Framework provides guidelines for achieving this goal.
- AWS Well-Architected Framework (Cost Optimization Pillar): This pillar focuses on the ability to run systems to deliver business value at the lowest price. It covers topics such as adopting a consumption model, measuring overall efficiency, stopping spending on undifferentiated heavy lifting, and analyzing and attributing expenditures.
- AWS Pricing Models:
- On-Demand: Pay for compute capacity by the hour or second, with no long-term commitment. Ideal for unpredictable or short-term workloads.
- Reserved Instances (RIs): Commitment to use EC2, RDS, Redshift, ElastiCache instances for 1 or 3 years in exchange for a significant discount (up to 75% compared to On-Demand). Can be Standard (less flexible) or Convertible (more flexible).
- Savings Plans: A flexible pricing model that offers lower prices in exchange for a consistent usage commitment (measured in USD/hour) for a 1 or 3-year period. Covers EC2, Fargate, and Lambda. Offers more flexibility than RIs.
- Spot Instances: Allow you to request unused EC2 capacity from AWS at steep discounts (up to 90% compared to On-Demand). Ideal for interruption-tolerant workloads, such as batch processing, rendering, or testing.
- Resource Optimization:
- Rightsizing: Adjusting the size of resources (EC2 instances, EBS volumes, etc.) to match workload demand, avoiding over-provisioning.
- Elasticity: Using Auto Scaling to automatically scale resources up and down based on demand, ensuring you only pay for the capacity you actually use.
- Shutting Down Idle Resources: Identifying and shutting down unused resources (e.g., development instances outside business hours).
- Cost Management and Visibility:
- AWS Cost Explorer: A tool to visualize, understand, and manage your AWS costs and usage over time. Allows you to analyze trends, identify high-cost areas, and forecast future spending.
- AWS Budgets: Allows you to set custom budgets that alert you when your costs or usage exceed (or are forecasted to exceed) your budgeted thresholds.
- AWS Cost and Usage Report (CUR): Provides a comprehensive set of data about your AWS costs and usage, enabling detailed analysis and integration with BI tools.
- AWS Compute Optimizer: A service that uses machine learning to analyze the configuration and utilization metrics of your AWS compute resources (EC2, Auto Scaling Groups, Lambda, EBS, Fargate) and recommends optimal resources to reduce costs and improve performance.
- AWS Trusted Advisor: Offers recommendations for cost optimization, performance, security, fault tolerance, and service limits.
- Multi-Account Strategies for Cost Optimization: Using AWS Organizations for consolidated billing and applying cost policies across the organization.
Comparisons Between Services and When to Use Each One
Reserved Instances (RIs) vs. Savings Plans:
Characteristic | Reserved Instances (RIs) | Savings Plans |
---|---|---|
Commitment | Use of a specific instance configuration (type, region, OS) | Consistent spend (USD/hour) for a period |
Flexibility | Lower (instance/region specific) | Higher (applies to EC2, Fargate, Lambda) |
Types | Standard (less flexible, higher discount), Convertible (more flexible, lower discount) | Compute Savings Plans, EC2 Instance Savings Plans |
Applicability | EC2, RDS, Redshift, ElastiCache, DynamoDB | EC2, Fargate, Lambda |
AWS Recommendation | Savings Plans are generally preferred due to flexibility |
- Use Savings Plans when: You have consistent compute usage (EC2, Fargate, Lambda) and want the highest discount with flexibility to change instance types or regions.
- Use Reserved Instances when: You have very stable and predictable usage of a specific instance type (especially for RDS, Redshift, ElastiCache, where Savings Plans do not apply).
Configurations, Limitations, Best Practices, and Use Cases
Cost Optimization:
- Configurations: Enable AWS Cost Explorer and AWS Budgets. Configure AWS Compute Optimizer. Use lifecycle policies for S3. Configure Auto Scaling to scale down.
- Limitations: Cost optimization is a continuous process and requires regular monitoring and adjustments. Not all workloads are suitable for Spot Instances.
- Best Practices: Start with right-sizing. Use the consumption model (pay for what you use). Leverage volume discounts. Use Spot Instances for fault-tolerant workloads. Use Savings Plans or RIs for baseline workloads. Monitor and analyze costs regularly.
- Use Cases: Cost reduction in development/test environments, production workload optimization, budget planning.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Serverless Architecture: Often the most cost-efficient, as you pay only for actual usage, without server provisioning.
- Security Patterns: Implement the principle of least privilege to avoid over-provisioning resources. Monitor resource usage to identify anomalies that may indicate malicious activity.
- Resilience: Elasticity (Auto Scaling) that optimizes costs also contributes to resilience, ensuring capacity is adjusted to demand.
- Governance: Use tagging for cost attribution. Implement budget policies. Use AWS Organizations to manage costs across multiple accounts.
Migration and Modernization Strategies
- Migration for Cost Optimization: When migrating, consider refactoring applications to use managed and serverless services that offer a more granular and optimized cost model. Use AWS Migration Hub to track and optimize the migration process.
- Modernization for Cost Optimization: Adopt microservices and containerization to optimize resource usage. Use serverless services whenever possible. Implement FinOps practices to continuously manage and optimize cloud costs.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: A company has a large fleet of EC2 instances for batch data processing that can be interrupted and restarted. They want to reduce compute costs.
- Architectural Decision: Use Spot Instances for the EMR cluster’s task nodes. Since task nodes do not store data, an interruption of a Spot instance can be tolerated. This can significantly reduce compute costs. For master and core nodes, use On-Demand or Reserved Instances to ensure cluster stability.
Common Pitfall: Not considering the total cost of ownership (TCO) when migrating to the cloud, focusing only on infrastructure costs and ignoring operational, licensing, or migration costs. Another pitfall is purchasing RIs or Savings Plans without careful analysis of future usage, resulting in unused capacity and wasted money.
4.2. Architect solutions for performance optimization.
Fundamental and Advanced Concepts
Performance optimization has already been covered in detail in Domain 3 (Section 3.2). In the context of Domain 4, the focus is on the intersection between performance and cost, seeking the ideal balance to deliver business value.
- AWS Well-Architected Framework (Performance Efficiency Pillar): As a reminder, this pillar focuses on the ability to use computing resources efficiently to meet system requirements and maintain that efficiency as demand changes. Performance optimization often leads to cost optimization, as more efficient resources can mean fewer resources needed.
- Key Services and Strategies (review and deep dive with a cost focus):
- Rightsizing: Use AWS Compute Optimizer to identify and resize resources (EC2, EBS, Lambda, Fargate) to the optimal size, which optimizes both performance and cost.
- Caching: Implement caching strategies (CloudFront, ElastiCache) to reduce latency and load on backend services, which can lead to a lower need for compute resources and, consequently, lower costs.
- Elasticity and Scalability: Use Auto Scaling and serverless services (Lambda, Fargate, DynamoDB) that automatically scale with demand. This ensures you have the necessary performance at peaks, but don’t pay for idle capacity during low demand periods.
- Database Optimization: Choose the correct database type for the workload (relational, NoSQL, etc.) and optimize queries. Using DynamoDB with its on-demand capacity model, for example, can be more cost-efficient for workloads with traffic spikes than a relational database provisioned for the peak.
- Network Optimization: Utilize services like CloudFront and Global Accelerator to improve content delivery performance and traffic routing, which can reduce the need for backend resources and, therefore, costs.
- Intelligent Storage: Use S3 storage classes like Intelligent-Tiering, which automatically move data to infrequent access or archival classes based on access patterns, optimizing costs without compromising access performance when needed.
Comparisons Between Services and When to Use Each One
- Compute Optimizer vs. Manual Analysis: Compute Optimizer automates performance and cost analysis, providing machine learning-based recommendations, which is more efficient and accurate than manual analysis, especially in large and dynamic environments.
Configurations, Limitations, Best Practices, and Use Cases
Performance and Cost Optimization:
- Configurations: Enable AWS Compute Optimizer. Monitor performance and cost metrics in CloudWatch and Cost Explorer. Configure lifecycle policies for S3 Intelligent-Tiering.
- Limitations: Optimization is a trade-off. Over-optimizing cost can impact performance, and vice versa. It is crucial to find the right balance for business requirements.
- Best Practices: Adopt a FinOps culture. Regularly review Compute Optimizer recommendations. Test optimizations in non-production environments before applying them in production. Automate the shutdown of unused resources.
- Use Cases: Infrastructure cost reduction, improved user experience, ensuring applications meet performance SLAs.
Recommended Architectural Models, Security Patterns, Resilience, and Governance
- Elastic and Serverless Architecture: Allows capacity to adjust to demand, optimizing both performance (sufficient capacity at peaks) and cost (no payment for idle capacity).
- Security Patterns: Resource optimization (right-sizing) also contributes to security, as fewer exposed resources mean a smaller attack surface.
- Resilience: Elasticity is a key component of resilience. The ability to scale up and down quickly helps maintain performance and availability under variable load.
- Governance: Use tags to track resource usage and cost. Implement resource management policies to ensure resources are continuously optimized.
Migration and Modernization Strategies
- Modernization for Performance and Cost: Refactor applications to use serverless and managed services that offer automatic scalability and a consumption-based cost model. This generally results in better performance and lower costs compared to running workloads on self-managed VMs.
Practical Scenario Examples that Appear in the Exam, Architectural Decisions, and Common Pitfalls
Scenario: An e-commerce web application has variable traffic throughout the day, with significant spikes during promotions. The company wants to ensure performance during peaks and optimize costs during low demand periods.
- Architectural Decision: Use Amazon EC2 Auto Scaling for the application backend, configured with scaling policies based on CPU utilization or Load Balancer request count. For the database, consider Amazon Aurora Serverless or DynamoDB On-Demand, which automatically scale capacity and charge only for actual usage, eliminating the need to provision for the peak.
Common Pitfall: Not implementing Auto Scaling or using a fixed provisioning model for workloads with variable demand, resulting in over-provisioning (high cost) or under-provisioning (poor performance). Ignoring Compute Optimizer recommendations or not acting on them.
Comparative Tables and Summaries
This section consolidates important information into tables to facilitate quick reference and architectural decision-making.
Disaster Recovery (DR) Strategies
Strategy | RTO (Recovery Time) | RPO (Data Loss) | Cost | Complexity | Use Case |
---|---|---|---|---|---|
Backup and Restore | Hours to days | Hours | Low | Low | File data, non-critical backups |
Pilot Light | Tens of minutes to hours | Minutes | Low-Medium | Medium | Critical business applications with some tolerance |
Warm Standby | Minutes to tens of minutes | Seconds to minutes | Medium | Medium-High | Important business applications with low tolerance |
Multi-Site Active-Active | Seconds (or zero) | Zero (or near zero) | High | High | Mission-critical applications with zero tolerance |
Reserved Instances (RIs) vs. Savings Plans
Characteristic | Reserved Instances (RIs) | Savings Plans |
---|---|---|
Commitment | Use of a specific instance configuration (type, region, OS) | Consistent spend (USD/hour) for a period |
Flexibility | Lower (instance/region specific) | Higher (applies to EC2, Fargate, Lambda across regions/types) |
Types | Standard (less flexible, higher discount), Convertible (more flexible) | Compute Savings Plans, EC2 Instance Savings Plans, SageMaker Savings Plans |
Applicability | EC2, RDS, Redshift, ElastiCache, DynamoDB | EC2, Fargate, Lambda, SageMaker |
AWS Recommendation | Savings Plans are generally preferred due to greater flexibility for compute. RIs are still relevant for RDS, Redshift, etc. |
Common Auto Scaling Triggers
Metric/Trigger | Description | Common Use Case |
---|---|---|
CPU Utilization | Scales based on the average CPU utilization of the Auto Scaling group. | Web applications, CPU-intensive processing |
Request Count per Target | Scales based on the number of requests per instance in the Load Balancer. | Web applications, APIs |
Network Traffic (In/Out) | Scales based on the amount of network traffic per instance. | Streaming applications, data transfer |
SQS Queue Size | Scales based on the number of messages in an SQS queue. | Asynchronous job processing, batch workloads |
Custom Metrics | Scales based on any custom CloudWatch metric. | Workloads with specific business metrics (e.g., active users) |
Scheduled Scaling | Scales at specific times of day or week. | Applications with predictable traffic patterns (e.g., business hours) |
Comparison of Messaging Services: SNS vs. SQS vs. EventBridge
Characteristic | Amazon SNS | Amazon SQS | Amazon EventBridge |
---|---|---|---|
Model | Publish/Subscribe (Pub/Sub) | Message Queue | Event Bus |
Delivery | Push (to multiple subscribers) | Pull (by one consumer) | Event routing to multiple targets |
Use Cases | Notifications, fanout, microservices | Decoupling, asynchronous processing | Event-driven architectures, SaaS integration |
Durability | Messages delivered to active subscribers | Messages persistent in queue | Events routed to targets |
Filtering | Simple (attribute-based) | No (consumer processes all messages) | Advanced (based on event content) |
Detailed Coverage of Mandatory Topics
This section deepens the most critical and frequently covered topics in the exam, consolidating knowledge from previous domains.
1. Multi-Account Strategies and AWS Organizations
- AWS Organizations: Allows you to centrally manage multiple AWS accounts. Facilitates policy application, consolidated billing, and security management.
- Landing Zone (via AWS Control Tower): A reference architecture for setting up a secure and scalable multi-account environment. Control Tower automates the creation of the Landing Zone, establishing a baseline with accounts for logging, auditing, and security, and applying guardrails (based on SCPs) for governance.
- Organizational Units (OUs): Allow you to group accounts to apply policies hierarchically. A common structure is to have OUs for Infrastructure, Security, Workloads (separated by Prod, Dev, Test), etc.
- Service Control Policies (SCPs): Service control policies that define the maximum permissions for accounts in an OU. They do not grant permissions but act as a filter, restricting what account administrators can delegate to users and roles. For example, an SCP can deny access to specific regions or prevent the disabling of security services like CloudTrail and GuardDuty.
2. Advanced Networking
- Transit Gateway: Acts as a central network hub to connect VPCs and on-premises networks. Simplifies network architecture, eliminates the need for complex VPC Peering, and enables centralized routing.
- PrivateLink: Provides private connectivity between VPCs, AWS services, and your on-premises networks without exposing traffic to the public internet. Creates an interface endpoint in your VPC that connects to an endpoint service.
- Direct Connect: A dedicated, private network connection between your data center and AWS. Offers consistent bandwidth and low latency. Direct Connect Gateway allows connecting to multiple VPCs in different regions with a single connection.
- Multi-Region DNS with Route 53: Essential for high availability and disaster recovery. Use routing policies like Failover (for active-passive DR), Latency-based (for global performance), and Geolocation (for data compliance) to direct traffic to the most appropriate region.
3. Advanced IAM
- Permission Boundaries: An IAM managed policy that defines the maximum permissions an identity policy can grant to an entity (user or role). Useful for delegating role creation to developers, ensuring they cannot create roles with more permissions than the defined limit.
- Attribute-Based Access Control (ABAC): An authorization strategy that defines permissions based on attributes (tags) attached to users and resources. It is more scalable than RBAC (Role-Based Access Control) in environments with many resources and policies, as you can create a single policy that applies to all resources with a given tag.
- Cross-Account Roles: Allow users or services in one account (the trusting account) to assume a role in another account (the trusted account) to access resources. It is the standard and secure way to grant cross-account access.
- IAM Roles Anywhere: Allows workloads running outside AWS (such as in your on-premises data centers) to use X.509 certificates to obtain temporary security credentials and access AWS resources. Extends IAM security outside the cloud.
4. Observability and Continuous Security
- CloudWatch: The central service for monitoring metrics and logs. Use CloudWatch Logs Insights for powerful log analysis and CloudWatch Alarms for automation and alerts.
- CloudTrail: Audits all API calls. Essential for security and compliance. Send logs to S3 for long-term retention and to CloudWatch Logs for real-time analysis.
- AWS Config: Monitors resource configuration compliance. Use Conformance Packs to deploy a set of compliance rules across your organization.
- Security Hub: Aggregates security findings from GuardDuty, Inspector, Macie, and other services, providing a unified view of your security posture.
- GuardDuty: Machine learning-based threat detection. Monitors CloudTrail logs, VPC Flow Logs, and DNS to identify malicious activity.
- KMS Multi-Region Keys: KMS keys that can be replicated to multiple regions. Simplify encryption management for data that is replicated across regions, ensuring you can decrypt data in the DR region.
5. Migration and Modernization (7Rs)
The 7 common migration and modernization strategies:
- Retire: Decommission applications that are no longer needed.
- Retain: Keep applications on-premises that are not ready to migrate.
- Rehost (Lift-and-Shift): Migrate applications to the cloud without modifications. Fast, but with fewer cloud benefits. (Ex: use Application Migration Service - MGN).
- Replatform (Lift-and-Tinker): Migrate with some optimizations, such as moving a database to RDS.
- Repurchase: Move to a different product, usually a SaaS solution.
- Refactor/Re-architect: Modify the application to leverage cloud-native features (e.g., microservices, serverless). Greater benefit, but greater effort.
- Relocate: Move infrastructure from one cloud environment to another (e.g., VMware Cloud on AWS).
- AWS Migration Acceleration Program (MAP): A program that provides consulting, tools, and service credits to accelerate migration to AWS.
6. Relevant New Features
- VPC Lattice: Simplifies connectivity, security, and observability between services in different VPCs and accounts, without the complexity of managing Transit Gateways or VPC Peering for each service.
- CloudFront Signed URLs/Cookies: Mechanisms to control access to private content on CloudFront. Signed URLs are for accessing individual files, while signed cookies can grant access to multiple files.
- CloudWatch Evidently: Allows you to perform experiments and feature launches (feature flags) to test new functionalities with a subset of users and measure the impact on performance and business.
Practical Scenarios and Example Questions
Scenario 1: Cost Optimization for Analytics Workload
- Situation: A company runs an Amazon EMR data analytics cluster every night. The job takes about 4 hours to complete and can be restarted in case of failure. The cluster cost is high.
- Architectural Decision: Use Spot Instances for the EMR cluster’s task nodes. Since task nodes do not store data, an interruption of a Spot instance can be tolerated. This can significantly reduce compute costs. For master and core nodes, use On-Demand or Reserved Instances to ensure cluster stability.
Scenario 2: Data Access Security in Multiple Accounts
- Situation: An application in a VPC in Account A needs to securely access an S3 bucket in Account B, without exposing traffic to the internet.
- Architectural Decision: Create a VPC Gateway Endpoint for S3 in Account A’s VPC. In Account B, create an S3 bucket policy that allows access from Account A’s VPC Endpoint. This ensures that traffic between the VPC and S3 remains within the AWS network.
Scenario 3: High Availability for Global Web Application
- Situation: A web application with users worldwide needs low latency and high availability. The application has an API backend and static content.
- Architectural Decision: Use Amazon CloudFront to deliver static and dynamic content. Configure CloudFront with multiple origins in different regions. Use Amazon Route 53 with a latency-based routing policy to direct users to the closest region. Implement Route 53 health checks for automatic failover in case of a region failure.
Scenario 4: Large-Scale Security Governance
- Situation: A company with dozens of AWS accounts wants to ensure that no one can disable AWS CloudTrail or create IAM access keys for the root user.
- Architectural Decision: Use AWS Organizations to manage all accounts. Create a Service Control Policy (SCP) and attach it to the organization root or a specific OU. The SCP should contain an explicit
Deny
statement forcloudtrail:StopLogging
,cloudtrail:DeleteTrail
, andiam:CreateAccessKey
actions when the principal is the root user. This enforces the policy across all affected accounts, even for administrators of those accounts.