AWS Architecture Blog

Journey to Adopt Cloud-Native Architecture Series: #3 – Improved Resilience and Standardized Observability

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service.


In the last blog, Maximizing System Throughput, we talked about design patterns you can adopt to address immediate scaling challenges to provide a better customer experience. In this blog, we talk about architecture patterns to improve system resiliency, why observability matters, and how to build a holistic observability solution.

As a refresher from previous blogs, our example ecommerce company’s “Shoppers” application runs in the cloud. It is a monolithic application (application server and web server) that runs on an Amazon Elastic Compute Cloud (Amazon EC2) instance. It connects with a PostgreSQL database running on Amazon Relational Database Service (Amazon RDS).

The monolithic application is tightly coupled with the database. A transaction from the front end to the database has to complete within 30 milliseconds, otherwise the user experience degrades. The application integrates with a number of external systems for enriching data, payment processing, and order fulfillment. Some of these external systems don't provide Service Level Agreements (SLAs) on response time but have a median response time of 15 milliseconds. It takes around 30 minutes before a new server starts serving user traffic: 10 minutes for application startup, 5 minutes to create and test connections, 10 minutes to run sanity tests, and 5 minutes to load the application cache.

Increase resiliency

The complexity of a monolithic application can surface unknown failure modes. You cannot avoid failure simply by testing every application and infrastructure failure scenario. Your application must be designed to handle cascading failures so that the user experience stays intact even when backend systems are experiencing issues. In the following sections, we show you the steps we took to improve system resiliency for our example company.

Minimum business continuity for failover

Building disaster recovery (DR) strategies into your system requires you to work backwards from recovery point objective (RPO) and recovery time objective (RTO) requirements. Our business needs in this scenario required us to build high availability to prevent 30 minutes of continuous downtime (our RTO) and to prevent persistent user data loss, that is, an RPO of a few minutes.

To meet these goals, we developed a long-term business continuity plan per the Disaster Recovery of Workloads on AWS: Recovery in the Cloud whitepaper.

Due to the stateful nature of the application, taking EBS snapshots wasn't going to prevent data loss. We updated our continuous integration and continuous delivery (CI/CD) pipeline so that we can deploy application code in different Regions, following the guidance in the Using AWS CodePipeline to Perform Multi-Region Deployments blog.

Improve load-balancing capability across servers

The monolithic application handles all transactions, including long transactions that can take up to 3 minutes and short transactions that complete within 30 milliseconds.

To manage the resulting unbalanced traffic, we used the least outstanding requests algorithm in Application Load Balancer, which distributes load more uniformly across targets.
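As a rough sketch of how that switch looks in code (assuming the AWS SDK for Java v2 and a placeholder target group ARN), the routing algorithm is a target group attribute that can be flipped from the default round robin:

```java
import software.amazon.awssdk.services.elasticloadbalancingv2.ElasticLoadBalancingV2Client;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.ModifyTargetGroupAttributesRequest;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.TargetGroupAttribute;

public class EnableLeastOutstandingRequests {
    public static void main(String[] args) {
        // Hypothetical target group ARN; replace with your own.
        String targetGroupArn =
                "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/shoppers/abc123";

        try (ElasticLoadBalancingV2Client elb = ElasticLoadBalancingV2Client.create()) {
            // Switch the routing algorithm from the default round_robin
            // to least_outstanding_requests at the target group level.
            elb.modifyTargetGroupAttributes(ModifyTargetGroupAttributesRequest.builder()
                    .targetGroupArn(targetGroupArn)
                    .attributes(TargetGroupAttribute.builder()
                            .key("load_balancing.algorithm.type")
                            .value("least_outstanding_requests")
                            .build())
                    .build());
        }
    }
}
```

The same attribute can be set in the console or through infrastructure as code; the SDK call is shown here only to make the configuration explicit.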

Exponential backoff

After implementing retries and timeouts, we learned that backoffs are equally important. This is especially true for exponential backoff, where the wait time increases exponentially after every retry attempt. We know from the Timeouts, retries, and backoff with jitter article that unchecked retries can become an anti-pattern for application resiliency.

We observed that database retries were overwhelming the database during periods of latency jitter. So, we implemented retry throttling within the AWS SDK for Java per the Introducing Retry Throttling blog.
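The SDK-level retry throttling above is configured inside the AWS SDK itself; for calls our own code makes, for example to the external enrichment and payment systems, backoff still has to live in application code. Here is a minimal sketch of exponential backoff with full jitter in plain Java; the parameter values are hypothetical:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public final class Backoff {

    // Retries the call with exponential backoff and full jitter.
    // Before retry n+1 it sleeps a random time in [0, min(maxDelay, baseDelay * 2^n)].
    public static <T> T callWithBackoff(Supplier<T> call, int maxAttempts,
                                        Duration baseDelay, Duration maxDelay)
            throws InterruptedException {
        RuntimeException lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt == maxAttempts - 1) {
                    break;                                   // out of attempts; re-throw below
                }
                long cap = Math.min(maxDelay.toMillis(), baseDelay.toMillis() << attempt);
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
        throw lastFailure;
    }
}
```

A call site might look like Backoff.callWithBackoff(() -> paymentClient.charge(order), 5, Duration.ofMillis(50), Duration.ofSeconds(2)), where paymentClient is a hypothetical client for one of the external systems. Full jitter randomizes the sleep across the whole backoff window, so clients spread their retries out instead of all retrying at the same instant.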

Decoupling integrations using event-driven design patterns 

Our application experienced cascading failures when timeouts in the payment processing and order fulfillment APIs impacted business operations.

To fix this, we identified asynchronous messaging paths to improve the customer experience. We updated our application logic to use the design pattern described in the Amazon SNS FIFO topics example use case.

This pattern allows us to retry failed requests while using First-In-First-Out and deduplication features described in the Introducing Amazon SNS FIFO – First-In-First-Out Pub/Sub Messaging blog.
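To make the pattern concrete, here is a minimal sketch of publishing an order event to an SNS FIFO topic with the AWS SDK for Java v2; the topic ARN, message body, and ID values are hypothetical:

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class PublishOrderEvent {
    public static void main(String[] args) {
        // Hypothetical FIFO topic ARN; FIFO topic names must end in ".fifo".
        String topicArn = "arn:aws:sns:us-east-1:123456789012:order-events.fifo";

        try (SnsClient sns = SnsClient.create()) {
            sns.publish(PublishRequest.builder()
                    .topicArn(topicArn)
                    .message("{\"orderId\":\"12345\",\"status\":\"PAYMENT_PENDING\"}")
                    // Events that share a group ID are delivered in order.
                    .messageGroupId("order-12345")
                    // A retried publish with the same deduplication ID is dropped
                    // within the deduplication window instead of being delivered twice.
                    .messageDeduplicationId("order-12345-payment-pending")
                    .build());
        }
    }
}
```

The message group ID keeps events for the same order in order, while the deduplication ID prevents a retried publish from triggering a duplicate downstream action.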

Predictive scaling for EC2

Because of its monolithic architecture and roughly 30-minute bootstrap time, the application couldn't scale quickly in response to sudden increases in traffic.

To predict future demand, we configured predictive scaling to optimize for application availability and to forecast from historical CPU utilization data. This approach is highlighted in the New – Predictive Scaling for EC2, Powered by Machine Learning blog. It lets us adjust capacity ahead of forecasted usage patterns and configure a warm-up time that accounts for application bootstrap.
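The linked blog describes predictive scaling configured through AWS Auto Scaling; EC2 Auto Scaling now also supports attaching a predictive scaling policy directly to an Auto Scaling group. Here is a sketch of that newer form with the AWS SDK for Java v2, assuming a hypothetical Auto Scaling group name and target value:

```java
import software.amazon.awssdk.services.autoscaling.AutoScalingClient;
import software.amazon.awssdk.services.autoscaling.model.PredictiveScalingConfiguration;
import software.amazon.awssdk.services.autoscaling.model.PredictiveScalingMetricSpecification;
import software.amazon.awssdk.services.autoscaling.model.PredictiveScalingPredefinedMetricPair;
import software.amazon.awssdk.services.autoscaling.model.PutScalingPolicyRequest;

public class CreatePredictiveScalingPolicy {
    public static void main(String[] args) {
        try (AutoScalingClient autoScaling = AutoScalingClient.create()) {
            autoScaling.putScalingPolicy(PutScalingPolicyRequest.builder()
                    .autoScalingGroupName("shoppers-asg")      // hypothetical ASG name
                    .policyName("cpu-predictive-scaling")
                    .policyType("PredictiveScaling")
                    .predictiveScalingConfiguration(PredictiveScalingConfiguration.builder()
                            .metricSpecifications(PredictiveScalingMetricSpecification.builder()
                                    // Forecast from the historical ASG CPU utilization metric pair.
                                    .predefinedMetricPairSpecification(
                                            PredictiveScalingPredefinedMetricPair.builder()
                                                    .predefinedMetricType("ASGCPUUtilization")
                                                    .build())
                                    .targetValue(60.0)          // hypothetical CPU target (%)
                                    .build())
                            // Launch instances ~30 minutes ahead of the forecast
                            // to cover the application's bootstrap time.
                            .schedulingBufferTime(1800)
                            .mode("ForecastAndScale")
                            .build())
                    .build());
        }
    }
}
```

The schedulingBufferTime of 1,800 seconds tells the forecast to launch instances roughly 30 minutes early, which matches the application's bootstrap time described earlier.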

Standardize observability

Production outages are scary for everyone, but with the right system monitoring solution they can be made less stressful. After a few outages of our application, we realized we needed to rethink monitoring holistically rather than adding metrics on a one-time basis.

Logging into each server and comparing logs can get overwhelming when you are troubleshooting to identify the bottleneck behind unknown behaviors. But we found the key to troubleshooting: observability and event correlation, as explained in the following sections.

Define application and infrastructure metrics

As part of standardizing monitoring and observability metrics, we followed the guidance provided in the Performance Efficiency pillar of the AWS Well-Architected Framework. In summary, we worked backwards from the customer experience to identify performance-related metrics for each system component and dashboards to provide consolidated views.
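For the customer-facing measurements we worked backwards from, such as checkout latency, custom CloudWatch metrics are one way to feed the consolidated dashboards. A minimal sketch with the AWS SDK for Java v2; the namespace, metric name, dimension, and value are hypothetical:

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class EmitCheckoutLatencyMetric {
    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            cloudWatch.putMetricData(PutMetricDataRequest.builder()
                    .namespace("Shoppers/Checkout")            // hypothetical custom namespace
                    .metricData(MetricDatum.builder()
                            .metricName("CheckoutLatency")
                            .dimensions(Dimension.builder().name("Tier").value("WebApp").build())
                            .value(27.0)                        // measured latency for this transaction, in ms
                            .unit(StandardUnit.MILLISECONDS)
                            .build())
                    .build());
        }
    }
}
```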

Centralized logging

When analyzing downtime events, we found that the time spent correlating different events was leading to a high Mean Time To Resolve (MTTR). Additionally, we were missing a performance baseline. Our analysis also uncovered a gap in security monitoring.

To overcome these challenges, we deployed a centralized logging solution that aggregates application and infrastructure logs into an Amazon Elasticsearch Service (Amazon ES) cluster.

We used this solution as a baseline and updated the configuration as shown in Amazon Elasticsearch Service Best Practices to optimize for performance and scale. We took manual hourly snapshots of the ES cluster to Amazon S3 and used Amazon S3 Cross-Region Replication to restore the cluster in case of a DR event.
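Centralized logs are most useful when events from different tiers share a common key they can be joined on. As an illustration, here is a plain Java sketch of the kind of structured, correlation-ID-tagged log line we index; the field names and service names are our own convention, not part of any AWS API:

```java
import java.time.Instant;
import java.util.UUID;

public class StructuredLogger {
    // Correlation ID generated at the edge (for example, in the web tier)
    // and propagated with every downstream call so events can be joined in one query.
    private final String correlationId;
    private final String service;

    public StructuredLogger(String service, String correlationId) {
        this.service = service;
        this.correlationId = correlationId;
    }

    public void info(String event, String detail) {
        // One JSON object per line; the log shipper forwards stdout to the central cluster.
        System.out.printf(
                "{\"timestamp\":\"%s\",\"level\":\"INFO\",\"service\":\"%s\","
                        + "\"correlationId\":\"%s\",\"event\":\"%s\",\"detail\":\"%s\"}%n",
                Instant.now(), service, correlationId, event, detail);
    }

    public static void main(String[] args) {
        String correlationId = UUID.randomUUID().toString();
        new StructuredLogger("checkout", correlationId).info("payment_requested", "order 12345");
        new StructuredLogger("fulfillment", correlationId).info("order_queued", "order 12345");
    }
}
```

Because both services emit the same correlationId, a single query against the cluster returns the full path of a transaction instead of requiring log comparison server by server.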

Manage service limits with Service Quotas and adopt a multi-account strategy

We observed that most workloads were running in a single account, which was pushing us up against service limits. We implemented two strategies to address this situation: using Service Quotas to monitor usage against service limits and request increases proactively, and adopting a multi-account strategy to isolate workloads and spread usage across accounts.
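Service Quotas exposes both the current limit and a programmatic way to request an increase, so quota checks can be wired into operational tooling. A sketch with the AWS SDK for Java v2, using the EC2 running On-Demand Standard instances quota as an example; the desired value is hypothetical:

```java
import software.amazon.awssdk.services.servicequotas.ServiceQuotasClient;
import software.amazon.awssdk.services.servicequotas.model.GetServiceQuotaRequest;
import software.amazon.awssdk.services.servicequotas.model.GetServiceQuotaResponse;
import software.amazon.awssdk.services.servicequotas.model.RequestServiceQuotaIncreaseRequest;

public class CheckRunningInstanceQuota {
    public static void main(String[] args) {
        try (ServiceQuotasClient quotas = ServiceQuotasClient.create()) {
            // L-1216C47A is the quota code for EC2 "Running On-Demand Standard instances".
            GetServiceQuotaResponse current = quotas.getServiceQuota(GetServiceQuotaRequest.builder()
                    .serviceCode("ec2")
                    .quotaCode("L-1216C47A")
                    .build());
            System.out.println("Current quota: " + current.quota().value());

            // Request an increase before the fleet grows into the limit.
            quotas.requestServiceQuotaIncrease(RequestServiceQuotaIncreaseRequest.builder()
                    .serviceCode("ec2")
                    .quotaCode("L-1216C47A")
                    .desiredValue(256.0)
                    .build());
        }
    }
}
```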

Figure 1. Current Architecture with improved resiliency and standardized observability

Conclusion

In this blog, we talked about design patterns you can adopt to improve overall resiliency, meet your RPO and RTO objectives, scale a monolithic application based on forecasted usage patterns, enhance observability, and plan for a multi-account framework. In the next blog, we will talk about more architecture patterns to evolve the current architecture.

Anuj Gupta

Anuj Gupta is a Principal Solutions Architect working with digital-native business customers on their cloud-native journey. He is passionate about using technology to solve challenging problems and has worked with customers to build highly distributed and low-latency applications. He also contributes to open-source solutions. Outside of work, he loves traveling with his family and meeting new people.

Neeraj Kumar

Neeraj Kumar is a Principal Solutions Architect at AWS. He has 20+ years of experience in IT, including cloud computing, enterprise architecture, and IT strategy and transformation. Prior to joining AWS in January 2016, he worked across multiple industry verticals, including oil and gas, consumer tech, and consulting. He is currently based in the San Francisco Bay Area. Neeraj's passion is solving business problems on customers' behalf using distributed, cloud-native, and well-architected design patterns.