Operations, Security and Reliability Framework for Cloud Architecture

In my previous blog, I covered a system design framework for cloud architecture. But clients today expect zero trust security and high resilience with minimal operational expenditure, so system design factors alone are not enough. Architects need additional strategies built with security, operations and reliability in mind. So, let's finalize our cloud architecture design with these additional points.

1. DevOps

DevOps in itself is linked more to automation in the application development and delivery cycle, supporting the CI/CD cycle of software releases and updates. But the thin line between cloud infra design and DevOps capability is gradually blurring, especially with the introduction of managed Kubernetes engines and serverless services.

When we link it to system design, our cloud architecture must have the relevant build, release tracking and deployment tools, along with monitoring systems, in place before an application is even onboarded. So what we need to ensure is:

  • Build tools like Cloud Build or CodeBuild, an external or internal Git repo, and deployment tooling like Helm charts, with a service mesh like Istio for traffic management
  • Monitoring tools like CloudWatch, or third-party applications like Prometheus
  • Regular checks on bandwidth, latency and HTTP errors (to avoid API call failures)
  • Infrastructure as Code tools like Terraform and configuration management tools like Ansible or Chef
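
To make the "regular checks" point concrete, here is a minimal sketch of how a batch of request samples could be summarized into an error rate and a 95th percentile latency. The function name, the thresholds and the sample format are assumptions made for illustration; in practice these numbers would come from your monitoring tool rather than a hand-built list.

```python
def summarize_requests(samples, latency_limit_ms=500.0, max_error_rate=0.01):
    """Summarize (latency_ms, http_status) samples.

    The thresholds are illustrative defaults, not recommendations.
    """
    if not samples:
        raise ValueError("need at least one sample")

    latencies = sorted(latency for latency, _ in samples)
    # Count server-side failures (HTTP 5xx) as errors.
    errors = sum(1 for _, status in samples if status >= 500)
    error_rate = errors / len(samples)

    # Nearest-rank 95th percentile, using integer math for the rank.
    rank = (95 * len(latencies) + 99) // 100
    p95_latency_ms = latencies[rank - 1]

    return {
        "error_rate": error_rate,
        "p95_latency_ms": p95_latency_ms,
        "healthy": error_rate <= max_error_rate
        and p95_latency_ms <= latency_limit_ms,
    }


if __name__ == "__main__":
    # 19 fast successful calls and one slow failed call.
    demo = [(100.0, 200)] * 19 + [(900.0, 503)]
    print(summarize_requests(demo))
```

A check like this could run on a schedule and raise an alert whenever "healthy" turns false.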

2. Capacity Forecasting

Cloud is still considered expensive. Most models expect ROI within five years, which is often not what actually happens. However, cost can be optimized with proper capacity forecasting and by keeping a check on the resources used. This gives us a better picture of the areas where resources can be scaled down.

Monitoring usage with a broader view of peak hours, application user behavior and region-wise load segregation are a few of the ways to forecast your cloud's resource usage and design accordingly. We will cover more of this in Cost Optimization.
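
As a rough illustration of the idea, the sketch below takes observed hourly load per region, projects the peak forward with a compound growth rate, adds headroom and converts the result into a number of instances. The growth rate, headroom, per-instance capacity and region names are all assumed values for the example, not figures from any provider.

```python
import math


def forecast_units(usage_by_region, monthly_growth=0.05, months_ahead=6,
                   headroom=0.2, unit_capacity=8.0):
    """Estimate instances needed per region from observed hourly load.

    usage_by_region maps a region name to a list of hourly load readings
    (for example requests per second). All defaults are illustrative.
    """
    plan = {}
    for region, hourly_load in usage_by_region.items():
        peak = max(hourly_load)
        # Compound the observed peak forward by the expected growth.
        projected_peak = peak * (1 + monthly_growth) ** months_ahead
        # Keep spare capacity on top of the projected peak.
        required = projected_peak * (1 + headroom)
        plan[region] = math.ceil(required / unit_capacity)
    return plan


if __name__ == "__main__":
    observed = {
        "europe": [10.0, 40.0, 25.0],
        "asia": [5.0, 12.0, 9.0],
    }
    print(forecast_units(observed))
```

Regions whose forecast stays well below what is currently provisioned are the first candidates for scaling down.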

3. Defense in Depth – IAM, Firewall and Security Groups

Keeping a detailed cloud security model out of the scope of this blog, we can look at an overview of the various cloud security aspects to consider while designing our system.

Cloud platforms from the various providers have their own strategies and solutions in place, besides the different third-party security solution providers.

  • Firewalls, either next-generation (NG) virtualized third-party solutions or built-in proxies, NAT gateways etc., can provide the first layer of defense for your cloud solution
  • Security groups provide stateful inspection of allowed ingress and egress traffic, providing another layer in our defense strategy
  • IAM policies, with customized roles for services and policies for users, add one more layer of depth by keeping a check on access between applications and on human interactions with the system
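
The layering can be pictured as a chain of independent checks where a request has to pass every one of them. The sketch below is a toy model of that idea only; the network range, port, principal names and actions are made up for the example and do not mirror any provider's actual rule syntax.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Request:
    source_ip: str
    port: int
    principal: str
    action: str


# Hypothetical rules for each layer.
FIREWALL_ALLOWED_NETWORKS = [ip_network("10.0.0.0/8")]
SECURITY_GROUP_INGRESS_PORTS = {443}
IAM_ALLOWED_ACTIONS = {"svc-orders": {"db:read"}}


def firewall_allows(req):
    source = ip_address(req.source_ip)
    return any(source in network for network in FIREWALL_ALLOWED_NETWORKS)


def security_group_allows(req):
    return req.port in SECURITY_GROUP_INGRESS_PORTS


def iam_allows(req):
    return req.action in IAM_ALLOWED_ACTIONS.get(req.principal, set())


def evaluate(req):
    """Return (allowed, name_of_blocking_layer_or_None)."""
    layers = [
        ("firewall", firewall_allows),
        ("security_group", security_group_allows),
        ("iam", iam_allows),
    ]
    for name, check in layers:
        if not check(req):
            return (False, name)
    return (True, None)


if __name__ == "__main__":
    print(evaluate(Request("10.1.2.3", 443, "svc-orders", "db:read")))
    print(evaluate(Request("10.1.2.3", 443, "svc-orders", "db:write")))
```

The second request comes from an allowed network on an allowed port and is still rejected at the IAM layer, which is the point of having depth: each layer catches what the earlier ones cannot see.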

4. Data Security and Log Trails

Another vital part of the zero trust security approach is ensuring data security via various encryption mechanisms and keeping a trail of actions performed inside your infra and applications using system logs.

It's important to remember that cloud security follows a shared responsibility model, whereby the onus lies on both users and the cloud provider to play their individual roles in keeping everything secure.

  • Data encryption in three stages, viz. encryption by the client before transferring to the cloud, encryption in transit by the cloud provider and encryption of data at rest
  • Using the cloud provider's log tracking services or deploying open source syslog servers can serve the purpose of logging and audit trails
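
Whichever logging service you pick, a trail is only useful if every entry answers who did what, to which resource, when, and with what result. Below is a minimal sketch of one such entry written as a JSON line. The field names and the sample values are assumptions for illustration and do not follow any particular provider's log schema.

```python
import json
from datetime import datetime, timezone


def audit_record(actor, action, resource, outcome):
    """Build one audit trail entry as a JSON line (hypothetical schema)."""
    record = {
        # UTC timestamps keep entries from different regions comparable.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource": resource,
        "outcome": outcome,
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(audit_record("alice", "bucket:delete", "reports-bucket", "denied"))
```

Lines in this shape can be shipped to a syslog server or a managed log service and later filtered by actor or resource.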

5. Reliability with Monitoring, KPIs & Customer Support

Although this section relates closely to the operational aspects of cloud design already covered, reliability is much more than keeping your applications always available and scalable.

Reliable cloud design covers system recovery, measured by MTTR (mean time to recovery) and MTBF (mean time between failures); backup strategies, measured by RTO (recovery time objective) and RPO (recovery point objective); and system performance, measured by metrics and KPIs. All of these grade your system design against its SLIs (Service Level Indicators) and SLOs (Service Level Objectives). Reliability is the responsibility of everyone in engineering, including the development, product management, operations, and site reliability engineering (SRE) teams.

Besides the pointers above, we should never forget that migrating to the cloud brings a huge culture shift as well. You need to upgrade the skill set of existing engineers, introduce a company-wide culture of resource optimization to reduce cost, and implement strict cost monitoring. Building a highly scalable, cloud native, resilient and secure cloud architecture means you need to spend money, which can be controlled if architects and end users are aware of the various pricing models and customize their design accordingly.
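
To see how these measures fit together, here is a small sketch that derives MTTR, MTBF and availability from a list of outage durations and compares the result with an SLO target. It uses the common steady-state relation availability = MTBF / (MTBF + MTTR); the observation window, outage durations and SLO targets are made-up sample values.

```python
def reliability_metrics(outage_hours, period_hours):
    """Compute MTTR, MTBF and availability for one observation period.

    outage_hours lists the duration of each outage in the period.
    """
    failures = len(outage_hours)
    if failures == 0:
        # No failures observed, so MTTR and MTBF are undefined here.
        return {"mttr_hours": None, "mtbf_hours": None, "availability": 1.0}

    downtime = sum(outage_hours)
    uptime = period_hours - downtime
    mttr = downtime / failures   # average time to recover
    mtbf = uptime / failures     # average running time between failures
    return {
        "mttr_hours": mttr,
        "mtbf_hours": mtbf,
        "availability": mtbf / (mtbf + mttr),
    }


def meets_slo(metrics, slo_target):
    """True when the measured availability (the SLI) meets the SLO."""
    return metrics["availability"] >= slo_target


if __name__ == "__main__":
    # Two outages (1 h and 3 h) in a 30-day, 720-hour window.
    month = reliability_metrics([1.0, 3.0], 720.0)
    print(month)
    print("meets 99% SLO:", meets_slo(month, 0.99))
    print("meets 99.9% SLO:", meets_slo(month, 0.999))
```

Here the measured availability is the SLI and the target passed to meets_slo is the SLO, so the same month of outages can pass a 99% objective and fail a 99.9% one.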

