Site Reliability and Cloud Operations

Helping teams run reliable systems in production without losing sleep

I work at the intersection of infrastructure, observability, and customer impact. This site highlights a few practical examples of how I think about reliability and operate cloud systems.

Kubernetes and containers · AWS and multi-cloud · Observability and metrics · Incident response

Content is a work in progress. All examples here will be based on real systems and hands-on work.

Case studies

Selected reliability stories

Each of these will become a one-page summary that follows a simple pattern: a short problem statement, what I did, and the outcome. For now these are placeholders and titles only.

Case study

Stabilizing a noisy production workload

Placeholder for a concrete story about taking a flapping or noisy production system and making it predictable. This could be about tuning autoscaling, fixing metrics, or simplifying a deployment path.

Later this will link to a dedicated page with a short before and after picture plus the way I approached the work.

AWS · Kubernetes · Autoscaling

Case study

Making on call easier to live with

Placeholder for a story about improving alerts, dashboards, and runbooks so that incidents are less chaotic and more repeatable. The focus will be on small changes that delivered a big improvement in signal.

Later this will include one or two sample Prometheus or logging snippets to show how I think about observability.

Prometheus · Grafana · Runbooks

Projects

Hands-on examples with code

These are meant to be focused and practical. Each project will have a short description, a link to the GitHub repository, and a quick note on what to look at in the code or infrastructure.

Project

This site as a small, SRE-friendly stack

A static site hosted on S3 behind CloudFront with a simple deployment pipeline. The goal is clarity and cost control, not complexity. GitHub Actions handle updates and cache invalidation.

Placeholder for a GitHub link and a short walkthrough of the repository layout and deployment flow.

S3 · CloudFront · GitHub Actions

Project

Controlled demo environment

A planned small demo that spins up a short-lived workload in response to a button click. The focus will be on guardrails, observability, and clear teardown so that it is safe to show to potential employers.

This space will later link to the demo page and describe how the limits and cost protections are wired.

AWS · Kubernetes or k3s · Automation

Reliability and monitoring

Treating this site like a real service

This site is monitored the same way I would monitor a small production service. AWS Route 53 runs synthetic HTTPS health checks against the live domain and CloudWatch raises alarms that notify me when something is wrong.


Health checks

Continuous HTTPS uptime monitoring

A Route 53 health check calls https://chris-nelson.dev at a regular interval and publishes metrics into CloudWatch. If the endpoint stops responding or starts failing, the health check status flips to unhealthy.

Implemented as infrastructure as code with Terraform, so the configuration is versioned alongside the rest of the stack.
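
A minimal Terraform sketch of that check, roughly as I would write it (the resource name and tags are illustrative rather than copied from the live configuration):

  # Synthetic HTTPS check against the live domain.
  resource "aws_route53_health_check" "site" {
    fqdn              = "chris-nelson.dev"
    port              = 443
    type              = "HTTPS"
    resource_path     = "/"
    request_interval  = 30   # seconds between checks from each checker
    failure_threshold = 3    # consecutive failures before the check flips to unhealthy

    tags = {
      Name = "chris-nelson-dev-https"   # illustrative tag
    }
  }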

Route 53 health check · HTTPS on 443 · CloudWatch metrics

Alerts and SLO

Alarms wired to on-call notifications

A CloudWatch metric alarm watches the health check status and sends notifications through Amazon SNS to my email when the site is unavailable. This models a realistic production loop: detect, notify, respond, and capture what happened, even for a small personal project.
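
A hedged Terraform sketch of how that loop can be wired. The resource names, thresholds, and placeholder email address are assumptions for illustration, not the live configuration:

  resource "aws_sns_topic" "uptime_alerts" {
    name = "site-uptime-alerts"   # illustrative name
  }

  resource "aws_sns_topic_subscription" "email" {
    topic_arn = aws_sns_topic.uptime_alerts.arn
    protocol  = "email"
    endpoint  = "me@example.com"   # placeholder address
  }

  # Route 53 publishes health check metrics in us-east-1, so this alarm
  # has to be created there (for example via a us-east-1 provider alias).
  resource "aws_cloudwatch_metric_alarm" "site_unhealthy" {
    alarm_name          = "site-health-check-failed"
    namespace           = "AWS/Route53"
    metric_name         = "HealthCheckStatus"
    statistic           = "Minimum"
    period              = 60
    evaluation_periods  = 1
    threshold           = 1
    comparison_operator = "LessThanThreshold"
    treat_missing_data  = "breaching"   # no data from the checkers counts as a failure

    dimensions = {
      HealthCheckId = aws_route53_health_check.site.id
    }

    alarm_actions = [aws_sns_topic.uptime_alerts.arn]
    ok_actions    = [aws_sns_topic.uptime_alerts.arn]
  }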

I track a simple 99.9 percent uptime target for this static site as a lightweight SLO practice.
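
For scale, a 99.9 percent target over a 30 day month allows roughly 43 minutes of downtime: 30 × 24 × 60 × 0.001 ≈ 43.2 minutes of error budget.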

CloudWatch alarm · SNS notifications · Uptime target

How this site works

Simple, explicit infrastructure choices

This section is here so you can see how I think about tradeoffs. The goal is a clear, low-friction path to production for something small that still behaves like a real system and respects cost.

Static front end backed by S3

The site is a plain static front end. Files are stored in a private S3 bucket. CloudFront is the only way to reach it. That keeps the mental model simple and the blast radius small.
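
As a sketch of what "private bucket, CloudFront only" can look like in Terraform, assuming the distribution is defined elsewhere in the same configuration as aws_cloudfront_distribution.site (bucket and resource names are illustrative):

  resource "aws_s3_bucket" "site" {
    bucket = "chris-nelson-dev-site"   # illustrative bucket name
  }

  # Block every form of public access; only CloudFront reads the bucket.
  resource "aws_s3_bucket_public_access_block" "site" {
    bucket                  = aws_s3_bucket.site.id
    block_public_acls       = true
    block_public_policy     = true
    ignore_public_acls      = true
    restrict_public_buckets = true
  }

  # Allow reads only from the specific CloudFront distribution (origin access control pattern).
  resource "aws_s3_bucket_policy" "cloudfront_only" {
    bucket = aws_s3_bucket.site.id
    policy = jsonencode({
      Version = "2012-10-17"
      Statement = [{
        Effect    = "Allow"
        Principal = { Service = "cloudfront.amazonaws.com" }
        Action    = "s3:GetObject"
        Resource  = "${aws_s3_bucket.site.arn}/*"
        Condition = {
          StringEquals = { "AWS:SourceArn" = aws_cloudfront_distribution.site.arn }
        }
      }]
    })
  }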

CloudFront for TLS and performance

CloudFront terminates TLS for both the root domain and the www hostname. A small cache policy keeps static assets fast while still letting updates propagate when the site changes.
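
The cache policy side of that tradeoff might look something like this in Terraform; the TTL values and policy name are assumptions, not the live settings:

  # Cache static assets for a while, but keep the cache key simple:
  # no cookies, headers, or query strings (values are illustrative).
  resource "aws_cloudfront_cache_policy" "static_site" {
    name        = "static-site-cache"
    min_ttl     = 0
    default_ttl = 3600    # one hour by default
    max_ttl     = 86400   # never hold an object longer than a day

    parameters_in_cache_key_and_forwarded_to_origin {
      cookies_config {
        cookie_behavior = "none"
      }
      headers_config {
        header_behavior = "none"
      }
      query_strings_config {
        query_string_behavior = "none"
      }
      enable_accept_encoding_gzip   = true
      enable_accept_encoding_brotli = true
    }
  }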

GitHub based deployments

A GitHub Action syncs the repository to S3 and triggers a CloudFront invalidation. Changes are versioned, reviewable, and repeatable. No manual uploads.
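
The workflow itself lives in GitHub Actions; on the AWS side, the deploy step only needs a narrow set of permissions. A rough Terraform sketch of that least-privilege policy, reusing the illustrative resource names from above:

  # Deploy permissions: write to the site bucket and invalidate the
  # distribution, nothing else (names and ARNs are illustrative).
  resource "aws_iam_policy" "site_deploy" {
    name = "site-deploy"
    policy = jsonencode({
      Version = "2012-10-17"
      Statement = [
        {
          Effect   = "Allow"
          Action   = ["s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]
          Resource = [aws_s3_bucket.site.arn, "${aws_s3_bucket.site.arn}/*"]
        },
        {
          Effect   = "Allow"
          Action   = "cloudfront:CreateInvalidation"
          Resource = aws_cloudfront_distribution.site.arn
        }
      ]
    })
  }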

Cost guardrails

An AWS Budget and a dedicated IAM user keep this environment safe for experimentation. Over time this section will grow to show more of the safety rails around the demo pieces.
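
A minimal sketch of what such a budget guardrail can look like in Terraform; the dollar amount, threshold, and email address are placeholders:

  # Monthly cost cap with an email notification well before the limit.
  resource "aws_budgets_budget" "monthly_cap" {
    name         = "personal-site-monthly"   # illustrative name
    budget_type  = "COST"
    limit_amount = "10.0"                    # placeholder amount
    limit_unit   = "USD"
    time_unit    = "MONTHLY"

    notification {
      comparison_operator        = "GREATER_THAN"
      threshold                  = 80
      threshold_type             = "PERCENTAGE"
      notification_type          = "ACTUAL"
      subscriber_email_addresses = ["me@example.com"]   # placeholder address
    }
  }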

About

Who you are reading about

I work in the space between infrastructure, reliability, and customer experience. Most of my background is in content delivery, large scale cloud workloads, and helping teams feel more confident in the way their systems behave in production.

This page is not a replacement for a resume. It is meant to be a focused sample of how I approach reliability problems and think through technical decisions. The details here will grow over time as I add case studies and project notes.

If you are reviewing this as part of a hiring process and would like more detail on anything you see here, there is a contact section below with simple ways to reach me.