awesome-sre
A curated list of Site Reliability and Production Engineering resources.
Top Related Projects
- howtheysre: A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
- awesome-incident-response: A curated list of tools for incident response
- awesome-scalability: The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
- awesome-sysadmin: A curated list of amazingly awesome open source sysadmin resources inspired by Awesome PHP.
- awesome-chaos-engineering: A curated list of Chaos Engineering resources.
Quick Overview
The "awesome-sre" repository is a curated list of Site Reliability Engineering (SRE) resources. It serves as a comprehensive collection of articles, books, videos, tools, and other materials related to SRE practices, principles, and methodologies. This repository aims to be a valuable reference for both beginners and experienced professionals in the field of SRE.
Pros
- Extensive collection of SRE resources covering various topics and skill levels
- Regularly updated with new content and contributions from the community
- Well-organized structure, making it easy to find specific information
- Includes both theoretical resources and practical tools for SRE implementation
Cons
- May be overwhelming for beginners due to the large amount of information
- Some links may become outdated over time if not regularly maintained
- Lacks detailed explanations or summaries for each resource
- May lag behind emerging trends and cutting-edge practices, since entries are added by hand
Getting Started
To use the awesome-sre repository:
- Visit the GitHub repository: https://github.com/dastergon/awesome-sre
- Browse through the table of contents to find topics of interest
- Click on the links to access the resources
- Consider starring or watching the repository to keep track of new additions
- If you want to contribute, follow the contribution guidelines in the README
Note: This is not a code library, so there are no code examples or specific installation instructions. The repository serves as a reference and collection of resources for SRE professionals and enthusiasts.
Competitor Comparisons
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
Pros of howtheysre
- Focuses specifically on real-world SRE practices from various companies
- Provides detailed case studies and implementation examples
- Regularly updated with new company insights
Cons of howtheysre
- More limited in scope compared to awesome-sre's comprehensive resource list
- Less variety in content types (primarily focuses on company-specific practices)
- May not cover as many general SRE concepts and tools
Code comparison
Not applicable for these repositories, as they primarily consist of curated lists and documentation rather than code samples.
Summary
howtheysre offers in-depth insights into how specific companies implement SRE practices, making it valuable for those seeking real-world examples. It's regularly updated but has a narrower focus compared to awesome-sre.
awesome-sre provides a more comprehensive list of SRE resources, tools, and concepts, making it a better starting point for those looking to explore the field broadly. However, it may not offer as much detail on company-specific implementations.
Both repositories serve different purposes within the SRE ecosystem. howtheysre is ideal for understanding practical applications, while awesome-sre is better for discovering a wide range of SRE-related resources and tools.
A curated list of tools for incident response
Pros of awesome-incident-response
- More focused on specific incident response tools and resources
- Includes sections on memory analysis and disk forensics
- Provides a curated list of books and training courses for incident response
Cons of awesome-incident-response
- Less comprehensive coverage of general SRE practices and principles
- Fewer resources for monitoring, observability, and performance optimization
- Limited information on capacity planning and scalability
Code comparison
Both repositories are curated lists of resources and contain no significant code samples. However, here is a comparison of their README structures:
awesome-incident-response:
# Awesome Incident Response
A curated list of tools and resources for security incident response, aimed to help security analysts and [DFIR](http://www.acronymfinder.com/Digital-Forensics%2c-Incident-Response-(DFIR).html) teams.
- [Awesome Incident Response](#awesome-incident-response)
- [Incident Response](#incident-response)
- [Disk Image Creation Tools](#disk-image-creation-tools)
- [Memory Analysis Tools](#memory-analysis-tools)
awesome-sre:
# Awesome Site Reliability Engineering
A curated list of Site Reliability and Production Engineering resources.
#### What is Site Reliability Engineering?
> "Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE
## Contents
- [Culture](#culture)
- [Education](#education)
- [Books](#books)
- [Hiring](#hiring)
Both repositories serve as valuable resources for their respective domains, with awesome-incident-response being more specialized in security incident handling, while awesome-sre covers a broader range of SRE topics.
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
Pros of awesome-scalability
- Broader focus on scalability concepts beyond just SRE practices
- Includes more visual content like diagrams and infographics
- Offers a comprehensive list of system design interview resources
Cons of awesome-scalability
- Less frequently updated compared to awesome-sre
- Fewer community contributions and engagement
- More general in scope, potentially less depth in specific SRE topics
Code comparison
While both repositories are primarily curated lists without significant code content, here's a comparison of their README structures:
awesome-scalability:
## Table of Contents
- [Scalability](#scalability)
- [System Design](#system-design)
- [Distributed Systems](#distributed-systems)
awesome-sre:
## Contents
- [Culture](#culture)
- [Education](#education)
- [Books](#books)
- [Hiring](#hiring)
Both repositories use similar Markdown structures, but awesome-sre has a more detailed and SRE-specific table of contents, while awesome-scalability covers broader topics related to system scalability and design.
A curated list of amazingly awesome open source sysadmin resources inspired by Awesome PHP.
Pros of awesome-sysadmin
- Broader scope covering general system administration topics
- More extensive list of tools and resources
- Includes categories like Backups, CMDB, and IT Asset Management
Cons of awesome-sysadmin
- Less focused on modern DevOps and SRE practices
- May include outdated or less relevant tools for current industry trends
- Lacks specific sections on observability and incident management
Code comparison
While both repositories are primarily curated lists without significant code, they differ in their organization and structure:
awesome-sysadmin:
## Backups
*Backup software.*
- [Amanda](https://www.amanda.org/) - Client-server model backup tool.
- [Bacula](https://www.bacula.org) - Another Client-server model backup tool.
awesome-sre:
## Reliability
- [Availability Table](https://github.com/dastergon/availability-table) - Table of availability percentages and corresponding downtime.
- [Reliable Product](https://github.com/lyst/MakingLyst/tree/master/reliable-product) - Article series on building reliable products.
The awesome-sre repository focuses more on concepts and practices, while awesome-sysadmin emphasizes specific tools and software categories.
A curated list of Chaos Engineering resources.
Pros of awesome-chaos-engineering
- More focused and specialized content specifically for chaos engineering practices
- Includes a dedicated section on chaos engineering tools and platforms
- Provides resources for chaos engineering in specific environments (e.g., Kubernetes, cloud)
Cons of awesome-chaos-engineering
- Smaller overall collection of resources compared to awesome-sre
- Less coverage of general SRE practices and principles
- May not be as relevant for those seeking broader SRE knowledge
Code comparison
While both repositories are curated lists and don't contain actual code, we can compare their structure:
awesome-chaos-engineering:
## Table of Contents
- [Culture](#culture)
- [Books](#books)
- [Education](#education)
- [Notable Tools](#notable-tools)
awesome-sre:
## Contents
- [Culture](#culture)
- [Education](#education)
- [Books](#books)
- [Hiring](#hiring)
- [Reliability](#reliability)
Both repositories use a similar structure with markdown formatting, but awesome-sre has a broader range of topics covered in its table of contents, reflecting its wider scope in the SRE field.
README
Awesome Site Reliability Engineering
A curated list of awesome Site Reliability and Production Engineering resources.
What is Site Reliability Engineering?
"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE
Contributing
Please take a look at the contribution guidelines first. Contributions are always welcome!
Contents
- Culture
- Education
- Books
- Hiring
- Reliability
- Monitoring & Observability & Alerting
- On-Call
- Post-Mortem
- Capacity Planning
- Service Level Agreement
- Performance
- Programming
- Misc Articles
- Real-time Messaging
- Blogs
- Newsletters
- Conferences & Meetups
- SRE Tools
- SRE Podcasts
Culture
- What is Site Reliability Engineering?
- Keys To SRE by Ben Treynor
- Google SRE Resources
- Notes from Production Engineering by Pedro Canahuati
- PostOps: Recovery from Operations
- Love DevOps? Wait 'till you meet SRE [video]
- How Google Does Planet-Scale Engineering for Planet-Scale Infra
- Site Reliability Engineering at Facebook
- A History of Site Reliability Engineering at Uber
- Case Study: Adopting SRE Principles at StackOverflow
- Site Reliability Engineering at Dropbox
- Site Reliability Engineers - Keeping Google up and running 24/7
- Site Reliability Engineering at Salesforce
- From Sys Admin to Netflix SRE - video and slides
- SRE@Google: Thousands of DevOps Since 2004
- Transactional System Administration Is Killing Us and Must be Stopped
- A hierarchy of SRE needs
- PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
- SRE: An incomplete guide to cultural Narnia - [Video]
- Putting Together Great SRE Teams
- Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air
- Toil: A Word Every Engineer Should Know
- Engineering Reliability into Web Sites: Google SRE
- DEVOPS & SRE AMA - Building High Performance Organizations
- John Allspaw's AMA on Incident Analysis and Postmortems
- Site Reliability Engineering with Paul Newson - Part 1 & Part 2
- How SysAdmins Devalue Themselves
- The Softer Side of DevOps
- SRE, noun. See also: confidence, trust.
- Site Reliability Engineering with Stephen Weinberg
- We are the Google Site Reliability team. We make Google's websites work. Ask us Anything!
- We are the Google Site Reliability Engineering team. Ask us Anything!
- The Ops Identity Crisis
- The Irreproducibility Of Bugs In Large-Scale Production Systems
- SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
- Microservices, DevOps and Production Complexity
- Introducing Google Customer Reliability Engineering
- Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)
- The difference between Site Reliability Engineering, System Administration, and DevOps
- SRE in the Small and in the Large
- SBSRE Meetup: Different SRE roles and challenges (Netflix)
- Panel: Who/What Is SRE?
- Hope Is Not a Strategy
- Tenets of SRE
- Site Reliability Engineering Demystified
- Is Site Reliability Engineering the True "Ops" in DevOps?
- SRE vs. DevOps vs. Cloud Native: The Server Cage Match
- SRE: What's The Big Idea?
- Building the SRE Culture at LinkedIn
- Podcast #111 - SRE: Occasionally Maintaining Infrastructure That You Hate
- Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
- Why should your app get SRE support? - CRE life lessons
- How SREs find the landmines in a service - CRE life lessons
- Making the most of an SRE service takeover - CRE life lessons
- The Cloudcast #301: SRE and Infrastructure Operations (Podcast)
- The SRE model
- Onboarding New Site Reliability Engineers
- Building Blocks for Site Reliability At Google
- Beyond Google SRE: What is Site Reliability Engineering like at Medium?
- Intelligent Site Reliability Engineering - A Machine Learning Perspective
- A crash course in LinkedIn's global site operations
- Google's Site Reliability Engineering with Todd Underwood
- What is Site Reliability Engineering? (VMware)
- A Gentle Introduction to SRE
- Understanding Site Reliability Engineering through Movies and Books
- GOTO 2017 • Site Reliability Engineering at Google • Christof Leng
- The Makeup of Successful Geographically-Distributed SRE Teams - Part 1 & Part 2
- Tech Leadership in SRE
- The Azure Podcast: Episode 227 - Azure SRE
- The human scalability of "DevOps"
- Podcast: Site Reliability Management with Mike Hiraga
- How a cat inspired system reliability at Knowlarity
- Getting Started with Site Reliability Engineering
- "Practical Applications of the Dickerson Pyramid" by Nat Welch
- LinkedIn's Kurt Andersen Uncovers Blindspots in SRE Implementations
- Interview with Betsy Beyer, Stephen Thorne of Google
- Less Risk Through Greater Humanity - Dave Rensin
- Getting Started with SRE - Stephen Thorne, Google
- Building Successful SRE in Large Enterprises
- Solving Reliability Fears with Site Reliability Engineering
- SRE vs. DevOps: competing standards or close friends?
- How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams
- Reliability Engineering - The Essential Discipline for Complex Systems
- The Modern Site Reliability Workbench on Top of OCI
- SRE in the Third Age
- About SRE and how (not) to apply it
- Transitioning a typical engineering ops team into an SRE powerhouse
- Making a Lion Bulletproof: SRE in Banking
- Identifying and tracking toil using SRE principles
- From Ops to SRE: Evolution of the OpenShift Dedicated Team
- Meeting reliability challenges with SRE principles
- A quick introduction to SRE principles
- The SRE I Aspire to Be
- Taming Operational Load with VMware CRE
- SRE Cultural Values
- Are we there yet? Thoughts on assessing an SRE team's maturity
- What SREs have to do with project-based services?
- Making operational work more visible
- SRE vs. DevOps: What's the Difference Between Them?
Education
- Panel: Educating SRE
- From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- New to an SRE team?
- The Systems Engineering Side of Site Reliability Engineering
- Graduating from Bootcamp and interested in becoming a Site Reliability Engineer?
- So you want to be a Site Reliability Engineer?
- Spiraling Ops Debt & the SRE Coding Imperative
- So you want to be an SRE?
- Career Profiles/Site Reliability Engineer
- What is the role of a Site Reliability Engineer?
- Lynda.com: DevOps Foundations: Site Reliability Engineering
- Incident Management Training: Wheel of Misfortune
- Site Un-Reliability Engineering [Video Series]
- The Ultimate Guide to Structuring a 90-Day Onboarding Plan
- SRE fundamentals: SLIs, SLAs and SLOs
- How to Get Into SRE
- Do you have an SRE team yet? How to start and assess your journey
- How SRE teams are organized, and how to get started
- Why SRE Documents Matter
- How to get started with site reliability engineering (SRE)
- Duties of a Site Reliability Engineering Manager
- Designing distributed systems using NALSD flashcards
- Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program
- SRE Classroom: Distributed PubSub workshop
- School of SRE: Curriculum for onboarding non-traditional hires and new grads
Books
- Practical Linux Infrastructure
- Site Reliability Engineering: How Google Runs Production Systems
- The Site Reliability Workbook: Practical Ways to Implement SRE
- Observability Engineering: Achieving Production Excellence
- The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems
- Web Operations - Keeping the Data On Time
- The Checklist Manifesto: How to Get Things Right
- Microservices in Production - Standard Principles and Requirements
- Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization
- Systems Performance: Enterprise and the Cloud [Sample chapter titled CPUs]
- Monitoring Distributed Systems: Case Studies from Google's SRE Teams
- The Human Side of Postmortems: Managing Stress and Cognitive Biases
- Chaos Engineering: Building Confidence in System Behavior through Experiment
- Post-Incident Reviews: Learning from Failure for Improved Incident Responses
- Antifragile Systems and Teams
- How to Monitor the SRE Golden Signals (E-Book)
- Incident Management for Operations
- Real-World SRE
- Seeking SRE
- What is SRE?
- Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications
- Building Secure and Reliable Systems
- Chaos Engineering: Crash test your applications
- 97 Things Every SRE Should Know
- Four Steps to Creating Effective Game Day Tests
- The Linux Programming Interface
Hiring
- SRE Hiring
- Hiring SREs at LinkedIn
- Hiring Site Reliability Engineers
- Hiring your first SRE
- Growing the Site Reliability Team at LinkedIn: Hiring is Hard
- Engineering Manager - Site Reliability Engineering Interview Preparation
Reliability
- The Realities of the Job of Delivering Reliability
- Fail at Scale by Ben Maurer
- Embracing Failure: Fault-Injection and Service Reliability
- 10 Years of Crashing Google
- How we break things at Twitter: failure testing
- Reliable Cron across the Planet
- Push our limits - reliability testing at Twitter
- The Verification of a Distributed System by Caitie McCaffrey
- Weathering the Unexpected
- SRE Hour: Tech Talks by Box & Yelp
- Simplicity: A Prerequisite for Reliability
- The Two Sides to Google Infrastructure for Everyone Else
- How Embracing Continuous Release Reduced Change Complexity
- Making "Push On Green" a Reality
- BeyondCorp: A New Approach to Enterprise Security
- Brainstorming Failure by Jeff Smith
- The Ripple Effect Of Outages And Downtime Cannot Be Underestimated
- The infrastructure behind Twitter: efficiency and optimization
- Dickerson's Hierarchy of Reliability
- The Morning Paper on Operability
- Production is all that matters
- Using load shedding to survive a success disaster - CRE life lessons
- How to avoid a self-inflicted DDoS Attack - CRE life lessons
- Don't gamble when it comes to reliability
- Resilience Engineering: Learning to Embrace Failure
- The Infrastructure Behind Twitter: Scale
- Scaling Reliability at Twitter: So You Want to Add a 9
- Principles Of Chaos Engineering
- Chaos Engineering
- Available...or not? That is the question - CRE life lessons
- How Google Backs Up The Internet Along With Exabytes Of Other Data
- Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements
- The Production Environment at Google - Part 1 & Part 2
- Reliable releases and rollbacks - CRE life lessons
- How release canaries can save your bacon - CRE life lessons
- Things I Learned Managing Site Reliability for Some of the World's Busiest Gambling Sites
- Every Day Is Monday in Operations
- Under the Hood: Ensuring Site Reliability
- Designing reliable systems with cloud infrastructure (Google Cloud Next '17)
- A Google SRE explores GitHub reliability with BigQuery
- Know thy enemy: how to prioritize and communicate risks - CRE life lessons
- Chaos Engineering resources
- CRE life lessons: What is a dark launch, and what does it do for me?
- Why you should pick strong consistency, whenever possible
- The Network is Reliable
- Are You Load Balancing Wrong?
- How production engineers support global events on Facebook
- Google: A Collection Of Best Practices For Production Services
- Canary Analysis Service
- Tips for High Availability
- Progressive Service Architecture At Auth0
- Google Cloud Production Guideline
- production readiness
- Trust By Design: The Fusion of Operational Maturity and Risk Modeling
- Top Seven Myths of Robust Systems
- Taming chaos: Preparing for your next incident
- PID Loops and the Art of Keeping Systems Stable
- Are you ready for production? - Slides
- Production Checklist for Web Apps on Kubernetes
- Finding a problem at the bottom of the Google stack
- Rethinking Task Size in SRE
- How maintenance windows affect your error budget
- The Production Readiness Spectrum
- Generic mitigations
- How we're building a production readiness review process at Grafana Labs
- Resiliency Planning for High-Traffic Events
- Using Fault Injection Testing to Improve DoorDash Reliability
Monitoring & Observability & Alerting
- A Working Theory-of-Monitoring
- The Evolution of Monitoring Systems at Google - Tony Rippy
- Monitoring without Infrastructure @ Airbnb
- Monitoring distributed systems
- Observability at Uber Engineering: Past, Present, Future
- The 4 Golden Signals of API Health and Performance in Cloud-Native Applications
- My Philosophy on Alerting by Rob Ewaschuk
- Time To Detect - Netflix
- Why Percentiles Don't Work the Way you Think
- Building Twitter's Next-Gen Alerting System
- Instrumentation: Worst case performance matters
- Instrumentation: What does 'uptime' mean?
- Incidents + Outages at CircleCI: Our Playbook and What We've Learned
- An introduction to monitoring and alerting with timeseries at scale, with Prometheus
- Detecting outliers and anomalies in realtime at Datadog
- How to Monitor the SRE Golden Signals
- Monitoring in a DevOps World
- Monitoring Your Monitoring's Monitoring
- Observability: the new wave or buzzword?
- Monitoring Isn't Observability
- Monitoring in the time of Cloud Native
- Principles of Monitoring Microservices
- The Many Ways Your Monitoring Is Lying to You
- GitOps Part 3 - Observability
- Want to Debug Latency?
- Debugging Latency in Go 1.11
- Alerting on SLOs like Pros
- Applied Alerting Philosophy
- Observations on Observability
- Deploys: It's Not Actually About Fridays
- Site Reliability Engineering Best Practices for Data Pipelines
- Elastic Observability in SRE and Incident Response
- Error Budget Policy - Part 1 - Adoption at Expedia Group
- Error Budget Policy - Part 2 - Practices at Expedia Group
On-Call
- Being an On-Call Engineer: A Google SRE Perspective
- Inside Atlassian: how our site reliability engineers do incident management
- Inside Atlassian: how IT & SRE use ChatOps to run incident management
- Incident Response at Heroku
- Who's On Call?
- SysAdvent - Day 6 - No More On-Call Martyrs
- On Being On Call
- The On-Call Handbook
- Incident management at Google - adventures in SRE-land
- Run Book / Operations Manual template
- Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
- Project STAR*: Streamlining Our On-Call Process
- SRE@Xero: Managing Incidents Part I
- SRE@Xero: Managing Incidents Part II
- How To Establish a High Severity Incident Management Program
- How Your Systems Keep Running Day After Day - John Allspaw
- On-call doesn't have to suck
- Why, as a Netflix infrastructure manager, am I on call?
- Oncall and Sustainable Software Development
- On Call Rotations: How Best to Wake Devs Up in the Middle of the Night
- Understanding The Role Of The Incident Manager On-Call (IMOC)
- 3 Ways to Minimize the Impact of High Severity Incidents
- Advice to Management Teams While Enrolling Changes to On-Call Systems
- Moving Past Shallow Incident Data
- Sustainable On-Call
- dotScale 2017 - Aish Raj Dahal - Chaos management during a major incident
- Incident Management at Netflix Velocity
- Incidents, fixes, and the day after
- 10 Steps to Develop an Incident Response Plan You'll ACTUALLY Use
- Checklists: a stupidly simple but valuable operational gift
- How to write a status page update
- Atlassian Incident Handbook
- PagerDuty Incident Response Handbook
- Avoiding Burnout for SREs
- Better On-Call the SRE way
- Managing Incidents at Monzo
- Making On-Call Not Suck
- How we (Monzo) respond to incidents
- How we've evolved on-call at Monzo
- Code Yellow: When Operations Isn't Perfect
- MTTR is dead, long live CIRT
- Extended Dreyfus Model for Incident Lifecycles
- Inhumanity of Root Cause Analysis
- Incident insights from NASA, NTSB, and the CDC
- How to avoid On-Call Burnout the SRE Way
- My week shadowing a GitLab Site Reliability Engineer
- How our production team runs the weekly on-call handover
- Writing Runbook Documentation When You're An SRE
- Incident response, programs and you(r startup)
- An Incident Command Training Handbook
- Shrinking the time to mitigate production incidents
- Incident writeup as sociological storytelling
- Elephant in the Blameless War Room: Accountability
- Naming names in incident writeups
- Building On-Call Culture at GitHub
Post-Mortem
- A collection of post-mortems
- Collection of Kubernetes Failure Stories
- Blameless PostMortems and a Just Culture
- A Tale of Postmortems
- Building a Blameless Post-Mortem Culture with Jason Hand
- The infinite hows
- Failure is Always An Option: How a Blameless Culture Leads to Better Results
- SysAdvent - Day 1 - Why You Need a Postmortem Process
- Etsy's Debriefing Facilitation Guide for Blameless Postmortems
- Writing Your First Postmortem
- How to Write Great Outage Post-Mortems
- A collection of postmortem templates
- Embracing Feedback
- Postmortem Action Items: Plan the Work and Work the Plan
- Social Issues In Postmortems
- Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant
- Postmortem culture: how you can learn from failure
- re:Work - Postmortem discussion template
- Post-mortems to the rescue
- Why Every Company Can Benefit from a Blameless Culture
- "It's dead, Jim": How we write an incident postmortem
- Our incident postmortem template
- Learn out of mistakes. Postmortems to the rescue.
- Improving Postmortem Practices with Veteran Google SRE, Steve McGhee
- Inhumanity of Root Cause Analysis
Capacity Planning
- Capacity Planning
- SouthBay SRE: Cloud Capacity Planning
- Intent-based Capacity Planning and Autoscaling with Kubernetes
- How do you do Capacity Planning
- How Back Market SREs prepared for Black Friday
Service Level Agreement
- If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues
- Service Level Agreements in the Cloud: Who cares?
- SysAdvent - Day 20 - How to set and monitor SLAs
- SLOs, SLIs, SLAs, oh my - CRE life lessons
- Service Levels and Error Budgets
- (Un)Reliability Budgets - Finding Balance between Innovation and Reliability
- The Calculus of Service Availability
- Availability Calculator: Calculate how much downtime should be permitted in your SLA
- Standardize cloud SLA availability with numerical performance data
- Best practices to develop SLAs for cloud computing
- A Practical Guide to SLAs
- Building good SLOs - CRE life lessons
- No Grumpy Humans and Other Site Reliability Engineering Lessons from Google
- Consequences of SLO violations - CRE life lessons
- Service Level Objectives in Practice
- SRE Consensus Building
- An example escalation policy - CRE life lessons
- Error Budget Calculator
- Understanding error budget overspend - part one - CRE life lessons
- Good housekeeping for error budgets - part two - CRE life lessons
- SRE fundamentals: SLIs, SLAs and SLOs
- SLOs & You: A Guide To Service Level Objectives
- Earning Our Wings: Stories and Findings From Operating a Large-scale Concourse Deployment
- Nines are Not Enough: Meaningful Metrics for Clouds
- How many nines is my storage system?
- Don't follow the sun.
- The Tyranny of the SLA
- Backblaze Durability is 99.999999999% - And Why It Doesn't Matter
- DevOpsDays Chicago 2019 - The Art of SLOs
- The Art of SLOs Workshop Materials
- How to Include Latency in SLO-Based Alerting
- Succeeding With Service Level Objectives
- Putting customers first with SLIs and SLOs
- SRE Leadership: Have Tiered SLAs
- How SLOs Enable Fast, Reliable Application Delivery
- The Tail at Scale
- The Tail at Scale Revisited
- Defining SLOs for services with dependencies
- Service Level Disagreements
- How We Use Sloth to do SLO Monitoring and Alerting with Prometheus
- SLI Deep Dive
- Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox
- SLO tracker
- SLO Alerting for Mortals
- SRE methods and climate change
- What made SLOs so messy (and what we can do about it)
- SLICK: Adopting SLOs for improved reliability
- Calculating composite SLA
- Best practices for setting SLOs and SLIs for modern, complex systems
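Several entries in this section (the availability and error budget calculators, and the composite SLA article) rest on the same simple arithmetic. The following is a minimal Python sketch of that arithmetic, not code taken from any of the linked resources: the function names are hypothetical, a year is assumed to be 365 days, and the composite calculation assumes services that depend on each other in series and fail independently.

```python
# Illustrative sketch only; names and defaults are assumptions, not an API
# from any of the resources listed above.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime (error budget) permitted over a period.

    availability_percent is the target, e.g. 99.9 for "three nines".
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


def composite_availability(*availabilities_percent: float) -> float:
    """Return the combined availability of services used in series.

    Assumes every service must be up for the whole to be up, and that
    failures are independent, so the individual probabilities multiply.
    """
    combined = 1.0
    for availability in availabilities_percent:
        combined *= availability / 100
    return combined * 100


if __name__ == "__main__":
    yearly = allowed_downtime_seconds(99.9)
    print(f"99.9% over a year allows about {yearly / 3600:.2f} hours of downtime")
    print(f"Two 99.9% services in series: {composite_availability(99.9, 99.9):.4f}%")
```

Under these assumptions, a 99.9% target allows roughly 31,536 seconds (about 8.76 hours) of downtime per year, and chaining two 99.9% services yields roughly 99.8%, which is why dependencies matter when setting an SLO.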
Performance
- Performance Checklists for SREs
- South Bay SRE Meetup - Netflix Cloud Performance Team
- Software Performance Analysis Guided By SLOs
- A framework for pragmatic performance engineering
Programming
- Go Language for Ops and Site Reliability Engineering
- Go for SREs using Python
- Operability in Go
- Go Reliability and Durability at Dropbox
Misc Articles
- What is SRE (Site Reliability Engineering)?
- Here's How Google Makes Sure It (Almost) Never Goes Down
- Are site reliability engineers the next data scientists?
- Site Reliability Engineers: "solving the most interesting problems"
- Site Reliability Engineers: the "world's most intense pit crew"
- Site reliability engineering kicks rote tasks out of IT ops
- Notes on Site Reliability Engineering
- Adventures in SRE-land: Welcome to Google Mission Control
- Book Review: Site Reliability Engineering - How Google Runs Production Systems
- Site Reliability Engineers: "We solve cooler problems"
- SREcon17: Brave new world of site reliability engineering
- Open AWS guide
- Commentary on Site Reliability Engineering
- Site Reliability Engineering: 4 Things to Know
- Looking for SRE Success? Then Find the Intrapreneurs!
- What Team Structure is Right for DevOps to Flourish?
- Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency
- Building blameless working environment
- SRE Adoption Report
- SREs: The Happiest - and Highest Paid - in the Industry
- The Role of Site Reliability Engineering, Today and Tomorrow
- SRE as a Lifestyle Choice
- SRECon EMEA 2019 Recap
- Life of an SRE at Google - JC van Winkel
- Site Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa - Case study: Halodoc adaptation of SRE principles for Native Mobile Apps
- SRE Best Practices by InfraCloud
Real-time Messaging
- #sre channel at Hangops Slack - Discussion of Site Reliability Engineering generally.
- #incident_response channel at Hangops Slack - Discussion about Incident Response.
- USENIX SREcon Slack
Blogs
- Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
- Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
- High Scalability - Technical Blog Posts About Systems Architecture.
- rachelbythebay - Technical Blog Posts.
- Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
- SysAdvent - One article for each day of December, ending on the 25th article.
- Stephen Thorne's Blog - Blog Posts About SRE
- Increment - A digital magazine about how teams build and operate software systems at scale.
- GopherSRE - Blog Posts about Go and SRE.
- Cindy Sridharan - Blog posts about distributed systems and their management.
- Blameless Blog - Blog posts about SRE culture and practices.
- Resilience Roundup - Weekly analysis of Resilience Engineering and Human Factors research designed for software systems
- Squadcast Blog - Blog posts about SRE best practices, reliability, on-call and incident management.
- FireHydrant Blog - Posts about complex systems, incident response, and SRE best practices.
- Rootly Blog - Incident management best practices and guides.
- incident.io Blog - Guides, advice and resources on incident management and response.
- Logit.io Blog - Resources on log management, SRE and DevOps.
Newsletters
- DevOpsLinks - A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.
- KubeWeekly - The weekly newsletter for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas.
- SRE Weekly - Weekly Site Reliability Newsletter.
- O'Reilly Systems Engineering and Operations Newsletter - Weekly systems engineering and operations news and insights from industry insiders.
- ChaosEngineering.news - Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!
- Monitoring Weekly - What's new in monitoring? Curated monitoring articles to your inbox each week.
- Observability news - Updates around observability (o11y) with a special focus on open source.
Conferences & Meetups
- SRECon Conferences - The Official SRE Conference.
- LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
- SRE Tech Talks - SRE Talks Hosted by Google.
- South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
- San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
- Site Reliability Engineering Munich, Germany - SRE Meetup in the greater area of Oktoberfest city.
- ADDO - All Day DevOps - A 24 hour conference that is completely online and free.
- Site Reliability Engineering Paris, France - SRE Meetup in the city of light.
- Site Reliability Engineering India - SRE Meetup India
Twitter
- Google SRE Twitter Account - Google's SRE Twitter Account.
- SREBook - The Official Twitter Account of Site Reliability Engineering Book.
- SREcon - SRECon's Official Twitter Account.
- SREWorkbook - The Official Twitter Account of Site Reliability Workbook.
- The SRE Dev - SRE-related Posts from dev.to.
- Twitter SRE - The Official Twitter Account of Twitter's SRE team.
- Twitter SRE Weekly - The Official Twitter Account of SRE Weekly Newsletter.
- USENIX Association - The Official USENIX Twitter Account.
SRE Tools
- Awesome SRE Tools - A curated list of Site Reliability and Production Engineering tools
- List of Continuous Integration services
- SRE cheat sheet - A cheat sheet for Site Reliability Engineering principles and numbers
Podcasts