With great compassion for our fellow engineers and end-users, we manage the balance between speed of innovation and reliability of end-user experience.
Expectations / Description
Key Results / Outcomes
- Reduce Toil
  - Automate repetitive tasks

- Engineering
  - Create features / enhancements / bug fixes to improve the reliability and operability of the system

- Manage and Improve Incident Response
  - Help teams identify alerts that can lead to early remediation of potential issues while keeping noise minimal
  - Reduce downtime
  - Reduce Mean Time to Recovery (MTTR)

- Enable Observability of Operational Systems
  - Establish goals for latency, traffic, errors, and saturation
  - Provide feedback on these metrics to value-stream teams
  - Help value-stream teams think about how their daily work impacts system reliability

- Keep the lights on / Necessary toil activity
  - At the end of the day, some toil is necessary to keep things running
  - Attend to incidents
  - Monitor, debug, and enhance the release process
  - Check backups
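
As a rough illustration of two of the outcomes above, the sketch below computes MTTR from a list of incidents and checks observed values against goals for latency, traffic, errors, and saturation. Everything named here is an assumption made for the example: the `Incident` record, the function names, the signal keys, and the goal values are hypothetical and are not taken from any real Lessonly system. Each goal is treated as a simple upper bound, which is only one possible way to express it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started_at: datetime   # when the outage or degradation began
    resolved_at: datetime  # when service was restored


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average time from start to recovery across the given incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta(0))
    return total / len(incidents)


# Hypothetical goals: each value is an upper bound for its signal.
# Traffic is treated here as a capacity ceiling, which is an assumption.
GOALS = {
    "latency_p95_ms": 300.0,
    "traffic_rps": 500.0,
    "error_rate": 0.01,
    "saturation": 0.80,
}


def missed_goals(observed: dict[str, float],
                 goals: dict[str, float] = GOALS) -> list[str]:
    """Return the sorted names of signals whose observed value exceeds its goal."""
    return sorted(
        name for name, limit in goals.items()
        if observed.get(name, 0.0) > limit
    )
```

Comparing `mean_time_to_recovery` before and after a change gives a concrete way to tell whether recovery is actually getting faster, and the output of `missed_goals` is the kind of feedback that could be shared with value-stream teams.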
 
 
Requirements
- Site Reliability Engineers must have a position with a reach of 2.2 or higher
 - --No abilities are required--
 
Configuration Health
- 🛑 Has no Abilities
 - ✅ Is a part of 2 Positions
 - ✅ Has been referenced in 2 pieces of public recognition
 - ℹ️ No one has reacted to this Assignment
 - ℹ️ No one has an official rating on this Assignment
 
- ⛔️ Last updated: about 5 years ago
 - ℹ️ Never conversed about
 
Examples / Observations
Observation created almost 5 years ago
Stephen publicly highlighted the work that Brittany and Matt B. did to keep queue size down when reindexing all Lessons: https://lessonly.slack.com/archives/G8Q5B0EVA/p1607543001201000 This shout-out not only encourages those on the receiving end to continue being good stewards of our systems, but also gives all developers an example to emulate. I'm sure not every engineer would have considered the impact of enqueuing any number of jobs. Perhaps now they will, and they might even remember that keeping queue depth to within a few hundred is a good thing, as Stephen pointed out.
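
One way to keep queue depth bounded during a bulk operation such as a reindex is to pause enqueuing whenever the queue is already full enough. The sketch below is a minimal illustration of that idea, not a description of what Brittany and Matt B. actually did. The function name, its parameters, and the default `max_depth` of 300 (standing in for "a few hundred") are all assumptions, and `enqueue` and `queue_depth` are hypothetical callables that a real job system would have to supply.

```python
import time
from typing import Callable, Iterable


def enqueue_throttled(
    jobs: Iterable[object],
    enqueue: Callable[[object], None],
    queue_depth: Callable[[], int],
    max_depth: int = 300,          # assumed ceiling, "a few hundred"
    poll_seconds: float = 1.0,     # how long to wait before re-checking depth
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Enqueue jobs one at a time, waiting while the queue is at or above max_depth.

    Returns the number of jobs enqueued.
    """
    count = 0
    for job in jobs:
        # Back off until workers have drained the queue below the ceiling.
        while queue_depth() >= max_depth:
            sleep(poll_seconds)
        enqueue(job)
        count += 1
    return count
```

Passing `sleep` as a parameter keeps the function easy to exercise in tests without real waiting. Note that the loop waits indefinitely if the queue never drains, so a real version would likely want a timeout.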
Observation created over 6 years ago
Featuring: Raphael A., Stephen G.
We don't wait to be told what to do, we take initiative
Back-end engineer
Site reliability engineer
https://lessonly.slack.com/archives/G8Q5B0EVA/p1560261306012400
That Slack thread tells the entire story...
1) See a problem
2) MEASURE the problem in a quantitative way so that you know if you've fixed it
3) Plan a fix for said problem BEFORE it becomes an issue that stakeholders are even aware of
4) Work collaboratively on the problem without interrupting other priorities
5) Look at the measures from step 2
6) Celebrate together
Love nearly everything about this.
Way to go y'all!