Operations Assignment: Site reliability engineer

With great compassion for our fellow engineers and end-users, we manage the balance between speed of innovation and reliability of end-user experience.


Expectations / Description

Key Results / Outcomes

  • Reduce Toil
    • Automate repetitive tasks
  • Engineering
    • Create features / enhancements / bug fixes to improve the reliability and operability of the system
  • Manage and Improve Incident Response
    • Help teams identify alerts that can lead to early remediation of potential issues while keeping noise minimal
    • Reduce downtime
    • Reduce Mean Time to Recovery (MTTR)
  • Enable Observability of Operational Systems
    • Establish goals for latency, traffic, errors, and saturation
    • Provide feedback on these metrics to value-stream teams
    • Help value-stream teams think about how their daily work impacts system reliability
  • Keep the lights on / Necessary toil activity
    • At the end of the day some toil is necessary to keep things running
    • Attend to incidents
    • Monitor, debug, and enhance the release process
    • Check backups
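
To make the MTTR outcome above concrete, here is a minimal sketch of how it could be computed. It assumes each incident is recorded as a start time and a recovery time; the `Incident` type and the `mean_time_to_recovery` function are hypothetical names for illustration, not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    # Hypothetical record: when the outage began and when service was restored.
    started_at: datetime
    recovered_at: datetime


def mean_time_to_recovery(incidents: Sequence[Incident]) -> timedelta:
    """Return the average of (recovered_at - started_at) across incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    total = sum(
        (incident.recovered_at - incident.started_at for incident in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)
```

Tracking this value per month or per quarter is one way to see whether incident response is actually improving.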

Requirements

  • Site Reliability Engineers must have a position with a reach of 2.2 or higher
  • --No abilities are required--

Configuration Health

  • 🛑 Has no Abilities
  • ✅ Is a part of 2 Positions
  • ✅ Has been referenced in 2 pieces of public recognition
  • ℹ️ No one has reacted to this Assignment
  • ℹ️ No one has an official rating on this Assignment
  • ⛔️ Last updated: about 5 years ago
  • ℹ️ Never conversed about

Examples / Observations

  Observation created almost 5 years ago

Stephen publicly highlighted the work that Brittany and Matt B. did to keep queue size down when reindexing all Lessons (https://lessonly.slack.com/archives/G8Q5B0EVA/p1607543001201000). This shout-out not only encourages those on the receiving end to continue being good stewards of our systems, but also gives all developers an example to emulate. I'm sure not every engineer would have considered the impact of enqueuing any number of jobs. Perhaps now they will, and might even remember that keeping queue depth to within a few hundred is a good thing, as Stephen pointed out.
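
One way to keep queue depth bounded during a bulk job like a reindex is to pause enqueuing whenever the queue is too deep. The sketch below is only an illustration of that idea: the `enqueue` and `queue_depth` callables stand in for whatever the job system provides, and the default `max_depth` of 300 is an assumed value for "a few hundred", not a number from the thread.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def enqueue_throttled(
    jobs: Iterable[T],
    enqueue: Callable[[T], None],      # hypothetical: pushes one job onto the queue
    queue_depth: Callable[[], int],    # hypothetical: returns the current queue size
    max_depth: int = 300,              # assumed threshold for "a few hundred"
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Enqueue jobs one at a time, waiting whenever the queue is at max_depth."""
    enqueued = 0
    for job in jobs:
        # Let workers drain the queue before adding more work.
        while queue_depth() >= max_depth:
            sleep(poll_seconds)
        enqueue(job)
        enqueued += 1
    return enqueued
```

The `sleep` parameter is injectable so the behavior can be exercised in tests without real delays.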

  Observation created over 6 years ago

https://lessonly.slack.com/archives/G8Q5B0EVA/p1560261306012400

That Slack thread tells the entire story...

1) See a problem
2) MEASURE the problem in a quantitative way so that you know if you've fixed it
3) Plan a fix for said problem BEFORE it becomes an issue that stakeholders are even aware of
4) Work collaboratively on the problem without interrupting other priorities
5) Look at the measures from step 2 to confirm the fix worked
6) Celebrate together

Love nearly everything about this.

Way to go y'all!
