Staff Site Reliability Engineer – Observability

About the Role

Title: Staff Site Reliability Engineer – Observability

Location: IL-Springfield

Fastly helps people stay better connected with the things they love. Fastly’s edge cloud platform enables customers to create great digital experiences quickly, securely, and reliably by processing, serving, and securing our customers’ applications as close to their end-users as possible — at the edge of the Internet. The platform is designed to take advantage of the modern internet, to be programmable, and to support agile software development. Fastly’s customers include many of the world’s most prominent companies, including Vimeo, Pinterest, The New York Times, and GitHub.

We’re building a more trustworthy Internet. Come join us.

Fastly’s Observability team is looking for a Staff Site Reliability Engineer who is passionate about building, scaling, and automating our internal platforms to provide global visibility into the health and performance of our networks. You will work alongside other engineering and support teams to provide insights and recommendations on how we make our services and software stacks more observable. Your focus on logging, metrics, distributed tracing, and monitoring will be vital in helping Fastly grow our observability platforms.

What You’ll Do:

Focus on improving and scaling our logging pipelines, telemetry collection, and monitoring systems

Improve the performance and reliability of the observability platform infrastructure
Create and instrument critical business metrics for insights and transparency
Collaborate with other Fastly engineers to implement solutions that deliver value for our internal customer teams

Participate in incident reviews to build improved alerts for detection and potential proactive mitigations

What We’re Looking For:

Extensive experience scaling out Prometheus architecture, i.e., you are not just a user of Prometheus but have actually built the underlying infrastructure
Comfortable working with tools like OpenTelemetry, Grafana, Loki, Tempo, and Mimir

Extensive experience working with Linux operating systems, with a focus on metric collection and instrumentation
Experience implementing and scaling observability pipelines using self-managed, on-premises, and open-source software
Experience developing automation and orchestration, and writing infrastructure as code for platform management

Comfortable working with scripting and interpreted languages, and with test-driven development
Excellent communication and listening skills, as well as a high degree of emotional intelligence

We’ll be super impressed if you have experience in any of these:

Deep understanding of the challenges of high cardinality, churn, and data volumes, and the ability to anticipate capacity needs
A track record of working across multiple cloud platforms and physical environments to provide global visibility
Experience working with ClickHouse for time series data
Development of metrics exporters for the Prometheus ecosystem

