Senior Site Reliability Engineer
Olo powers digital ordering and delivery programs that connect restaurant brands to the on-demand world, placing orders directly into the restaurant through all order origination points, from a brand’s own website or app to third-party marketplaces, social media platforms, smart speakers, and home assistants. Olo serves as the on-demand ordering and delivery platform for over 300 brands, such as Applebee’s, Checkers & Rally’s, Cheesecake Factory, Chili’s, Dairy Queen, Denny’s, Five Guys Burgers & Fries, Jamba Juice, Noodles & Company, Portillo’s Hot Dogs, Shake Shack, sweetgreen, Wingstop, and more. Learn more at www.olo.com.
Olo's headquarters is located on the 82nd floor of One World Trade Center. We offer great benefits, such as 20 days of Paid Time Off, fully paid health, dental and vision care premiums, stock options, a generous parental leave plan, and perks like FitBits, rotating craft beers on tap in our kitchen, and food events featuring our clients' menu items (now you know why we give out FitBits!). Check out our culture map: https://www.olo.com/images/culture.jpg.
General overview of the role
Are you passionate about building highly available, performant and scalable web applications? We are looking for an ambitious Senior Site Reliability Engineer to join our team of software and infrastructure engineers. Olo is experiencing tremendous growth, and Reliability at Scale has become our key mantra. As we enhance our platform to support the increased demand, it must be positioned for continued stability, reliability and resiliency, even at 10x scale! You will be challenged with complex yet interesting problems, and your passion to succeed will be key.
You will partner with Engineering and Product Managers to continually learn, improve system availability and sharpen our execution skills as we deliver an awesome platform. Your focus will be on helping us improve system reliability while building and maintaining solutions. Your curiosity and passion for learning will help discover new ways for us to improve and deliver the best service to our customers.
At Olo, Site Reliability Engineering is a discipline that combines software and systems engineering to build and run web-scale, distributed, fault-tolerant and performant systems. As an SRE you will ensure that Olo's internal and external applications have reliability and uptime appropriate to end users' needs and a feedback loop focused on improvement while keeping a watchful eye on capacity and performance.
Responsibilities
Take ownership of the entire process, from observability and SLIs/SLOs to Incident Response to postmortems and follow-up actions.
Work to define standards and best practices and help drive those into each team.
Help us implement and tailor our incident response tools in order to minimize outage durations.
Brainstorm, define, and build collaborative monitoring solutions with members across multiple product teams.
Contribute insights across teams to help us improve or re-architect existing systems to support scale, performance and extensibility.
Constantly re-evaluate our observability tooling to improve architecture, knowledge models, user experience, performance and stability.
Analyze and mature our processes around Incident Response, Observability, Postmortems and Predictive Monitoring.
Maintain production services by measuring and monitoring availability, latency and overall system health.
Influence an engineering culture of reliability, observability, and availability.
Strive to coach and mentor engineering teams through game days, SRE boot camps and other training and feedback channels.
Requirements
Strong experience with monitoring systems like Datadog, Sumo Logic, Raygun, New Relic or similar.
Fluency in at least one Incident Management tool such as FireHydrant, OpsGenie, PagerDuty, VictorOps or similar.
Some past experience with build and deploy tools such as Jenkins, TeamCity, Octopus, CircleCI, etc.
You've been in the trenches building highly scalable, efficient, and resilient systems.
Prior hands-on software development experience highly desired.
Self-starter who can take high-level direction and organize the work needed to achieve its objectives.
Highly motivated individual with a curiosity to learn as you grow.
Legally able to work in the U.S.
Willing to roll up your sleeves, work hard and be scrappy!
Skills considered a plus
Experience with Ansible, Terraform or other Infrastructure-as-Code tools.
Experience with containers and container orchestration frameworks.
Expertise in guiding Incident Response, in terms of both process and tooling.
How to apply
Olo is an equal opportunity employer and diversity is highly valued at our company. All applicants receive consideration for employment. We do not discriminate on the basis of race, religion, color, national origin, gender identity, sexual orientation, pregnancy, age, marital status, veteran status, or disability status.
If you like what you read, hear, and/or know about Olo, and want to be a part of our team, please do not hesitate to apply! We are excited to hear from you!