Senior DevOps / Site Reliability Engineer (SRE) - Hasura Cloud vacancy at Hasura
We are a globally distributed team, with offices in San Francisco & Bangalore.
General overview of the role
DevOps Engineers and Site Reliability Engineers (SREs) are responsible for keeping Hasura Cloud systems running smoothly and making sure updates can be rolled out reliably without any downtime.
Hasura Cloud is a unique GraphQL infrastructure product where we try to abstract away all the raw compute concepts from the user. Users should only need to care about the limits they set, based on the concurrent requests and latencies they can afford, and the system should accommodate and scale to those requirements.
Be on a PagerDuty rotation to respond to Hasura Cloud availability incidents and provide support for service engineers with customer incidents.
Use your on-call time to be on the front line: respond to incidents, and take action to fulfil our SLOs. Use your dev time to address the systemic issues you’ve identified, to proactively prevent incidents from happening.
Run our infrastructure with Terraform, Kubernetes, VMs and bare metal instances.
Design smart monitoring that alerts on symptoms (our SLIs) rather than on causes, to make each alert meaningful and actionable.
Document every action so your findings turn into repeatable actions, and then into automation.
Improve the deployment process to make it as boring as possible.
Design, build and maintain core infrastructure pieces that allow Hasura Cloud to scale to support thousands of concurrent requests from our users.
Debug production issues across services and levels of the stack.
Expand Hasura Cloud to support multiple Cloud providers.
Plan the growth of Hasura Cloud's infrastructure.
Think about systems: edge cases, failure modes, behaviors, specific implementations.
Know your way around Linux and the Unix Shell.
Know what declarative infrastructure tools like Terraform are used for.
Have strong programming skills (Go/Python).
Have an urge to collaborate and communicate asynchronously.
Have an urge to document all the things so you don't need to learn the same thing twice.
Have an urge to build automation and tooling so that you never have to do the same work twice.
Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
Have an urge to deliver quickly and iterate fast.
Have experience with Nginx, OpenResty, Docker, Kubernetes, Terraform, or similar technologies.
Have experience with various Cloud providers like AWS, GCP, Azure, DigitalOcean, etc., and their systems, products and APIs.
Have experience with monitoring tools like Datadog/Prometheus/Grafana.
Skills considered a plus
Have experience with Hasura and its GraphQL APIs.
Have strong fundamentals in SQL, particularly with PostgreSQL.
Have experience with database management and scaling.
This role is fully remote, and we hire in most countries. If you're applying from the US, we hire remotely in these 10 states: Illinois, Virginia, California, Washington State, Maryland, Florida, Colorado, Massachusetts, Oregon, and New York. Alternatively, this role can be based out of our office in Bangalore, India.