Site Reliability Engineer
Overview
Keep production systems fast, available, and observable for millions of users — by writing software that automates operations, runs capacity planning, designs SLOs and error budgets, and owns the on-call rotation for critical services. The role sits between software engineering and operations: you write Go / Python / Rust code, build reliability tooling, design distributed systems for resilience, run incident response, and push back on product launches that risk SLOs. In India, SRE is a premium specialization concentrated at FAANG-IN (Google SRE Bengaluru, Amazon, Microsoft IDC, Netflix India), product unicorns (Razorpay, Flipkart, Swiggy, Dream11, PhonePe, Zerodha), and the GCCs of high-traffic US firms (Uber, LinkedIn, Atlassian, GitHub, Cloudflare, Stripe). The work overlaps with DevOps but skews more toward software engineering: reliability is a product, not a process. Senior SRE pay in India sits at the very top of the technology bracket, often above equivalent SDE-3 backend roles.
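The SLO and error-budget idea above reduces to simple arithmetic. The following is a minimal sketch, assuming a time-based availability SLO measured over a fixed window; the function names `error_budget` and `budget_remaining` are hypothetical and chosen only for illustration.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime that a time-based availability SLO permits over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is simply the fraction of the window the SLO does not cover.
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime_so_far: timedelta) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, window)
    return 1.0 - downtime_so_far / budget
```

Under these assumptions, a 99.9% SLO over 30 days allows about 43.2 minutes of downtime, and a service that has already been down for 10.8 minutes has roughly 75% of its budget left. Request-based SLOs follow the same shape with request counts in place of time.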
A Day in the Life
Open laptop in Bengaluru / Hyderabad apartment or office; pull main, scan Slack DMs and #sre-pings for overnight conversations from the US team.
Open PagerDuty and Grafana — review overnight pages, error-budget burn dashboards, and any slow-burn alerts that did not page but are trending bad.
Filter coffee or chai; reply to overnight Slack threads from product engineering teams asking about an SLO or a deployment gate.
15-minute standup over Zoom with SRE team — yesterday's pages, today's focus, blockers, on-call status.
Deep-work block 1: 2 hours writing Go code for a controller that auto-sheds non-critical reads on user-service when p99 latency crosses a threshold.
Lunch — South Indian thali or office cafe; informal chat with a senior SRE about whether to self-host Loki or stay on managed Datadog logs.
60-minute design review with product engineering for the new payments service launch — push for canary stages, error-budget burn alerts, and a hard SLO before GA.
Production-readiness review (PRR) checklist for a service hitting GA next sprint — sign off or block on missing observability, load tests, and runbooks.
Pair with junior SRE on writing their first blameless postmortem from yesterday's incident — coach on timeline, root cause, and action items.
Triage a slow burn on checkout-service error budget; identify a regression in a recent deploy; file a ticket; coordinate with the owning team for a rollback.
Deep-work block 2: continue the Go controller, write tests, run it against a staging fork; push the PR.
Push final commits, end-of-day Slack update; if on-call this week, set up phone with PagerDuty + a buddy contact for the night.
Off rotation: log off and have dinner. On rotation: keep your phone within arm's reach; pages are possible at 2 AM IST, especially for services that peak during US business hours.
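The load-shedding controller from the deep-work blocks above can be sketched roughly as follows. The day describes it as Go code; this version uses Python only to show the decision logic. It assumes latencies are fed in from request handling, and every name here (`LoadShedder`, `recover_ratio`, the window size) is a hypothetical choice, not a description of any real controller.

```python
from collections import deque


class LoadShedder:
    """Sheds non-critical reads while the rolling p99 latency is too high."""

    def __init__(self, p99_threshold_ms: float, window_size: int = 1000,
                 recover_ratio: float = 0.8):
        self.threshold = p99_threshold_ms
        # Hysteresis: only stop shedding once p99 is clearly below the threshold,
        # so the controller does not flap when latency hovers near the limit.
        self.recover_below = p99_threshold_ms * recover_ratio
        self.samples = deque(maxlen=window_size)  # oldest samples fall off
        self.shedding = False

    def observe(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p99(self) -> float:
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        # Nearest-rank percentile: rank = ceil(0.99 * n), in integer arithmetic.
        rank = (99 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def update(self) -> bool:
        """Re-evaluate the shedding state; returns True while shedding."""
        current = self.p99()
        if self.shedding:
            if current < self.recover_below:
                self.shedding = False
        elif current > self.threshold:
            self.shedding = True
        return self.shedding

    def allow(self, critical: bool) -> bool:
        """Critical requests always pass; non-critical ones pass unless shedding."""
        return critical or not self.shedding
```

A request handler would call `observe` with each measured latency, call `update` periodically, and reject non-critical reads when `allow(critical=False)` returns `False`. Sorting the window on every call is fine for a sketch; a production controller would more likely read a histogram from the metrics system.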
Common Mistakes
- ⚠️ Trying to jump directly from college into SRE
  Why: Most Indian SRE managers reject candidates without 2-3 years of SDE or DevOps experience because SRE work assumes you can write production code and have operational scars. Direct fresher SRE entry exists only at FAANG-IN via top-college campus pipelines.
  Instead: Start as SDE or DevOps at a product company for 2-3 years; build Linux, Kubernetes, observability, and Go fluency; then switch to SRE-2 at the same or a stronger product company.
- ⚠️ Confusing SRE with DevOps when negotiating titles and comp
  Why: DevOps roles in India pay 10-25% less than SRE roles for similar work; engineers who let companies brand them 'DevOps' lose comp leverage permanently because future recruiters anchor on the title.
  Instead: When negotiating, insist on SRE or Reliability Engineer titles if you'll be on-call for production with SLO ownership; the title matters for next-switch comp.
- ⚠️ Skipping CKA / CKAD / CKS certifications
  Why: Kubernetes certs are unusually load-bearing in SRE hiring because they short-circuit screening; engineers without them lose interview slots to peers with similar skills who have the cert.
  Instead: Earn CKA before your first SRE switch; CKS within 18 months of becoming senior; budget 6-8 weeks evening prep per cert and treat it as required, not optional.
- ⚠️ Avoiding on-call rotations
  Why: SRE promotions track on-call leadership: leading incidents, writing postmortems, owning SLOs. Engineers who duck on-call cap out at SRE-2 because they cannot demonstrate the senior-track skills.
  Instead: Volunteer for on-call from month 6; pair-page with a senior; lead at least 5-10 incidents end-to-end by year 2; treat blameless postmortems as your senior-promotion portfolio.
- ⚠️ Staying in a low-traffic / low-incident environment
  Why: SRE skills compound only at companies with real reliability problems: high traffic, complex dependencies, frequent incidents. A low-traffic SRE role at a smaller company plateaus comp and skill growth within 2 years.
  Instead: By year 3-4, target a high-traffic product company (Razorpay, Flipkart, Swiggy, Dream11, Hotstar) or a global remote SRE role at Cloudflare, GitHub, Stripe; the war stories are what get you to senior.
- ⚠️ Treating Go as optional, sticking only to Python and Bash
  Why: Most production SRE tooling (Kubernetes, Prometheus, Terraform, Argo, etcd) is written in Go; engineers without Go fluency cannot contribute to or extend the platform, capping their impact at script-level automation.
  Instead: Block 3 months to learn Go to a production level; build one Go tool (a custom Prometheus exporter, a webhook controller); contribute one PR to a CNCF project.
- ⚠️ Not building a public profile (blog, GitHub, talks)
  Why: Senior SRE hiring is unusually portfolio-driven: open-source contributions to Prometheus / Kubernetes / OpenTelemetry, blog posts on incident response, and conference talks at SRECon / KubeCon / Rootconf move offers from screening to 'we want them'.
  Instead: Publish one technical blog post per quarter; submit a talk to Rootconf or KubeCon India once a year; contribute one PR to a CNCF project. Engineers without a public profile lose to peers who write.
Salary by Indian City (Mid-level total cash comp)
| City | Range |
|---|---|
| Bangalore | INR 22-35L base + ESOPs |
| Hyderabad | INR 20-32L base + ESOPs |
| Pune | INR 18-28L base + ESOPs |
| NCR (Gurgaon / Noida) | INR 18-28L base + ESOPs |
| Mumbai | INR 18-28L base + ESOPs |
| Remote-international | USD 70-150K (INR 58-125L) all-in |
Communities + forums
- Rootconf India (Conference + YouTube): India's flagship SRE / DevOps / infra conference; annual event in Bengaluru with talks from Razorpay, Flipkart, Swiggy, Hotstar, Cloudflare SRE leaders. Recordings free on YouTube.
- KubeCon + CloudNativeCon India (Conference + YouTube): CNCF's India-region KubeCon; sessions on Kubernetes, observability, service meshes; the canonical SRE / platform conference for Asian-region engineers.
- DevOps Bangalore Meetup (Meetup): Long-running Bengaluru DevOps / SRE meetup; in-person and hybrid sessions; speakers from Razorpay, Flipkart, Cloudflare, Atlassian.
- SRE Weekly (Newsletter): Weekly curated SRE links (incident reports, postmortems, tooling updates); widely subscribed by Indian senior SREs.
- r/sre (Reddit): Global SRE subreddit; daily threads on on-call, observability, Kubernetes, hiring; Indian seniors active.
- CNCF Slack (Slack): Official Cloud Native Computing Foundation Slack; channels for Kubernetes, Prometheus, OpenTelemetry, Argo, Istio; Indian SREs working with these tools active here.
- Kubernetes India (Slack + Meetup): India-focused Kubernetes community; regional chapters in Bengaluru, Hyderabad, Pune, Mumbai; topical talks on K8s operators, multi-cluster, and security.
What to read / watch / follow
- Site Reliability Engineering (the SRE book). Book (free online) by Beyer, Jones, Petoff, Murphy (Google). The canonical SRE reference; required reading for any Indian SRE-2 or above interview at FAANG-IN, Netflix India, Cloudflare; Google has made it free online.
- The Site Reliability Workbook. Book (free online) by Beyer, Murphy, Rensin, Kawahara, Thorne (Google). Practical companion to the SRE book; covers SLO design, error budgets, on-call practices; widely cited in Indian SRE interview discussions.
- Designing Data-Intensive Applications. Book by Martin Kleppmann. The canonical distributed systems reference; required for any senior SRE system-design interview at FAANG-IN, Netflix, Cloudflare.
- Brendan Gregg's blog + USE method. Blog + Book by Brendan Gregg. Deep Linux performance content (USE method, flamegraphs, eBPF) by the world's most-cited Linux perf engineer; required reading for Linux troubleshooting interviews.
- Kubernetes Up and Running. Book by Hightower, Burns, Beda. The default Kubernetes book for Indian SREs preparing for CKA and senior SRE interviews; covers fundamentals and operator patterns.
- SRE Weekly newsletter. Newsletter by Lex Neva. Weekly curated SRE links (incident reports, postmortems, tooling updates); Indian senior SREs subscribe.
- Datadog blog + Honeycomb blog. Blog by Datadog, Honeycomb engineering. High-signal observability content (distributed tracing, SLO design, incident response); widely shared in Indian SRE communities.
- Charity Majors' blog (charity.wtf). Blog by Charity Majors. Sharp opinionated writing on observability, on-call, and engineering culture; influential in Indian senior SRE thinking on team design.
- SREcon talks (USENIX YouTube). YouTube, by USENIX SREcon speakers. Annual SREcon talks from Google, Netflix, Cloudflare SREs; canonical content for senior SRE interview preparation.
- Rootconf India recordings. YouTube, by Rootconf India. Indian-context SRE / DevOps talks from Razorpay, Flipkart, Swiggy, Hotstar engineers; the closest thing to a SREcon equivalent for India-scale problems.
Daily Responsibilities
- Review yesterday's pages, slow burns on error budgets, and Grafana dashboards; pick 2-3 leading-indicator regressions to investigate today.
- Write 2-4 hours of Go or Python code: a new exporter, a load-shedding controller, a runbook automation, or a Kubernetes operator.
- Lead or attend a 30-min incident review or postmortem read-out for an incident from earlier in the week.
- Review 2-4 production-readiness checklists or design docs from product engineering teams; push back on launches that break SLOs.
- Triage a Datadog or Prometheus alert that fired overnight: find the root cause, mitigate, and write a postmortem in Confluence if the incident was user-impacting.
- Update a runbook or onboarding doc; mentor a junior SRE through their first solo on-call shift.
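The "slow burn" checks in the list above are usually expressed as a burn rate: the observed error ratio divided by the ratio the SLO allows. The sketch below assumes a request-based SLO and a two-window check (a long and a short window must both exceed the threshold). The default thresholds of 14.4 and 1.0 are illustrative values in the spirit of commonly cited multi-window guidance, not settings taken from this document, and the function names are hypothetical.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_ratio = 1.0 - slo_target
    return (errors / requests) / allowed_error_ratio


def classify(long_burn: float, short_burn: float,
             page_at: float = 14.4, ticket_at: float = 1.0) -> str:
    """Both windows must agree, which filters out brief spikes and stale alerts."""
    if long_burn >= page_at and short_burn >= page_at:
        return "page"    # fast burn: wake someone up
    if long_burn >= ticket_at and short_burn >= ticket_at:
        return "ticket"  # slow burn: investigate during working hours
    return "ok"
```

With a 99.9% SLO, 20 errors in 1,000 requests is a burn rate of about 20, which would page under these assumed thresholds if both windows show it; 2 errors in 1,000 is a burn rate of about 2, a slow burn that lands as a ticket for the morning review.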
Advantages
- Top-of-bracket pay in Indian tech — senior SRE at FAANG-IN, Netflix India, Cloudflare, Stripe routinely earns ₹60L-1.5Cr+ total comp, often above equivalent SDE-3 backend roles because supply is small and impact is large.
- Strong intellectual range — you write distributed-systems code, run production diagnostics on Linux, design failure modes, build observability platforms; few roles in tech give you this much surface area.
- Genuine remote and hybrid options — most SRE roles at product cos (Razorpay, Cloudflare, GitHub, Stripe) are remote-first; on-call is location-independent; SRE is one of the most remote-friendly senior tech roles in India.
- Compounding skills — Linux internals, networking, distributed systems, observability, and production debugging are durable across companies, languages, and decades; SRE seniors don't get displaced by language fashion.
- Direct, measurable impact — you own crash rate, p99 latency, uptime, and on-call burden; the work shows up in user-facing metrics and revenue, which makes promotions and switches easier to justify.
Challenges
- On-call is intense — most SRE rotations are 1-week-on out of every 4-6 weeks; pages at 2 AM IST are common, especially for services with US-business-hours peak.
- Stress during incidents is genuinely high — production-down events with the CTO and CEO watching demand calm under pressure; not every engineer wants this kind of ownership.
- Hard to break in directly: most SRE roles require 2-4 years of SDE or DevOps experience first; freshers do land direct SRE roles at FAANG-IN, but the competition is intense (often only via campus drives at top colleges).
- Ambiguous boundaries with platform / DevOps teams — at smaller companies you'll be doing all three roles; at larger ones you'll spend 30%+ of the time clarifying who owns what.
- Tooling churn — observability platforms (Prometheus, OpenTelemetry, Honeycomb, Datadog), incident response tools (PagerDuty, FireHydrant, incident.io), and orchestration (K8s, service meshes) all evolve rapidly; staying current is a continuous tax.
Education
- Required (most common): B.Tech / B.E. in Computer Science, IT, or Electronics; this is the dominant route into FAANG-IN SRE (Google Bengaluru, Amazon, Microsoft IDC) and product-unicorn SRE roles.
- Strong alternatives: BCA, MCA, or B.Sc. (Computer Science) — accepted at most product startups; SRE hiring weights demonstrated systems experience and Linux internals heavily, so degree label matters less than what you've actually run.
- Premium signal: degree from IIT, NIT, IIIT, BITS for Google SRE India, Netflix, Cloudflare, Stripe, Atlassian — these companies hire on a brutal mix of DSA + Linux/networking + systems design at a high bar.
- Self-taught + portfolio: feasible with a strong path — 2-3 years of SDE/DevOps experience, deep Linux + networking knowledge, a public homelab or open-source contributions to observability tooling (Prometheus, Grafana, Loki, OpenTelemetry), CKA cert, and visible blog/conference content.
- Bootcamps: rarer for SRE specifically; KodeKloud, Linux Foundation, A Cloud Guru cover relevant tooling but real SRE depth comes from running production systems, not bootcamps.