Start date: ASAP
Duration: 6 Months
Location: 2 days per week in Portsmouth office
Rate: £450 - £488 per day OUTSIDE IR35
SECURITY CLEARANCE REQUIRED
Summary:
As a Site Reliability Engineer (SRE), you will support the reliability, availability, performance, and security of a shared Platform as a Service (PaaS) used by multiple delivery teams. Operating at SFIA Level 4 (Enable), you will apply established SRE practices to ensure platform stability, automate operational tasks, and improve service resilience.
You will work closely with platform engineers, developers, security, and live service teams to support the safe, efficient, and timely delivery of digital services in line with DDaT and government standards.
Responsibilities:
Service reliability & operations:
- Maintain and improve the availability, reliability, and performance of the PaaS.
- Support live services, including incident response, investigation, and resolution, following agreed runbooks and escalation paths.
- Participate in on-call rotas and contribute to post-incident reviews (PIRs), identifying root causes and improvement actions. Nominal service hours are 09:00 to 17:00 on weekdays. Out-of-hours support is rare and pre-arranged as required.
- Monitor platform health using logs, metrics, and alerts, proactively identifying and resolving issues.
- Automate repeatable operational tasks to reduce toil and improve platform reliability.
- Contribute to infrastructure and configuration management using Infrastructure as Code (IaC) approaches.
- Support continuous improvement of operational processes, reliability patterns, and resilience practices.
- Support development teams consuming the PaaS, helping them adopt platform standards and reliability best practices.
- Work with security and compliance teams to ensure the platform meets government security, resilience, and audit requirements (JSP453).
- Contribute to platform documentation, runbooks, and knowledge sharing.
- Collaborate within multidisciplinary teams using agile and DevOps practices.
- Support safe deployment and release processes, including monitoring changes in live environments.
- Assist with capacity planning and performance testing activities.
- Ensure changes are implemented in line with change management and live service standards.
Skills and experience:
- Live service operations & incident management experience
- Strong automation & scripting capability
- Kubernetes (K8s) and cloud compute platform (e.g. AWS) experience
- Experience supporting live digital services in a production environment.
- Practical knowledge of cloud platforms and PaaS concepts (e.g. managed compute, networking, storage, CI/CD).
- Experience with container platforms (e.g. Kubernetes) or managed PaaS offerings.
- Experience with monitoring, logging, and alerting tools (e.g. Prometheus, Grafana, Elastic).
- Ability to diagnose and resolve technical issues using established processes and tooling.
- Experience writing scripts or automation using languages such as Python, Bash, or similar.
- Understanding of reliability engineering concepts, including incident management, resilience, and failure modes.
- Ability to work independently on defined tasks and contribute effectively within a team.
- Experience using Infrastructure as Code tools (e.g. Terraform, CloudFormation).
- Experience working in a government or regulated/secure environment.
- Familiarity with SRE practices such as error budgets and blameless post-incident reviews.
- Knowledge of security and compliance controls relevant to live services.
- Experience using Jira and the wider Atlassian suite (e.g. Confluence).