SITE RELIABILITY ENGINEER
NTUC ENTERPRISE NEXUS CO-OPERATIVE LIMITED
A site reliability engineer (SRE) will spend up to 50% of their time on "ops"-related work such as handling issues, on-call duty, and manual intervention. Since the software system that an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling, or automation. The ideal site reliability engineer candidate is either a software engineer with a strong system administration background or a highly skilled system administrator with knowledge of coding and automation.

As an SRE in NE Digital, you will drive initiatives to improve the automation, scalability, and reliability of our core services such as Fairprice Online, Scan&Go, Identity, My First Skool, and many more. As a member of the NTUC Enterprise Center of Excellence, you will be exposed to the latest technologies, including AWS Cloud, Google Cloud Platform, Kubernetes, Kubeflow, ML/AI, and Big Data, in a hybrid/multi-cloud environment. We are strong believers in DevSecOps, SRE, Agile, and FinOps.

Work with release engineers to ensure that the software delivery pipeline is as efficient as possible.
Collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability.
Take responsibility for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.
Document “tribal” knowledge.
Work Location (MRT)