A site reliability engineer (SRE) spends up to 50% of their time on "ops" work such as handling issues, on-call duty, and manual intervention. Because the software system an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling, or automation. The ideal SRE candidate is either a software engineer with a good administration background or a highly skilled system administrator with knowledge of coding and automation.

As an SRE in NE Digital, you will drive initiatives to improve the automation, scalability, and reliability of our core services such as Fairprice Online, Scan&Go, Identity, my first school, and many more. As a member of the NTUC Enterprise Center of Excellence, you will be exposed to the latest technologies, including AWS Cloud, Google Cloud Platform, Kubernetes, Kubeflow, ML/AI, and Big Data, in a hybrid/multi-cloud environment. We are strong believers in DevSecOps, SRE, Agile, and FinOps.

- Work with release engineers to ensure that the software delivery pipeline is as efficient as possible.
- Collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability.
- Take responsibility for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Document “tribal” knowledge.
- Bachelor's degree in Computer Science, a related technical field involving systems engineering, or equivalent practical experience.
- Experience in Unix/Linux and/or Windows operating systems.
- Experience in analyzing and troubleshooting systems.
- Understanding of infrastructure monitoring, logging, alerting, release, and configuration management.
- Understanding of networking (e.g., TCP/IP, routing, network topology, load balancers, DNS, NTP).
- Experience in one of the following: Python, Go, Perl, Ruby, or shell scripting.
- Experience in public cloud (AWS and/or GCP).
- Experience maintaining Internet-facing production-grade applications.
- Experience with software deployment and/or orchestration technologies, e.g., Puppet, Chef, Salt, Ansible, Docker, Kubernetes, Terraform.
- Experience in CI/CD (e.g., JIRA, Git, Jenkins, Nexus, ...).
- Experience in standard IT security practices (e.g., encryption, certificates, key management).
- Excellent communication and problem-solving skills with strong attention to detail.
- Flexibility to work non-business hours that may include weekends and/or holidays.
- Self-starter who is able to identify and perform tasks with minimal supervision.
- Experience with GSuite apps (Gmail, Gsheet, Gdoc, ...).
1 MARINA BOULEVARD ONE MARINA BOULEVARD, 018989