
Site Reliability Engineer
- Greece
- Permanent
- Full-time
- Build software that enhances Paymentology services' scalability and reliability.
- Ensure platform services meet required uptime and service quality levels.
- Contribute to the design of reliable cloud infrastructure and implement reusable cloud-uptime components as code.
- Regularly review and optimise SRE practices, tools, and methodologies to enhance overall system reliability and team efficiency.
- Contribute to the design, implementation, and maintenance of observability and monitoring solutions that track the platform's health, cost-effectiveness, reliability, and scalability, and that identify potential issues to feed back to product and platform engineering in a continuous improvement loop.
- Develop and implement automation scripts and tools to streamline operations and reduce manual interventions.
- Enable product teams to self-serve by participating in the development of a developer platform.
- Play an active role with the incident response teams, diagnosing and resolving production issues quickly to minimise downtime.
- Support product teams in building services that adhere to our security and quality standards.
- Work closely with engineering, operations, and product teams to ensure reliability is considered throughout the end-to-end software development lifecycle. We seek to achieve this through advocacy and by developing a culture of reliability.
- Strong understanding of cloud networking principles.
- Proficiency with leading monitoring tools, such as Datadog, Honeycomb.io, Splunk, Prometheus, Grafana, ELK Stack, and New Relic.
- Programming expertise, especially in systems programming languages and databases.
- Familiarity with at least one industry-leading CI/CD tool, such as Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline, CircleCI, or ArgoCD.
- Proven track record of achieving platform-level and end-to-end SLIs, SLOs, and SLAs, and of fostering accountability.
- Ability to navigate complex situations and lead effective post-incident reviews (PIRs).
- Knowledge of implementing solutions to reduce Mean Time to Identify (MTTI) and Mean Time to Resolve (MTTR).
- Comprehensive understanding of large-scale distributed platform architecture.
- Expertise in implementing best practices for load balancing, fault tolerance, and resource allocation to maintain service quality and efficiency at scale.
- Understanding of security best practices within cloud environments.
- Bachelor's Degree in Computer Science, Information Technology, or related field.
- Professionals with a verifiable employment history in the role may also be considered.
- 2+ years of experience as a Site Reliability Engineer.
- 2+ years in software development.
- Extensive cloud experience, especially with AWS.
- Proven expertise in at least one infrastructure-as-code tool, such as Terraform, CloudFormation, Puppet, or Ansible.
- Hands-on experience with Docker, ECS, EKS, and Kubernetes.