Site Reliability Engineer
HKT
- Athens
- Permanent
- Full-time
- System Reliability: Oversee the availability, performance, and scalability of edge cloud services and infrastructure, ensuring they meet or exceed our service-level objectives and agreements.
- Incident Management: Lead the response to service incidents and outages, including participating in on-call rotations, resolving issues efficiently, and conducting thorough post-incident analyses.
- Performance Optimization: Continuously monitor and optimize system performance, identifying and addressing bottlenecks to improve efficiency and reduce latency.
- Capacity Planning: Conduct capacity planning and forecasting to accommodate system growth and peak loads, ensuring system resilience and performance.
- Automation: Develop and implement automation strategies for operational tasks and deployment processes to enhance system stability and reduce manual errors.
- Disaster Recovery: Design and manage disaster recovery plans, ensuring data integrity and system resilience against potential threats.
- Security: Enforce robust security policies and practices, regularly audit systems for vulnerabilities, and apply necessary security patches and updates.
- Collaboration: Work closely with development teams and other stakeholders to ensure the reliability and scalability of systems and services.
- Continuous Improvement: Lead initiatives to continuously improve processes, practices, and systems, ensuring the highest levels of reliability and efficiency.
- Documentation: Create and maintain detailed documentation for system architectures, configurations, and operational procedures.
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- At least 3 years of proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role, in a complex networking environment.
- Strong background in Linux/Unix administration and scripting languages such as Python or Bash.
- Experience with automation, configuration management, and version control tools (e.g., Ansible, Git).
- Familiarity with cloud services (AWS, GCP, Azure) and container orchestration tools (e.g., Kubernetes).
- Deep understanding of network protocols and services (DNS, HTTP/S, SSH, FTP).
- Excellent problem-solving, troubleshooting, and communication skills.
- Ability to work in a fast-paced, evolving environment and collaborate effectively with a diverse team.