Site Reliability Engineer

2 months ago


Unterföhring, Bayern, Deutschland | Virtual Minds GmbH | Full-time
Job Description

Virtual Minds GmbH is a leading provider of premium Adtech solutions in Europe, with over 20 years of experience in the digital advertising market. We are seeking a highly skilled Site Reliability Engineer to join our team and contribute to the growth of our dynamic and innovative organization.

Key Responsibilities
  • Design, deploy, and manage our Kubernetes platform to support scalable and reliable application deployments.
  • Oversee the deployment of our Software-as-a-Service applications on the Kubernetes platform, implementing best practices for application scalability, high availability, and disaster recovery.
  • Implement robust monitoring, alerting, and logging systems to proactively identify and resolve potential issues, ensuring high system availability and quick incident response times.
  • Continuously optimize the Kubernetes infrastructure and SaaS applications to achieve maximum performance and efficiency, conducting performance testing and tuning to meet or exceed service level objectives.
  • Participate in an on-call rotation to respond to incidents promptly and effectively, conducting thorough post-incident reviews to identify root causes and implement preventive measures.
  • Develop and maintain automation tools and scripts to streamline processes and improve the efficiency of operational tasks.
  • Implement security best practices for Kubernetes and SaaS applications, collaborating with the security team to ensure compliance with industry standards and regulations.
  • Work closely with cross-functional teams, including development, infrastructure, and product management, to provide expertise and support throughout the software development lifecycle.
  • Identify areas for improvement in the infrastructure, processes, and deployment methodologies, proposing and implementing enhancements to increase system reliability and performance.
Requirements
  • Significant relevant experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role, with a strong focus on Kubernetes platform management and SaaS deployment.
  • Proficiency in managing Kubernetes clusters and related tooling, including Helm, kubectl, and operators.
  • Experience with container orchestration, service mesh, and Kubernetes networking.
  • Significant experience with AWS, especially services such as EKS, MSK, RDS, S3, CloudTrail, and CloudWatch, and with deploying and managing AWS infrastructure as code using Terraform and ArgoCD.
  • Solid programming skills in languages such as Python or Go, with proficiency in scripting to automate tasks and develop tooling.
  • Experience with monitoring solutions, such as Prometheus and Grafana, and centralized logging platforms, such as the ELK stack.
  • Knowledge of continuous integration and continuous deployment pipelines, preferably with tools like Jenkins, GitLab CI/CD, or Tekton.
  • Understanding of networking concepts and security best practices in the context of Kubernetes and SaaS deployments.
  • Strong analytical and problem-solving abilities to diagnose and resolve complex technical issues.
  • Excellent teamwork and communication skills to collaborate effectively with various teams and stakeholders.
  • A passion for staying up to date with the latest technologies, industry trends, and best practices in SRE and Kubernetes.
What We Offer
  • A dynamic and innovative work environment with a team of top experts.
  • A start-up atmosphere with the benefits of a large corporation.
  • Flexible working hours and 30 days of vacation.
  • A wide range of internal and external training opportunities for personal and professional development.
  • Additional benefits, such as employee discounts, bicycle leasing, and subsidised company pension schemes.