Site Reliability Engineering

2 weeks ago

Munich, Germany Infosys Full-time

The Role

We are looking for a visionary and highly experienced SRE Architect to lead the design and implementation of our reliability and scalability strategy. You will be the principal architect responsible for creating the blueprint for our production systems, ensuring they are resilient, performant, and highly available. This is a senior-level role that combines deep technical expertise with strategic thinking to influence the entire engineering organization. You will define the standards and frameworks that empower our SRE and development teams to build and operate world-class services.

Key Responsibilities

- Architectural Design & Strategy: Design and architect robust, scalable, and fault-tolerant infrastructure and application services on public cloud platforms (AWS, GCP, Azure). Define the long-term vision for system reliability and performance.
- Reliability Frameworks: Establish and govern the standards for Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across all engineering teams.
- Observability & Telemetry: Architect a comprehensive observability strategy. Design the systems for logging, metrics, tracing, and alerting to provide deep insights into system health and facilitate rapid incident response.
- Automation & Infrastructure as Code (IaC): Lead the strategy for automation and IaC. Design reusable patterns and frameworks using tools like Terraform and Ansible to ensure consistent, repeatable, and secure infrastructure provisioning.
- Resilience & Chaos Engineering: Proactively identify and mitigate reliability risks. Design and champion the implementation of resilience patterns, disaster recovery plans, and chaos engineering experiments to validate system robustness.
- Technical Leadership & Mentoring: Act as a thought leader and subject matter expert in reliability engineering. Mentor SREs and developers, evangelize best practices, and lead architectural review sessions to ensure reliability is a core component of every feature.
- Incident Management Evolution: While not the primary on-call responder, you will analyze major incidents to identify architectural weaknesses and drive the necessary design changes to prevent recurrence. You will help evolve our postmortem culture and incident response capabilities.

Required Qualifications & Skills

- Experience: 10+ years of experience in software engineering, DevOps, or systems engineering, with at least 5 years in a senior SRE or systems architecture role.
- Cloud Expertise: Expert-level knowledge of at least one major cloud provider (AWS, GCP, or Azure), including core services like compute, storage, networking, and managed databases.
- Containerization & Orchestration: Deep, hands-on experience designing and managing large-scale Kubernetes clusters and container-based microservices architectures.
- Infrastructure as Code (IaC): Proven expertise in architecting infrastructure with Terraform. Proficiency with configuration management tools like Ansible, Chef, or Puppet.
- Observability Platforms: Extensive experience designing and implementing monitoring and observability solutions using tools like Prometheus, Grafana, OpenTelemetry, Jaeger, and the ELK Stack (Elasticsearch, Logstash, Kibana) or similar commercial tools (e.g., Datadog, New Relic).
- Programming/Scripting: Strong proficiency in a high-level programming language such as Go or Python for automation, tooling, and building system integrations.
- Systems Design: Deep understanding of distributed systems, networking protocols (TCP/IP, HTTP), and high-availability design patterns.

Preferred Qualifications

- Experience working across multiple cloud environments (multi-cloud).
- Professional cloud certifications (e.g., AWS Certified Solutions Architect Professional, Google Professional Cloud Architect).
- Experience with service mesh technologies like Istio or Linkerd.
- Knowledge of security best practices in a cloud-native environment (DevSecOps).
- Demonstrated experience leading large-scale technology transformations and influencing engineering culture.

About your team

Our CRL (Consumer Goods, Retail & Logistics) practice helps some of the largest global firms and most recognizable local brands solve their biggest challenges in today’s age of constant disruption. With diverse services spanning growth strategy and new product innovation, to omni-channel customer experience, supply chain resiliency and AI-driven new business models, we help clients shape and achieve their growth agenda for a sustainable future. We transform traditional organizations to digitally centric business models and drive new revenue streams.

Site Reliability Engineering

2 weeks ago

Munich, Germany Infosys Consulting - Europe Full-time

Site Reliability Engineering (SRE) Architect – CRL – Germany
Do you want to boost your career and collaborate with expert, talented colleagues to solve and deliver against our clients' most important challenges? We are growing and are looking for people to join our team. You'll be part of an entrepreneurial, high-growth environment of 300,000 employees....
Teamlead for Site Reliability Engineering

1 week ago

Munich, Germany OSBRA – Formteile GmbH Full-time

Parkdepot GmbH 4.1 (52 reviews)
Teamlead for Site Reliability Engineering (M/F/d)
Starting immediately (permanent), 40 h per week, no salary specified, partial home office possible
As the Teamlead for Site Reliability Engineering (M/F/d) you will lead a team that is responsible for all processes around maintaining our IoT fleet and cloud...
Senior Software Engineer, Site Reliability Engineering

1 week ago

Munich, Bavaria, Germany Google Full-time €80,000 - €120,000 per year

Minimum qualifications: Bachelor's degree in Computer Science, a related field, or equivalent practical experience. 5 years of experience with software development in one or more programming languages. 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems. 2 years of experience leading projects and providing technical...
Senior Software Engineer, Site Reliability Engineering

5 days ago

Munich, Bavaria, Germany Google Full-time €70,000 - €110,000 per year

Minimum qualifications: Bachelor's degree in Computer Science, a related field, or equivalent practical experience. 5 years of experience with software development in one or more programming languages. 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems. 2 years of experience leading projects and providing technical...
Site Reliability Engineer

1 day ago

Munich, Bavaria, Germany Workaround GmbH Full-time €96,000 - €126,000 per year

#ProGlove
At ProGlove, we're tackling one of the biggest challenges of our time: shrinking and aging workforces in industries that can't simply automate their way forward. Most companies will rely on human workers for decades to come, and we build the technology that keeps those people safe, healthy, and hyper-efficient. Our wearable solutions and...
Senior Product Manager

2 weeks ago

Munich, Germany Munich Re Full-time

Your Job: Define and execute the strategic vision and lead the activities related to the transformation of our Digital Operations service in line with concepts like DevSecOps, Site Reliability Engineering (SRE), Platform Engineering, operational observability and automation. Establish an agile product mindset, understanding the customer or consumer of our...
Internship - Reliability Engineering & Data

24 hours ago

Munich, Germany IDEALworks GmbH Full-time

**What you’re keen to do**:
- Collaborate with the Site Reliability Engineering team to analyze robotics telemetry data based on the VDA 5050 standard.
- Implement data observability practices to ensure the reliability and integrity of telemetry data.
- Work on proactive maintenance strategies based on data insights to enhance system reliability.
-...
Site Reliability Engineer

6 days ago

Munich, Bavaria, Germany ICT Digital Solutions Full-time €80,000 - €120,000 per year

To strengthen our team, we are looking for a Site Reliability Engineer (m/f/d) to start as soon as possible at our Ismaning site near Munich or remotely. Your tasks: Ensuring the availability, performance, and scalability of digital platforms and services. Developing, implementing, and maintaining automated solutions for...
Senior Site Reliability Engineer

2 weeks ago

Munich, Germany XEMPUS Full-time

Who we are: We are Xempus, Germany's leading independent Software-as-a-Service (SaaS) platform for the management and distribution of pension, life and health insurance. Our mission: making pension, life and health insurance digital and understandable, efficient and accessible for everyone. Since 2007, we have been constantly working to drive the...
Site Reliability Engineer

5 days ago

Munich, Bavaria, Germany Exaring AG Full-time €80,000 - €120,000 per year

About Exaring AG: Our platform offers IPTV live streaming: Free TV, Pay TV, NewTV, Video-on-Demand, recordings, restart, and timeshift – all in a single app on a wide range of devices, such as smartphones, tablets, and TVs (FireTV, Apple TV, Smart TVs, and our own stick). At Exaring AG, we operate the entire platform and handle the complete process: from...
