PhD Student in Reliable Large Scale AI Infrastructures
Posted 3 days ago
Huawei's vision is to enrich life through communication. We are a fast-growing, leading global provider of information and communications technology (ICT) solutions.
Driven by a commitment to operations, ongoing innovation, and open collaboration, we have established a competitive ICT portfolio of end-to-end solutions in Telecom and enterprise networks, Devices and Cloud technology and services.
Huawei is active in more than 170 countries and has over 197,000 employees, of whom more than 80,000 are engaged in research and development (R&D). With us, you have the opportunity to work in a dynamic, multinational environment with more than 150 nationalities worldwide.
Huawei's TTE RAMS Lab is a corporate competence center responsible for researching high-reliability and high-safety architectures and technologies for complex intelligent systems. Our goal is to provide Huawei products with cutting-edge research and advanced technical solutions for intelligent reliability and safety in carrier-grade ICT and safety-critical systems, such as autonomous driving, so that our products give customers the best user experience and performance.
We are seeking a highly motivated and talented PhD student to join our cutting-edge research team focused on large-scale reliable AI infrastructures. This position will emphasize ensuring the robustness and reliability of training and inferencing for large language models (LLMs). The ideal candidate will engage in both theoretical and practical research aimed at overcoming challenges related to scaling AI systems while maintaining reliability, resilience, and efficiency across various AI workflows.
As part of the team, you will have the opportunity to work at the forefront of AI infrastructure, addressing critical issues like fault tolerance, data and model consistency, distributed AI system infrastructure, and the optimization of machine learning pipelines.
- Conduct advanced research on scaling and ensuring reliability in the training and inferencing of large language models (LLMs);
- Develop innovative methodologies for enhancing the fault tolerance and resilience of AI infrastructures, including the detection and recovery of system failures;
- Investigate and improve existing distributed systems architectures and parallel computing frameworks for large-scale machine learning tasks;
- Collaborate with interdisciplinary teams to design and implement strategies that ensure the robustness of AI systems in production environments;
- Contribute to the development of new algorithms and models that improve both the scalability and reliability of large-scale training and inferencing pipelines;
- Publish research findings in top-tier conferences and journals, and contribute to internal knowledge sharing;
- Assist in the development of open-source tools and frameworks for the research community.
- Strong academic background in Computer Science, Engineering, or a related field with a Master's degree;
- Demonstrated interest or experience in AI, machine learning, or large-scale distributed systems;
- Solid understanding of deep learning principles and techniques, particularly as applied to large language models (LLMs);
- Proficiency in programming languages such as Python, C/C++, or equivalent, and experience in system scripting with tools such as Bash, Perl, sed, and awk;
- Familiarity with deep learning frameworks (TensorFlow, PyTorch, etc.); an understanding of the underlying low-level technical details is a big plus;
- Strong knowledge of Operating Systems (Linux-based), distributed computing, cloud infrastructure, and containerization technologies (e.g., Kubernetes, Docker);
- Excellent problem-solving skills, analytical thinking, and attention to detail;
- Ability to work collaboratively in a multidisciplinary team environment and communicate complex technical concepts effectively.
- Experience in large-scale AI/ML system deployment, optimization, or maintenance;
- Familiarity with the challenges and best practices in training very large neural networks;
- Background in systems engineering, cloud architectures, or high-performance computing (HPC);
- Knowledge of tools and technologies for distributed training (e.g., Horovod, DeepSpeed, Slurm, Ray);
- Prior research or industry experience in AI model reliability, system fault tolerance, or similar areas;
- Publications at relevant conferences and workshops are a big plus.
Huawei is a leading global information and communications technology (ICT) solutions provider. Driven by a commitment to operations, ongoing innovation, and open collaboration, we have established a competitive ICT portfolio of end-to-end solutions in Telecom and enterprise networks, Devices and Cloud technology and services. Our ICT solutions, products and services are used in more than 170 countries and regions, serving over one-third of the world's population. With 197,000 employees, Huawei is committed to developing the future information society and building a Better Connected World.
Please send your application and CV (incl. cover letter and reference letters) in English.