AI HPC Cluster Administrator

6 days ago

Tübingen, Germany | Universitätsklinikum Tübingen | Full-time

The Faculty of Medicine is one of the four founding faculties of the Eberhard Karls University of Tübingen. With its non-clinical facilities as well as its research and teaching area corresponding to the organisational units of the University Hospital, it is one of the largest medical training and research institutions in Baden-Württemberg.
The "Hertie Institute for AI in Brain Health" (Hertie AI) is seeking, as soon as possible, an

**AI HPC Cluster Administrator (f/m/d)**
The position will initially be filled on a fixed-term basis until 31.01.2028, with a strong prospect of extension.
The "Hertie Institute for AI in Brain Health" (Hertie AI) is a research institute of the Faculty of Medicine, funded by the Gemeinnützige Hertie Stiftung, with the aim of detecting diseases of the nervous system earlier and treating them better with the help of artificial intelligence. Hertie AI is currently in a dynamic build-up phase. It cooperates with the strong and innovative AI ecosystem in Tübingen (e.g. Cyber Valley, Cluster of Excellence “Machine Learning in Science”, Tübingen AI Center). Hertie AI benefits greatly from infrastructure shared with these initiatives, such as the Machine Learning Cloud (ML Cloud), but has special compute requirements because it aims to analyze brain data and simulate neural circuits. The ML Cloud is a state-of-the-art compute infrastructure with powerful CPU and GPU compute capacity for AI and petabyte-scale storage volumes, used by more than 400 researchers and engineers.

**About the role**:

We are seeking a skilled and proactive Cluster System Administrator to join our team, responsible for managing and optimizing our high-performance computing environment specifically designed for AI workloads. In this role, you will work closely with a team of HPC experts, AI researchers, and IT specialists to ensure that our systems operate at peak performance, supporting AI and ML teams with reliable, scalable computing resources.

**What you'll do**:

- **Cluster Management**: Oversee and manage daily operations of the compute infrastructure, including configuration, deployment, and optimization of nodes and networks to maximize performance for AI workloads.
- **System Monitoring and Maintenance**: Monitor system performance, storage, and network utilization to ensure the clusters operate efficiently. Address hardware and software issues as they arise.
- **User Support**: Provide technical assistance to AI researchers, data scientists, and developers on efficient use of cluster resources.
- **Documentation and Reporting**: Create and maintain comprehensive documentation on system configuration, maintenance tasks, and troubleshooting procedures. Generate regular reports on system performance, uptime, and resource usage for management.

**What you will bring (position requirements)**:

- **Education and Experience**: Specialist knowledge and professional experience in information technology, applied computer science or computer engineering equivalent to the level of a Master's degree
- **Technical Skills**: Proficiency in HPC cluster management tools (e.g., SLURM, PBS, or Torque) and in Linux system administration
- **Scripting and Automation**: Strong scripting skills in Python, Bash, or other languages to automate tasks, optimize processes, and improve system reliability
- **Networking and Storage**: Solid understanding of high-speed networking, parallel file systems, and large-scale storage solutions (e.g., Lustre, Ceph)
- **Problem-Solving**: Excellent troubleshooting abilities and a proactive approach to resolving system issues before they impact users
- Interest in artificial intelligence and motivation to collaborate with scientists and professionals in the field of AI research
- English proficiency

**Relevant experience in some of the following technologies**:

- Experience with automation tools for configuration management (e.g. Ansible, Puppet, Chef) and revision control systems (e.g. Git)
- Experience with containers (Docker, Singularity, Podman, Kubernetes)

**What we offer**:

- Collaboration in the multifaceted environment of a modern university hospital, which, in addition to patient care, also focuses on medical research and teaching
- A future-proof workplace and location, attractive remuneration including a company pension scheme (VBL), and working hours that are as flexible as possible
- Subsidization of the job ticket for public transport and attractive discounts on employee offer platforms
- Structured onboarding phase and an in-house academy for developing professional, social and methodological skills
- Preventive health care through a wide range of sports activities

**Contact**:
**If you have any questions, please contact**:
**Dr. Kristina Kapanova**

**01.01.2025**
Applications should include a CV and cover letter and quote the **index number 5579**.
