Microsoft SDE-2 Hiring
November 3, 2023
Microsoft Off-Campus SDE-2 Hiring Details:
Overview
The Azure Singularity team is looking for passionate engineers to build the largest deep-learning infrastructure service at Microsoft. In this role you will build new components that bring the latest innovations in AI infrastructure onto the Singularity platform. You will partner with top engineering talent within Singularity and across Azure on cluster orchestration, job scheduling, containerization, and operating-system integration. Your work will enable a variety of AI languages and runtimes on Singularity, bringing distributed deep-learning training and inferencing to life. In addition, you will develop the infrastructure components required to build, deploy, monitor, and service the highly available, scalable Microsoft Service Fabric and Kubernetes clusters under your care. You will lead development and customer support from the front line, establishing the architecture, service-excellence guidelines, and a high quality bar.
Candidates must have a track record of delivering engineering and service excellence on mid- to large-scale services.
Who We Are
We are the engineers on Singularity. We believe that building a planet-scale AI supercomputer from the ground up, one that addresses the fundamental pain points of data scientists and AI practitioners and takes AI to unprecedented scale, is a once-in-a-lifetime opportunity. If you share this dream, come join us!
What Is Singularity?
High-scale AI workloads constantly test the limits of the infrastructure stack. Large-scale model training and inferencing with huge volumes of training data on hundreds to thousands of GPUs is a true engineering challenge. Singularity is a globally distributed, multi-tenant service that provides robust, cost-effective, and competitive AI infrastructure (compute, networking, and storage) for AI training and inferencing. By abstracting workloads from the underlying infrastructure, Singularity creates a shared pool of resources that can be dynamically provisioned for full utilization of expensive GPU compute, enabling data scientists to productively build, scale, experiment with, and iterate on their models on top of a robust, performant, scalable, and cost-effective distributed infrastructure built for AI. In Singularity, we are constantly seeking to apply the best ideas from machine learning, distributed systems, distributed databases, information retrieval, networking, and security.
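To picture what "abstracting workloads from the underlying infrastructure" can look like in practice, here is a minimal, illustrative Python sketch of a shared GPU pool with a job queue. It is not Singularity's actual design or API; the Job and GpuPool names are hypothetical, and a real scheduler would also handle placement, preemption, multi-tenancy, and failures.

from dataclasses import dataclass, field
from collections import deque


@dataclass
class Job:
    name: str
    gpus_needed: int


@dataclass
class GpuPool:
    total_gpus: int
    free_gpus: int = field(init=False)
    queue: deque = field(default_factory=deque)

    def __post_init__(self):
        self.free_gpus = self.total_gpus

    def submit(self, job: Job) -> None:
        # Jobs are queued against the pool, not tied to specific machines,
        # which decouples the workload from the underlying hardware.
        self.queue.append(job)
        self.dispatch()

    def dispatch(self) -> None:
        # Greedily hand free GPUs to waiting jobs so the shared pool
        # stays as fully utilized as possible.
        while self.queue and self.queue[0].gpus_needed <= self.free_gpus:
            job = self.queue.popleft()
            self.free_gpus -= job.gpus_needed
            print(f"running {job.name} on {job.gpus_needed} GPUs")

    def release(self, job: Job) -> None:
        # Returning capacity to the pool lets queued jobs start immediately.
        self.free_gpus += job.gpus_needed
        self.dispatch()


if __name__ == "__main__":
    pool = GpuPool(total_gpus=8)
    train = Job("train-llm", gpus_needed=8)
    infer = Job("serve-model", gpus_needed=2)
    pool.submit(train)   # consumes the whole pool
    pool.submit(infer)   # waits in the queue
    pool.release(train)  # frees capacity; the queued job now runs

In this toy model, jobs describe only how much GPU capacity they need, the pool hands out that capacity greedily, and released capacity immediately goes to the next queued job, which is the utilization behavior described in the paragraph above.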
Qualifications
Required Qualifications:
3+ years of experience coding in one of Python, C#, Java, C, or C++
Experience working with the Linux operating system and Kubernetes cluster orchestration
Experience with improving service operations or engineering fundamentals
Excellent collaboration skills
A Master’s degree in computer science or a related field, or a Bachelor’s degree plus 4+ years of equivalent work experience
At least 3 years of experience building and shipping production software or services
Preferred Qualifications:
Experience in development in the Kubernetes ecosystem
Experience using or extending PyTorch / TensorFlow
Experience in developing distributed storage systems
Experience in building large scale cloud services, distributed systems, or operating systems
Experience programming GPUs (graphics processing units) with CUDA/cuDNN/NCCL
Responsibilities
Deliver a robust container orchestration platform for Singularity
Design and build the scheduling sub-system that is responsible for delivering on the SLAs for AI training and inferencing workloads
Design and build control plane APIs for creation and management of customer, job and model metadata
Deliver node management, fault detection and node repair as a service to improve job/model reliability
Deliver world-class monitoring systems and telemetry pipelines to enhance service and job observability for both end users and operators
Codify security and compliance requirements by building and strengthening system defenses against malicious attacks and exploits
Leverage performance and profiling tools to identify hot spots and bottlenecks across hardware and software boundaries (CPU, GPU, microcode, OS, and networking code) and drive end-to-end job performance
Apply Here: Application Link