Sr. Data Center Platform Engineer

AMD

Austin, TX

Job posting number: #7292054 (Ref:amd-55088)

Posted: October 31, 2024

Job Description


WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. 

AMD together we advance_



THE ROLE: 

AMD is looking for a senior platform engineer to join our growing team. As a key contributor you will be part of a leading team to drive and enhance AMD’s abilities to deliver the highest quality, industry-leading technologies to market.  

 

THE PERSON: 

The Software Platform Architecture (SPA) team has an open position for a Platform Engineer.  SPA is the hardware-accelerated, software-focused wing of the newly formed Cluster Platform Engineering (CPE) team at AMD and rolls up through the Data Center GPU (DCGPU) business unit. This role is responsible for helping to select, curate, design, automate, and document all software underpinning an entire full-stack, AI-focused platform.  The work is not net-new code development; it focuses on choosing the right software properties and on how data and operations flow through them, to ease the adoption and operation of large-scale GPU-accelerated AI (Artificial Intelligence) and HPC (High Performance Computing) cluster systems within AMD. SPA works closely with the Site Reliability Engineering (SRE) and Data Center Operations (DCOps) teams, who handle day-to-day commissioning and operations of the clusters under CPE’s control.  SPA’s work is measured by how much it reduces operational toil while increasing the rigor and repeatability of processes for the SRE and DCOps teams.  SPA has design responsibility for the full Day 0 – Day 2 software platform. 

 

 

KEY RESPONSIBILITIES: 

  • The Platform Engineer role in SPA cuts across all hardware and software infrastructure, up through platform software, consumption portals, and ultimately the real goal: an AI application software experience optimized for AMD.  The focus is on AI applications that best leverage the AMD Instinct GPU and AMD EPYC CPU in cluster systems.
  • Work with all CPE teams to validate that SPA’s platform designs are Day 0 – Day 2 ready and able to integrate with other teams’ workflows
  • Work with the Release Engineering team to automate the application of updates and system configuration management tools. 
  • Work closely with the SRE team to continually improve how SPA’s designs are integrated into the operational change process and cadence
  • Ensure that all applications and infrastructure elements expose/export telemetry that is centrally managed and used to guide the management of the entire platform
  • Write the glue-code necessary to connect systems to each other if no native mechanisms exist
  • Ensure all platform designs reflect Security as a core principle, provide input to Policy and Guidelines, and participate in platform and project retrospectives/blameless post-mortems

 

PREFERRED EXPERIENCE: 

  • Experience in full-stack (infra, platform, application) multi-site, multi-region solutions at scale
  • Strong multi-distro Linux knowledge across deployment, configuration, and management
  • Cloud Native platform implementation
  • Kubernetes as application dial-tone all the way up through Service Mesh and multi-tenant application deployment and management
  • Strong knowledge of multiple virtualization and containerization technologies such as KVM, Xen, and Kubernetes – OpenShift a bonus
  • Experience with automation platforms at scale using Ansible, Terraform / OpenTofu
  • Some experience with application and platform telemetry frameworks, such as OpenTelemetry
  • Strong networking knowledge with a primary focus on L3 and path-vector routing protocols
  • Experience with RDMA/RoCE and InfiniBand a plus
  • Demonstrated track record of building and delivering complex operational solutions at scale, with the ability to learn new systems quickly in a rapidly changing environment
  • Python and Golang experience a plus
  • Platform message-bus (such as Kafka) experience

 

ACADEMIC CREDENTIALS: 

  • Bachelor’s or Master’s degree in Computer/Software Engineering, Computer Science, or related technical discipline 

 


#LI-RW1

#LI-HYBRID



At AMD, your base pay is one part of your total rewards package.  Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD’s Employee Stock Purchase Plan. You’ll also be eligible for competitive benefits described in more detail here.

 

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.   We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.






More Info

Application Deadline: Open Until Filled
Employer Location: AMD, Santa Clara, California, United States