Unleashing ML Innovation at Spotify with Ray – Spotify Engineering
As the field of machine learning (ML) continues to evolve and its impact on society and various aspects of our lives grows, it is becoming increasingly important for practitioners and innovators to consider a broader range of perspectives when building ML models and applications. This desire is driving the need for a more flexible and scalable ML infrastructure.
At Spotify, we strongly believe in a diverse and collaborative approach to building ML applications. Gone are the days when ML was the domain of only a small group of researchers and engineers. We want to democratize our ML efforts such that contributors of all backgrounds, including engineers, data scientists, and researchers, can leverage their unique perspectives, skills, and expertise to further ML at Spotify. As a result, we expect to see an increase in well-represented ML advancements at Spotify in the coming years — and the right infrastructure will play a crucial role in supporting this growth.
Background
Spotify founded its machine learning (ML) platform in 2018 to provide a gold standard for reliable and responsible production ML. As an ML platform team, we aim to empower our users to spend less time maintaining bespoke ML infrastructure and more time focusing on solving business problems through novel model development.
Our centralized infrastructure now serves over half of our internal ML practitioners and ML teams. Internal research has shown, however, that our platform tools aren’t currently perfectly suited for all dimensions of ML practitioners. While the majority of our ML engineers use our centralized tooling, fewer data and research scientists do. We believe solving the following user needs can help all dimensions of ML innovators at Spotify:
Spotify’s ML infrastructure today
Our goal for Spotify’s ML Platform has always been to create a seamless user experience for ML practitioners who want to take an ML application from development to production. In early 2020, our ML Platform expanded to cover the ML production workflow for Spotify’s ML practitioners with four core product offerings:
Lifecycle of an ML project at Spotify
ML at Spotify resembles a funnel. At the widest end, we have a big volume of ML activities undertaken by data and research scientists to quickly prove high-potential ideas. Their tasks, tools, and methods are diverse and heterogeneous and difficult to standardize — in fact, it’s suboptimal to standardize their methods at this point in the lifecycle. As the funnel narrows and high-potential ideas prove out, data engineers and ML engineers take over. Standardizing their tasks and tools is an optimization both from a user experience standpoint and from a business perspective; ML engineers can spend less time building redundant tooling, and our business benefits from having proven, reliable, and innovative ML ideas launched to production faster.
We built our platform for ML engineers first because their use cases and needs were easier to standardize. But that focus came at a cost: it meant less flexibility for innovation at the earlier stages of the lifecycle.
Transforming ML development at Spotify
In 2022, our team set out to refresh the two-to-three-year strategy and vision for Spotify’s ML Platform. A big component of that strategy was to better serve the needs of innovators focused on the earlier stages of the ML lifecycle and enable a seamless transition from development to production.
The next evolution of Spotify’s ML infrastructure
At Spotify, ML practitioners all share a similar ML user journey. They want to start their ML projects by prototyping on their local machines or in a notebook, and they need access to large computing resources like dozens of CPUs or GPUs. They like to easily create and scale end-to-end ML workflows in Python, access a diverse set of modern ML libraries, and seamlessly integrate with the rest of the Spotify engineering ecosystem with minimal code changes and infrastructure knowledge.
To better meet the needs of our users and improve productivity across the entire ML lifecycle, we need flexible infrastructure that meets the majority of our users where they already are. We need a platform that helps day-one Spotifiers feel productive, regardless of whether they’re a data scientist in customer service testing ideas fast or an ML engineer on an advanced personalization team concentrating on hardening production workflows. Our current platform experience is heavily weighted towards a single user journey: an ML engineer using TensorFlow/TFX for supervised learning production applications. To better support a broader range of users, we need to lower the barrier to entry and embrace more diverse ML tooling while maintaining scalability and performance in end-to-end ML workflows.
Introducing Ray
After extensive prototyping and investigation, we believe Ray addresses those needs.
Ray is an open-source, unified framework for scaling AI and Python applications. It’s tailored for ML development with its rich ML ecosystem integration. It easily scales compute-heavy workloads such as feature preprocessing, deep learning, hyperparameter tuning, and batch predictions — all with minimal code changes. Ray is widely adopted across the ML industry. OpenAI cofounder and CTO Greg Brockman said at Ray Summit 2022, “We’re using [Ray] to train our largest models. So it has been very, very helpful for us in terms of just being able to scale up to a pretty unprecedented scale and to not go crazy.” With Ray, ML developers no longer need to completely change their code and framework of choice to achieve scale for production applications, easing the transition from local development to a distributed computing environment.
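As a rough illustration of what "minimal code changes" can look like, the sketch below runs the same preprocessing function either locally or as parallel Ray tasks. The function names (`normalize`, `normalize_all`) and the `use_ray` flag are hypothetical and not taken from Spotify's platform. Only `ray.init`, `ray.remote`, and `ray.get` are actual Ray calls, and the sketch falls back to plain Python when Ray is not installed.

```python
# Sketch only: normalize, normalize_all, and use_ray are illustrative names.
try:
    import ray  # optional; without it the sketch runs sequentially
except ImportError:
    ray = None


def normalize(batch):
    """Scale one batch of feature values into the range [0, 1]."""
    low, high = min(batch), max(batch)
    span = (high - low) or 1.0  # avoid dividing by zero for constant batches
    return [(value - low) / span for value in batch]


def normalize_all(batches, use_ray=False):
    """Normalize every batch, locally or as parallel Ray tasks."""
    if use_ray and ray is not None:
        ray.init(ignore_reinit_error=True)
        remote_normalize = ray.remote(normalize)  # same function, now a Ray task
        futures = [remote_normalize.remote(batch) for batch in batches]
        return ray.get(futures)  # block until every task has finished
    return [normalize(batch) for batch in batches]


if __name__ == "__main__":
    print(normalize_all([[1, 2, 3], [10, 20, 30]]))
```

The business logic in `normalize` is untouched in both paths. Only the call site changes when the work moves to a cluster.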
Incorporating Ray into the Spotify ecosystem
We built a centralized Spotify-Ray platform because we want our ML practitioners to solve ML problems and not have to devote their time to managing Ray or underlying infrastructure. The platform consists of server-side infrastructure, client-side SDK and CLI, and integrations with the rest of the Spotify ecosystem. We designed it to cater to the needs of all types of ML practitioners, not only ML engineers. We optimized for accessibility, flexibility, availability, and performance.
Accessibility
We wanted users to have an amazing onboarding experience with a gradual learning curve. We optimized for progressive disclosure of complexity, providing sensible defaults for common use cases and flexible abstractions over underlying Ray and Kubernetes complexity that accommodate new users and “power users” alike. This lets ML practitioners focus on their business logic right away. With a single CLI command, users can create their own Ray cluster with preinstalled ML tools, ready-to-run notebook tutorials, VS Code server for in-browser editing, and SSH access.
Users can list, describe, scale, customize, and delete Ray clusters too.
Under the hood, we use Google Kubernetes Engine (GKE) and the open-source KubeRay operator. Our CLI creates a custom Kubernetes Ray cluster resource that tells KubeRay to create a new Ray cluster. Users start in a shared playground namespace to learn and experiment with minimal setup. Once they’re ready, they create their own team namespace. Our multi-tenancy team management process grants permissions, configures resources, and manages contributors. It generates all Kubernetes resources based on a team configuration file and deploys them to the cluster to set up the namespace.
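To make the custom resource more concrete, here is a minimal sketch of the kind of `RayCluster` object the KubeRay operator watches for, built as a plain Python dictionary. This is not Spotify's CLI output: the helper name `ray_cluster_manifest`, the cluster and group names, and the container image are placeholders chosen for illustration. The top-level field names follow the public KubeRay `RayCluster` schema.

```python
import json


def ray_cluster_manifest(name, namespace, workers=2, image="rayproject/ray:2.9.0"):
    """Build a RayCluster custom resource as a plain dict (hypothetical helper)."""

    def pod_template(container_name):
        # Minimal pod template; a real one would also set resources, ports, etc.
        return {"spec": {"containers": [{"name": container_name, "image": image}]}}

    return {
        "apiVersion": "ray.io/v1",  # older KubeRay releases used ray.io/v1alpha1
        "kind": "RayCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": pod_template("ray-head"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "workers",  # placeholder group name
                    "replicas": workers,
                    "minReplicas": workers,
                    "maxReplicas": workers,
                    "rayStartParams": {},
                    "template": pod_template("ray-worker"),
                }
            ],
        },
    }


if __name__ == "__main__":
    # "playground" mirrors the shared namespace described above.
    print(json.dumps(ray_cluster_manifest("demo", "playground"), indent=2))
```

Applying a manifest like this to a cluster where the KubeRay operator is installed causes the operator to create the head and worker pods. A CLI can hide that step behind a single command.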
In addition to the CLI, we created a Python SDK with equivalent features. The SDK lets users programmatically manage their Ray clusters.
Flexibility
Users can easily make use of state-of-the-art ML libraries and select computing resources to support their workloads for research and prototyping. As a result of using Ray, our platform supports all the major ML frameworks like PyTorch, TensorFlow, and XGBoost. Computing resource configuration is abstracted in a unified and user-friendly way. If users don’t want the default computing resources, they can easily customize them. For example, they can request a specific type and number of GPUs.
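One way such an abstraction could work is to translate a simple "GPU type and count" request into the Kubernetes fields that schedule a pod onto a matching GKE node. The helper below is a hypothetical sketch, not Spotify's SDK, and its name and default values are assumptions. The `nvidia.com/gpu` resource name and the `cloud.google.com/gke-accelerator` node label are the standard Kubernetes and GKE conventions.

```python
def gpu_worker_overrides(gpu_type, gpu_count, cpus=4, memory_gib=16):
    """Translate a simple GPU request into pod-spec fields (hypothetical helper)."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    amounts = {
        "cpu": str(cpus),
        "memory": f"{memory_gib}Gi",
        "nvidia.com/gpu": str(gpu_count),  # extended resource from the NVIDIA device plugin
    }
    return {
        # GKE labels GPU nodes with the accelerator type they carry.
        "nodeSelector": {"cloud.google.com/gke-accelerator": gpu_type},
        # Equal requests and limits keep the pod's resource usage predictable.
        "resources": {"requests": dict(amounts), "limits": dict(amounts)},
    }


if __name__ == "__main__":
    print(gpu_worker_overrides("nvidia-tesla-t4", gpu_count=2))
```

The user states only what they need (two T4 GPUs in this example), and the platform fills in the scheduling details.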
Availability
We build on top of managed GKE’s availability instead of managing Kubernetes clusters ourselves. We isolate workloads by giving each Ray worker its own GKE node, and we isolate teams by giving each their own Kubernetes namespace.
Performance
We leverage GKE’s image streaming feature to speed up image pulls. We’ve decreased the time it takes to pull large GPU-based container images from several minutes to just a few seconds.
A Ray-based path to production
In the name of building a minimum viable version of Spotify-Ray, we chose to prioritize early-stage prototyping and experimentation — the “mouth” of the funnel, in other words. However, we see promise in Ray as the backbone of a powerful path to production for ML practitioners at Spotify. With Spotify-Ray’s native Flyte integration and high-level APIs in the works to streamline and accelerate canonical MLOps tasks, e.g., data loading, artifact logging, experiment tracking, and pipeline orchestration, we believe Ray can significantly shorten the time-to-production for ML applications at Spotify. We’re excited to work together with our internal ML practitioners to achieve this vision.
Use case: Graph learning for content recommendations
In a recent research project, Spotify’s Tech Research team, a previously underserved ML Platform end user, experimented with using graph learning technologies for recommendations. Unlike past research projects, which were typically prototyped with ad hoc tooling and then implemented for production scenarios, the graph learning implementation needed to be production-ready so the team could quickly assess graph neural networks (GNNs) for Spotify’s business use cases. Our ML researchers needed infrastructure that was flexible and easy to productionize quickly. This prompted Tech Research to use graph learning on Spotify-Ray to generate content recommendations.
Following promising offline results on internal datasets, the Tech Research team ran an A/B test to understand how GNN-based algorithms changed our home page’s “Shows you might like” recommendations. Conducting these A/B tests was challenging since the GNN workflows are different from typical ML workflows. Tech Research adopted Spotify-Ray for their infrastructural needs to overcome these challenges and implemented a set of components to train and deploy GNN models at scale.
Using the above components, this team built an end-to-end pipeline for generating show recommendations using GNN-based models and successfully launched an A/B test in less than three months, a feat that was extremely challenging for Tech Research in the past given the previously supported ML infrastructure. The A/B test resulted in significant metric improvements and an improved user experience on our home page’s “Shows you might like.”
Looking ahead
Spotify’s ML practitioners’ demand for PyTorch has grown considerably, particularly for emerging use cases in the NLP and GNN spaces. We plan to use Ray to support and scale PyTorch to meet this growing demand and help our diverse users feel productive, no matter their role on the team.
While bringing in a new framework carries the risk of fragmentation, with better foundational building blocks in place, we can work toward creating a more flexible, representative, and responsible ML platform experience that comprehensively unleashes ML innovation at Spotify.
Our work bringing Ray into the Spotify ecosystem would not have been possible without the fantastic work of the ML Workflows team at Spotify, our teammates from ML Platform, and generous collaboration with the Anyscale team. Thank you to the individuals whose work made unleashing ML innovation at Spotify possible: Jonathan Jin, Mike Seid, Joshua Baer, Richard Liaw, Dmitri Gekhtman, Abdullah Mobeen, Maria Cipollone, Olga Ianiuk, Sara Leary, Grace Glenn, Omar Delarosa, Maisha Lopa, Shawn Lin, Andrew Martin, and Union.ai for their support on Flyte integration.
PyTorch, the PyTorch logo, and any related marks are trademarks of The Linux Foundation.
TensorFlow, the TensorFlow logo, and any related marks are trademarks of Google Inc.