The Rise (and Lessons Learned) of ML Models for Personalizing Home Content (Part 1): Spotify Engineering

November 15, 2021
Published by Annie Edmundson, Engineer

At Spotify, our goal is to connect listeners with creators, and one way we do that is by recommending quality music and podcasts on the Home page. In this two-part blog series, we’ll talk about the lessons we’ve learned from building ML models that recommend diverse and fulfilling content to our listeners, and from building the ML stacks that serve those models.

Our machine learning work focuses on personalizing the Home page experience and connecting listeners with the most relevant creators. Like many recommender systems, Spotify’s Home page recommendations are driven by two stages:

Stage 1: Candidate generation: The best albums, playlists, artists, and podcasts are selected for each listener.

Stage 2: Ranking: Candidates are ranked in the best order for each listener.
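As an illustrative sketch (not Spotify’s actual code), the two stages above can be expressed as a cheap scoring pass over the whole catalog followed by a richer re-ranking of the survivors; `candidate_score` and `ranking_score` are hypothetical placeholders:

```python
def candidate_score(listener, item):
    # Placeholder Stage 1 score: e.g. a dot product of cheap embeddings.
    return sum(a * b for a, b in zip(listener["embedding"], item["embedding"]))

def ranking_score(listener, item):
    # Placeholder Stage 2 score: a richer model would use many more features.
    return candidate_score(listener, item) + item.get("freshness", 0.0)

def generate_candidates(listener, catalog, k=100):
    """Stage 1: score every catalog item cheaply and keep the top k."""
    return sorted(catalog, key=lambda item: candidate_score(listener, item),
                  reverse=True)[:k]

def rank(listener, candidates):
    """Stage 2: order the surviving candidates with the more expensive model."""
    return sorted(candidates, key=lambda item: ranking_score(listener, item),
                  reverse=True)
```

The point of the split is cost: Stage 1 must touch a huge catalog, so it stays cheap; Stage 2 only sees the short list, so it can afford a heavier model.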

In this first part of the series, we’ll focus on the first stage – the machine learning solution we built to personalize the content on listeners’ Home pages – and, in particular, the lessons we learned in building, evaluating, and deploying this model.

Home @ Spotify

The Home page contains cards – square items representing an album, playlist, etc. – and shelves – horizontal rows containing multiple cards. We personalize the content on listeners’ Home pages, algorithmically curating the music and podcasts displayed on their shelves. Some content is generated through heuristics and rules, some is manually curated by editors, and other content is generated from the predictions of trained models. We currently have several models in production, each powering the content curation for a different shelf, but in this post we’ll discuss three of them:

  • Podcast model: predicts podcasts a listener will likely listen to, shown in the “Shows you might like” shelf.
  • Shortcuts model: predicts a listener’s next familiar listen, shown in the Shortcuts feature.
  • Playlist model: predicts playlists a new listener will likely listen to, shown in the “Try something else” shelf.

Since launching our first model to recommend content on Home, we’ve worked to improve our ML stack and processes so we can experiment with and productionize models more quickly and reliably.

The road to simplicity and automation

Anyone who has contributed to productionizing an ML model knows that moving a model from experimentation to production is no easy task. There are numerous challenges in managing, validating, and tracking the data that goes into a model, and in monitoring and retraining the models themselves. Although we’ve always tried to keep our ML infrastructure simple and as close to the source of the features as possible, deploying and maintaining models has become significantly easier for our squads since we started.

At a high level, an ML workflow can be divided into three main phases: 1) data management, 2) experimentation, and 3) operationalization.

It’s common to iterate through the training and evaluation phase until a final model version is selected as the best. That model is then deployed to the production system, where it can begin making predictions for listeners. As with most production systems, models (and the services/pipelines that serve them) should be closely monitored. To keep a model fresh (which matters more for some tasks than others; more on that later), retraining and model versioning make up the last step of our workflow. This part of our stack and workflow has changed significantly since our first model – from batch (offline) predictions of the content a listener would likely stream, to today, where all of our models are served in real time. The image below shows where our machine learning stack started and where we are now:

Our current ML stack automates many of the processes involved in maintaining models (including serving them online): our serving infrastructure includes automated feature logging, with Scio pipelines to transform those features and Kubeflow pipelines for weekly retraining. We apply data validation to our training and serving features (as well as between successive training datasets) to ensure that our features are consistent and follow the same distributions at training time and at inference time. Our Kubeflow pipelines include components that check evaluation scores and automatically push a model to production if its scores are above our thresholds. With this stack, we have monitoring and alerting on the automated data validation pipelines, as well as on the online deployments of our models – allowing us to address any problems we encounter.
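The “check evaluation scores, push if above threshold” step can be sketched as a small gate function. This is a minimal illustration, not Spotify’s Kubeflow component: the metric name, threshold value, and `push_to_production` stub are all assumptions.

```python
AUC_THRESHOLD = 0.80  # assumed value, purely for illustration

def push_to_production(model_uri: str) -> None:
    # In a real pipeline this would swap the model served online;
    # here it is a stub so the sketch stays runnable.
    print(f"promoting {model_uri}")

def maybe_promote(eval_metrics: dict, threshold: float = AUC_THRESHOLD) -> bool:
    """Promote the newly retrained model only if its offline score clears the bar."""
    if eval_metrics.get("auc", 0.0) >= threshold:
        push_to_production(eval_metrics["model_uri"])
        return True
    return False  # keep the current production model; a human can investigate
```

Gating promotion on an offline metric like this is what lets weekly retraining run unattended without risking a regression reaching listeners.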

With a lot of effort and a lot of learning, our ML stack has evolved to make these processes more automated and more reliable, enabling us to iterate on our models quickly and improving our engineering productivity.

How we wrangle our training and serving data

When we first start thinking about a problem, we always dig into the data first – which data will be useful? What data is available? Then we look closely at the data that will be used for the features, identifying what’s in the dataset and flagging its edge cases. We’re fairly confident about the contents of the data used for our training features, and about what the transformed data looks like – but the features fetched and transformed at serving time are a completely different story.

Batch training data and batch predictions

Historically, we had one set of infrastructure for fetching and transforming features during experimentation (training) and a different set of infrastructure for fetching and transforming features at prediction time (serving).

Then we started predicting online (… with incorrect data).

When we moved the podcast model from batch offline predictions to real-time serving, we built a new service to support it – this new service had to fetch and transform features, make predictions, and respond to requests. The important part here is that feature fetching and transformation now lived in a different place than the corresponding training feature transformations. And, unfortunately, models are like black boxes, so it’s difficult, if not impossible, to test their output. Some time later, we discovered that we had been transforming one of the model’s features slightly differently at serving time than at training time, resulting in potentially degraded recommendations – and with no way to detect it, this went on for four months. Just think about that for a second. Such a simple part of our stack – a few lines of code at most – went wrong and affected the recommendations our model produced. Our short-term fix was to change the line of code in our prediction service that caused the problem, but we knew that in the long run we would either need a single source of data for both training and serving, or we would need to verify that the data was transformed identically in both places.
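To make this failure mode concrete, here is a toy example of the kind of one-line discrepancy that can silently skew a feature. The transforms are invented for illustration; they are not the actual feature involved.

```python
import math

def transform_for_training(play_count: int) -> float:
    # log1p smoothing, as used when building the training dataset
    return math.log(play_count + 1)

def transform_for_serving(play_count: int) -> float:
    # subtly different: the +1 was dropped in the serving code path
    return math.log(play_count)

# The model learns on one feature distribution and is scored on another;
# nothing crashes, so the skew goes unnoticed unless the two paths are compared.
skew = [abs(transform_for_training(c) - transform_for_serving(c))
        for c in (1, 5, 50)]
```

Note the skew is largest for small counts, so the listeners most affected are exactly the ones with the least history – a bias no error log would ever surface.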

One transformation to rule them all

Our first approach was to make all feature fetching and transformation happen in the same code path, so that training and serving features are processed identically. Taking the Shortcuts model as an example again, our goal was to get rid of the Python service that transformed the training features. This service was always running, constantly checking whether it was Monday; if so, it would request data from the necessary services (rate-limited to 5 requests/second) and transform it into features. Ideally this would have been implemented as a pipeline, but because the process took more than 24 hours, we couldn’t schedule and orchestrate it that way. There were many reasons we wanted to move away from this approach, but chief among them was that it was difficult to log features when the only source of those features was a different service (owned by a different squad). Using the feature logging capability of our serving infrastructure, we can automatically log the already-transformed features, which can then be used for training. Today, all of our features for both training and serving are transformed by the same code in a Java service. We now use this feature logging for all of our models, both because it solves this problem and because it reduces the amount of extra infrastructure we need to support.
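The feature-logging idea can be sketched as follows. Everything here is an illustrative stand-in – `feature_log`, `transform`, and the record shape are assumptions, with a plain list standing in for a durable sink the training pipeline can read:

```python
import time

feature_log = []  # stand-in for a durable sink (e.g. a logging topic or table)

def transform(raw):
    # The single transform code path, shared by serving and training.
    return {"plays_sqrt": (raw["plays"] + 1) ** 0.5}

def predict_and_log(listener_id, raw_features, model):
    features = transform(raw_features)
    feature_log.append({              # log the *already transformed* features
        "listener_id": listener_id,
        "features": features,
        "logged_at": time.time(),
    })
    return model(features)

def training_examples(labels):
    # Training reuses exactly what was served: no second implementation exists.
    return [{**row, "label": labels[row["listener_id"]]} for row in feature_log]
```

Because training examples are assembled from what the serving path actually emitted, the skew described in the previous section becomes impossible by construction.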

But wait, we can do more: validating our data

The second approach we adopted to ensure there’s no skew between our training and serving features is using TensorFlow Data Validation (TFDV) to compare training and serving data schemas and feature distributions on a daily basis. The alerting we’ve added to our data validation pipelines lets us detect significant differences between feature sets – it uses the Chebyshev distance metric, which measures the largest difference between two vectors’ coordinates and can alert us to drift in training and serving features.
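The core of that check can be sketched without TFDV: normalize the two feature-value histograms and take the Chebyshev (L-infinity) distance between them. The histograms and the threshold below are illustrative, not production values.

```python
def normalize(counts):
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def chebyshev_distance(train_counts, serve_counts):
    # Largest absolute difference between the two normalized distributions.
    p, q = normalize(train_counts), normalize(serve_counts)
    return max(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))

def drift_alert(train_counts, serve_counts, threshold=0.1):
    # In production this would page the owning squad rather than return a bool.
    return chebyshev_distance(train_counts, serve_counts) > threshold
```

TFDV exposes the same idea for categorical features through its skew/drift comparators, which also use the L-infinity norm under the hood.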

While we knew it was important to understand what’s in our data, we quickly learned how easy it is to make mistakes when taking models to production, because the data is often processed by different libraries in different places. We didn’t expect many data discrepancies, but validating and alerting on them lets us know when something has changed and how the issue should be remedied.

Stay tuned for Part 2, where we’ll take a closer look at how we evaluate our models using offline and online metrics, why CI/CD and model retraining are so important for the recommendations we make, and the challenges we’ve encountered along the way.

Tags: machine learning
