tl;dr Today, we are open-sourcing a long-awaited GUI for Metaflow. The Metaflow GUI allows data scientists to monitor their workflows in real time, track experiments, and view detailed logs and results for every executed task. The GUI can be extended with plugins, allowing the community to build integrations with other systems, create custom visualizations, and embed upcoming Metaflow features directly into its views.
Metaflow is a full-stack framework for data science that we started developing at Netflix four years ago and open-sourced in 2019. It allows data scientists to define ML workflows, test them locally, scale out to the cloud, and deploy to production, all in idiomatic Python code. Since open-sourcing, the Metaflow community has been growing rapidly: it is now the 7th most-starred active project on Netflix’s GitHub account, with around 4,800 stars. Outside of Netflix, Metaflow powers machine learning in production at hundreds of companies across industries ranging from bioinformatics to real estate.
Since its inception, Metaflow has been a command-line-centric tool. It makes it easy for data scientists to express even complex machine learning applications in idiomatic Python, test them locally, or scale them out in the cloud – all using their favorite IDEs and terminals. Following our culture of freedom and responsibility, Metaflow gives data scientists the freedom to choose the right modeling approach, handle data and features flexibly, and construct workflows easily, while ensuring that the resulting project executes responsibly and robustly on the production infrastructure.
As the number and importance of projects running on Metaflow grew – some of which are very central to our business – our ML platform team began to receive an increasing number of support requests. Often, the questions were of the nature “Can you help me understand why my flow takes so long to execute?” or “How can I find the logs of a model that failed last night?” Technically, Metaflow provides a Python API that allows the user to inspect all these details, for example in a notebook, but writing code in a notebook to answer such basic questions felt like overkill and unnecessarily tedious. After observing the situation for months, we began to understand what kind of new user interface could meet the growing needs of our users.
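To give a sense of what answering those two questions in a notebook involves, here is a minimal sketch. It assumes only that a run behaves like a `metaflow.Run` from the Metaflow Client API: iterating it yields steps, and iterating a step yields tasks exposing `pathspec`, `successful`, `created_at`, and `finished_at`. The function name `summarize_run`, the helper `_task`, and the demo data are hypothetical stand-ins, so the block runs without a Metaflow deployment.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def summarize_run(run):
    """Return (durations, failed) for a run-like object.

    durations: list of (seconds, pathspec), slowest task first.
    failed:    pathspecs of tasks that did not complete successfully.
    """
    durations, failed = [], []
    for step in run:
        for task in step:
            if not task.successful:
                failed.append(task.pathspec)
            if task.finished_at is not None:
                seconds = (task.finished_at - task.created_at).total_seconds()
                durations.append((seconds, task.pathspec))
    durations.sort(reverse=True)
    return durations, failed


def _task(pathspec, seconds, ok):
    # Hypothetical stand-in for a Metaflow task object.
    start = datetime(2021, 10, 1, 2, 0, 0)
    return SimpleNamespace(
        pathspec=pathspec,
        successful=ok,
        created_at=start,
        finished_at=start + timedelta(seconds=seconds),
    )


# A fake run: a list of steps, each step being a list of tasks.
demo_run = [
    [_task("DemoFlow/12/start/100", 5, True)],
    [
        _task("DemoFlow/12/train/101", 900, True),
        _task("DemoFlow/12/train/102", 40, False),
    ],
]

durations, failed = summarize_run(demo_run)
print("slowest task:", durations[0])
print("failed tasks:", failed)

# With Metaflow installed, the same function should accept a real run, e.g.:
#   from metaflow import Flow
#   durations, failed = summarize_run(Flow("TrainingFlow").latest_run)
# ("TrainingFlow" is a hypothetical flow name.)
```

Even this small amount of code is more than most users want to write just to find a failed task, which is the gap the GUI is meant to close.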
Metaflow is a human-centered system by design. We consider our Python API and the CLI to be integral parts of the overall user interface and user experience, which focuses on making it easy to build production-ready ML projects from scratch. In our approach, Python code provides a highly expressive and productive user interface for expressing complex business logic, such as ML models and workflows. At the same time, the CLI allows users to execute specific commands quickly and even automate common actions. For complex, real-life development work like this, it would be difficult to achieve the same level of productivity in a graphical user interface.
However, textual UIs fall considerably short when it comes to discoverability and getting a holistic view of the state of the system. The questions we were hearing reflected this gap: we lacked a user interface that allowed users to find out, easily and quickly, what is going on in their Metaflow projects.
Netflix has a long history of developing innovative tools for observability, so when we started defining requirements for the new GUI, we were able to draw on experience from previous GUIs built for other use cases, as well as real-life user stories from Metaflow users. We wanted to scope the GUI tightly, focusing on a specific gap in the Metaflow experience:
- The GUI should allow users to see what flows and tasks are executing and what is happening inside them. Notably, we did not want to replace any functionality of the Metaflow APIs or CLI with the GUI – just to complement them. This means that the GUI is read-only: all actions such as writing code and starting executions should happen in the user's IDE and terminal as before. We also had no need yet to build a model-monitoring GUI, which is a completely different problem domain.
- The GUI should be aimed at professional data scientists. Instead of a fancy GUI for demos and presentations, we wanted a serious productivity tool with carefully thought-out user workflows that would fit seamlessly into our data science toolchain. This requires attention to detail: for example, users should be able to copy a link to any view in the GUI and share it, for instance on Slack, for easy collaboration and support (or to integrate with the Metaflow Slack bot). There should also be natural affordances for navigating between the CLI, the GUI, and notebooks.
- The GUI should be scalable and snappy: it should handle our existing repository of millions of runs, some of which contain thousands of tasks, without hiccups. Based on our experience with other GUIs operating at Netflix scale, this is not a trivial requirement: scalability needs to be baked into the design from the outset. Sluggish GUIs are difficult to debug and fix later, and they can have a significant negative impact on productivity.
- The GUI should integrate well with other GUIs. A modern ML stack consists of many independent systems such as data warehouses, compute layers, model serving systems and, in particular, notebooks. It should be possible to find a run or task of interest in the Metaflow GUI and use a task-specific view to jump to other GUIs for more information. Our landscape of tools is constantly evolving, so we did not want to hardcode these links and views in the GUI. Instead, following Metaflow’s integration-friendly ethos, we want to embed relevant information in the GUI through plugins.
- Finally, we wanted to minimize the operational overhead of the GUI. In particular, under no circumstances should the GUI affect Metaflow executions. The GUI backend should be a generic service, optionally sitting next to the existing Metaflow metadata service, providing a real-time view of the stored state. The frontend should be easily extensible and maintainable, which suggested that we wanted a modern, responsive app.
Since our ML platform team’s frontend resources were limited, we reached out to CodeMate to help with the implementation. As often happens in software engineering projects, the project took longer than expected to complete, mostly because tracking and visualizing thousands of concurrent objects in a highly distributed environment is a surprisingly non-trivial problem (duh!). After countless iterations, we are finally very happy with the result, which we have now been using in production for a few months.
When you open the GUI, you’ll see an overview of all flows and runs, both current and historical, which you can group and filter in a variety of ways:
We can use this view for experiment tracking: Metaflow automatically records every execution, so data scientists can track all their work using this view. Naturally, the view can be grouped by user. Users can also tag their runs and filter the view by tag, so they can focus on specific subsets of experiments.
After you click a particular run, you’ll see all of its tasks on a timeline:
The timeline view is extremely useful for understanding performance bottlenecks, seeing the distribution of task runtimes, and finding failed tasks. At the top, you can see the global attributes of the run, such as its status, start time, parameters, and so on. You can click on a specific task to see more details:
This task view shows the logs produced by a task, its results, and optionally links to other systems relevant to the task. For example, if the task has deployed a model to a model serving platform, the view may include a link to a UI used to monitor microservices.
As specified in our requirements, the GUI should work well with the Metaflow CLI. To make this easier, there is a navigation element in the top bar where the user can copy-paste any pathspec, that is, a path to any object in the Metaflow universe. Pathspecs are shown prominently in the CLI output. This way, the user can easily switch from the CLI to the GUI to monitor runs and tasks in detail.
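As an illustration of what such a path looks like, Metaflow pathspecs follow the pattern `FlowName/run_id/step_name/task_id`, where a shorter prefix refers to a flow, run, or step. The sketch below splits a pasted pathspec into its components. The `Pathspec` type and the `parse_pathspec` function are hypothetical helpers written for this example, not part of Metaflow or the GUI, and the sketch only covers the flow-to-task levels.

```python
from typing import NamedTuple, Optional


class Pathspec(NamedTuple):
    """Components of a pathspec; missing levels are None."""
    flow: str
    run_id: Optional[str] = None
    step: Optional[str] = None
    task_id: Optional[str] = None


def parse_pathspec(text: str) -> Pathspec:
    """Split 'Flow/run/step/task' (or any prefix of it) into components."""
    parts = text.strip().strip("/").split("/")
    # Accept 1 to 4 non-empty components: flow, run, step, task.
    if not 1 <= len(parts) <= 4 or not all(parts):
        raise ValueError(f"not a valid pathspec: {text!r}")
    return Pathspec(*parts)


# Hypothetical pathspecs as they might be copied from CLI output.
task_spec = parse_pathspec("TrainingFlow/1423/train/98765")
run_spec = parse_pathspec("TrainingFlow/1423")
print(task_spec)
print(run_spec)
```

A navigation component can use the number of components present to decide whether to open a flow, run, step, or task view.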
Although the CLI is great, it is a challenging medium for visualizing flows. Each flow can be represented as a directed acyclic graph (DAG), so the GUI provides a much better way to visualize a flow. The DAG view presents all the steps in a flow and how they relate to each other. Each step may have developer-provided comments. Steps are colored to indicate their current state. Split steps are grouped by shaded boxes, while steps participating in a foreach are grouped by a double-shaded box. Clicking on a step will take you to the task view.
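To make the DAG idea concrete, here is a small sketch that models a flow as a mapping from each step to its successors and computes an order in which the steps can be laid out or executed. The step names and the `topological_order` helper are hypothetical, and this is ordinary Kahn's algorithm for topological sorting rather than the way the GUI actually renders its DAG view.

```python
from collections import deque

# Hypothetical flow: `start` splits into two branches that are joined again.
flow_graph = {
    "start": ["fit_a", "fit_b"],
    "fit_a": ["join"],
    "fit_b": ["join"],
    "join": ["end"],
    "end": [],
}


def topological_order(graph):
    """Return the steps so that every step appears after its predecessors."""
    # Count incoming edges for every step.
    indegree = {step: 0 for step in graph}
    for successors in graph.values():
        for nxt in successors:
            indegree[nxt] += 1

    # Steps with no unprocessed predecessors are ready to be emitted.
    ready = deque(step for step, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in graph[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(graph):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


order = topological_order(flow_graph)
print(order)
```

The two branch steps share the same predecessor and successor, which is the kind of structure the DAG view groups visually as a split.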
Users at different organizations will probably have some special use cases that are not directly supported. For these, the Metaflow GUI is extensible through its plugin API. For example, Netflix has its own container orchestration platform called Titus. Users can configure tasks to use Titus to scale up or out. When failures happen, users need to access their Titus containers for more information, and in the task view, a simple plugin provides a link for further troubleshooting.
We know that our user stories and requirements for the Metaflow GUI are not unique to Netflix. Several companies in the Metaflow community have requested a GUI for Metaflow in the past. To support the thriving community and invite third-party contributions to the GUI, we are open-sourcing our monitoring GUI for Metaflow today!
You can find detailed instructions on how to deploy the GUI here. If you want to see the GUI in action before you deploy it, Outerbounds, a new startup founded by our former colleagues, has set up a public demo instance of the GUI. Outerbounds also hosts an active Slack community of Metaflow users where you can find support for GUI-related questions and share feedback and ideas for improvement.
With the new GUI, data scientists no longer have to fly blind. Instead of reaching out to a platform team for support, they can easily see the status of their workflows for themselves. We hope that Metaflow users outside of Netflix will find the GUI equally beneficial, and that companies will find creative ways to improve the GUI with new plugins.
For more background on the development process of the GUI and the motivation behind it, you can check out this recording of the GUI launch meetup.