Unveiling Michelangelo: Democratization of ML across Uber – Part 2/2
Let's dive into Part 2 of how Uber's Michelangelo platform revolutionizes machine learning at scale, enabling seamless model deployment and management while democratizing machine learning across the company.
👋 Hey! This is Manisha Arora from PrepVector. Welcome to the Tech Growth Series, a newsletter that aims to bridge the gap between academic knowledge and the practical aspects of data science. My goal is to simplify complicated data concepts, share my perspective on the latest trends, and pass along what I have learned from building and leading data teams.
In Part 1 of this blog, we discussed Uber's in-house ML platform, Michelangelo. Before Michelangelo, Uber struggled with a fragmented process for building and deploying machine learning models, hindered by the scale of its operations. Michelangelo was developed to address these challenges with an end-to-end system that standardized ML workflows across teams, enabling developers across Uber to easily build and deploy models at scale. Initially focused on scalable model training and production deployment, the platform later evolved to improve feature pipeline management and developer productivity, ensuring that Uber's ML capabilities could grow alongside its business.
Challenges with Michelangelo 1.0
Uber's Michelangelo 1.0 faced several challenges in its early stages:
Lack of comprehensive ML quality definition: There was no standardized way to assess the full spectrum of model quality. Teams mostly measured offline performance (AUC, RMSE) but ignored critical metrics like online performance, data freshness, and reproducibility. Additionally, a lack of ML project tiering led to uniform resource allocation, resulting in high-impact projects being under-prioritized.
Insufficient support for deep learning (DL) models: Uber's infrastructure favored tree-based models, creating difficulties in adopting advanced DL techniques. Teams like Maps ETA had to build their own DL toolkits, delaying progress.
Collaborative development issues: Michelangelo 1.0 wasn't designed for team collaboration, leading to difficulties with version control, the absence of a centralized code repository, and a lack of collaborative tooling for the UI and Jupyter Notebooks.
Fragmented tooling and developer experience: Multiple ML tools were developed across teams, creating a fragmented experience. Michelangelo 1.0 lacked unified UI and APIs, forcing developers to constantly switch between semi-isolated tools.
To address these challenges, Michelangelo 2.0 introduced a unified platform with four key themes: model quality and tiering, model iteration via Canvas, DL as a primary focus, and an integrated developer experience with MA Studio.
Michelangelo 2.0
The initial goal of Michelangelo was to bootstrap and democratize ML at Uber. By the end of 2019, most lines of business at Uber had integrated ML into their products. Michelangelo's focus then shifted from "enabling ML everywhere" to "doubling down on high-impact ML projects" so that developers could uplevel the performance and quality of these projects to drive higher business value for Uber. Given the complexity and significance of these projects, there was demand for more advanced ML techniques, particularly deep learning. Further, many different roles, such as data scientists and engineers, were often required to collaborate and iterate on models quickly, as shown below. This posed the challenges for Michelangelo 1.0 listed above.
ML lifecycle is iterative and collaborative with many different roles.
Architecture of Michelangelo
Michelangelo 2.0 is built around four key pillars, focusing on a modular, plug-and-play architecture to integrate both in-house and third-party components. It aims to improve the development and production experience for applied scientists and ML engineers by enhancing collaboration, reusability, and compliance.
Key architectural design principles include:
Prioritizing high-impact ML use cases through project tiering, while offering self-service for long-tail use cases.
Supporting both core workflows for typical ML needs and bespoke workflows for advanced use cases like deep learning.
Adopting a plug-and-play approach, while limiting the fully managed experience to a curated subset of components to keep the user experience coherent.
Taking an API-first approach, with UI-driven features for fast iteration and visualization, while supporting code-driven model iteration.
Codifying best ML practices such as safe model deployment, automatic retraining, and feature monitoring within the platform.
High-level concepts of Michelangelo 2.0 Architecture.
The system is structured into three planes:
Control plane: Manages APIs and the lifecycle of system entities, following Kubernetes™ conventions.
Offline data plane: Handles big data processing for model training, evaluation, and batch inference, leveraging tools like Ray™ or Spark™.
Online data plane: Manages real-time inference, feature serving, and near-real-time feature computation for online prediction.
Michelangelo 2.0's control plane uses Kubernetes™ Operator design for modularity, supporting UI and code-based workflows through a consistent, declarative API. This structure simplifies engineering complexity and reduces dependency on external infrastructure.
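The core of the Kubernetes Operator pattern referenced here is a declarative, reconciling API: users declare the desired state of an ML entity, and a controller repeatedly drives the observed state toward it. Below is a minimal, framework-free sketch of that idea; all entity names and fields are illustrative, not Michelangelo's actual API.

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    """Desired state, declared by the user (illustrative fields)."""
    name: str
    framework: str
    replicas: int

@dataclass
class ModelStatus:
    """Observed state, maintained by the controller."""
    replicas: int = 0
    phase: str = "Pending"

class ModelController:
    """Operator-style controller: each reconcile pass nudges the
    observed status one step closer to the declared spec."""
    def __init__(self):
        self.observed = {}  # name -> ModelStatus

    def reconcile(self, spec: ModelSpec) -> ModelStatus:
        status = self.observed.setdefault(spec.name, ModelStatus())
        if status.replicas < spec.replicas:
            status.replicas += 1          # scale up
        elif status.replicas > spec.replicas:
            status.replicas -= 1          # scale down
        status.phase = "Running" if status.replicas == spec.replicas else "Scaling"
        return status

controller = ModelController()
spec = ModelSpec(name="eta-ranker", framework="xgboost", replicas=3)
# The control loop runs until observed state matches desired state.
while controller.reconcile(spec).phase != "Running":
    pass
print(controller.observed["eta-ranker"].replicas)  # 3
```

The payoff of this style, as the article notes, is that UI- and code-driven workflows can share one consistent, declarative API surface.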
Detailed system design of Michelangelo including offline, online and control planes.
Model Quality and Project Tiering
Developing and maintaining a production-ready ML system is a complex process, involving multiple stages such as feature engineering, training, evaluation, and serving. However, a key challenge lies in the lack of comprehensive quality measurement across the ML model lifecycle, leaving developers with limited visibility into key performance indicators and hindering organizational decision-making.
Example ML quality dimensions (in yellow) in a typical ML system.
To address this, Uber introduced the Model Excellence Score (MES) framework. MES measures and monitors essential quality dimensions, including model accuracy, freshness, and prediction quality, across all stages of the lifecycle. Inspired by the Service Level Agreements (SLA) used in microservices management, MES ensures that ML models adhere to rigorous quality standards, while tracking compliance and visualizing performance metrics for better organizational oversight.
Complementing MES is the ML Project Tiering Scheme, which classifies projects into four tiers based on their business impact. Tier 1 projects, such as ETA calculations and fraud detection, receive the highest priority due to their direct influence on core business operations. In contrast, tier 4 projects are more experimental with less immediate business impact. This tiering system helps prioritize resources, enforce best practices, and ensure that high-impact ML projects receive the necessary attention and investment.
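The article doesn't publish MES's exact formula, but the SLA analogy suggests a shape like the sketch below: score each quality dimension, aggregate into one number, and gate it against a tier-dependent threshold. The dimension names, weights, and thresholds here are all assumptions for illustration, not Uber's actual values.

```python
# Hypothetical MES-style aggregation; weights and thresholds are
# illustrative, not Uber's real formula.
MES_WEIGHTS = {
    "offline_accuracy": 0.3,   # e.g. AUC / RMSE vs. a baseline
    "online_accuracy": 0.3,    # live prediction quality
    "feature_freshness": 0.2,  # staleness of serving features
    "reproducibility": 0.2,    # can training be replayed exactly?
}

# Minimum acceptable score per project tier (tier 1 = most critical).
TIER_THRESHOLDS = {1: 0.9, 2: 0.8, 3: 0.6, 4: 0.4}

def model_excellence_score(dimensions):
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(MES_WEIGHTS[d] * s for d, s in dimensions.items())

def meets_sla(dimensions, tier):
    """SLA-style check: does the model clear its tier's bar?"""
    return model_excellence_score(dimensions) >= TIER_THRESHOLDS[tier]

scores = {"offline_accuracy": 0.95, "online_accuracy": 0.90,
          "feature_freshness": 0.92, "reproducibility": 1.0}
print(round(model_excellence_score(scores), 3))  # 0.939
print(meets_sla(scores, tier=1))                 # True
```

A tier-1 model that dips below its threshold would trigger the compliance tracking and organizational visibility the MES framework is built for.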
Model iterations as code
In 2020, Uber launched Project Canvas to enhance developer productivity, improve collaboration, and elevate the quality of ML applications by integrating software engineering best practices into the ML lifecycle. Canvas standardized model development using version control, Docker containers, CI/CD tools, and predefined frameworks. Key components include:
ML Application Framework (MAF): Customizable ML workflow templates for complex techniques like deep learning.
ML Monorepo: Centralized code repository with robust version control.
ML Dependency Management: Ensures consistent project environments with Bazel and Docker builds.
ML Continuous Integration/Delivery (CI/CD): Automates deployment of models to production.
ML Artifact Management: Tracks and stores ML models, datasets, and evaluation reports.
MA Assistant (MAA): Michelangelo's AutoML solution for architecture search and feature optimization.
With these tools, Canvas empowered teams to test models locally and iterate rapidly in production, ensuring a smoother and more efficient ML development process.
Canvas: Streamlining end-to-end ML developer experience.
Deep Learning as a first-class platform citizen
Adopting advanced deep learning techniques, such as custom loss functions and incremental training, posed challenges for Uber, particularly due to the lack of support in Michelangelo 1.0. Unlike traditional models, DL models require sophisticated infrastructure, including feature transformation, model training, serving, and GPU resource management.
Feature Transformation
Michelangelo 1.0 used a Spark-based DSL for feature transformation, which was effective for traditional models but limited for DL models requiring GPU support. In Michelangelo 2.0, a new DL-native transformation solution was introduced, allowing users to apply transformations via Keras or PyTorch operators. This ensures low-latency serving on GPUs by combining the transformation and inference graphs.
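In Keras or PyTorch, "combining the transformation and inference graphs" means composing the feature transforms and the model into a single module, so serving executes one graph end to end on the GPU. Here is a framework-free sketch of that composition; the transforms and the stub model are illustrative only.

```python
class FusedModel:
    """Fuse feature transformations with the model so serving runs one
    graph end to end (in PyTorch this would be a single nn.Module)."""
    def __init__(self, transforms, model):
        self.transforms = transforms  # ordered feature transforms
        self.model = model            # the trained predictor

    def __call__(self, raw_features):
        x = raw_features
        for t in self.transforms:     # transformation graph
            x = [t(v) for v in x]
        return self.model(x)          # inference graph

# Illustrative transforms and a stub "model".
def standardize(v, mean=10.0, std=2.0):
    return (v - mean) / std

def clip(v, lo=-3.0, hi=3.0):
    return max(lo, min(hi, v))

def linear_model(x):
    weights = [0.5, -0.2, 0.1]
    return sum(w * v for w, v in zip(weights, x))

serve = FusedModel([standardize, clip], linear_model)
print(round(serve([12.0, 8.0, 10.0]), 3))  # 0.7
```

Because the request hits one fused callable rather than a DSL pass followed by a separate model call, there is no CPU/GPU round trip between transformation and inference, which is what keeps serving latency low.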
Model Training
Michelangelo 2.0 supports both TensorFlow and PyTorch for large-scale DL model training, utilizing Horovod for distributed training. Key improvements include:
Ray-based trainers for better scalability, replacing the Spark-based training framework.
Elastic Horovod for dynamic scaling and fault tolerance, ensuring minimal disruption during training.
Incremental training for resource-efficient retraining, improving dataset coverage without starting from scratch.
A declarative training pipeline in Canvas, allowing customization of model components like loss functions and estimators.
Example training pipeline in Canvas for a deep learning model.
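The key idea behind incremental training is warm-starting: resume from the last trained weights on new data instead of re-fitting the full history from scratch. The real pipeline would checkpoint a TensorFlow or PyTorch model; below is a minimal stdlib sketch of the same idea on a 1-D linear model, with all data and hyperparameters invented for illustration.

```python
def sgd_fit(data, w=0.0, b=0.0, lr=0.05, epochs=200):
    """Fit y ≈ w*x + b by batch gradient descent, starting from (w, b)."""
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in data:
            err = (w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

# Initial training on last week's data (true relation: y = 2x + 1).
week1 = [(x, 2 * x + 1) for x in range(5)]
w, b = sgd_fit(week1)

# Incremental retraining: warm-start from (w, b) on this week's data
# only, instead of re-fitting the full history from scratch.
# (Smaller lr because the new x values are larger.)
week2 = [(x, 2 * x + 1) for x in range(5, 10)]
w, b = sgd_fit(week2, w=w, b=b, lr=0.01, epochs=50)
print(abs(w - 2) < 0.1 and abs(b - 1) < 0.5)  # True
```

The retraining pass touches only the new data yet stays near the known-good solution, which is the resource-efficiency win the article describes.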
Model Serving
Uber's tier-1 ML projects require ultra-low-latency serving. To meet this need, Michelangelo 2.0 replaced the deprecated Neuropod engine with Triton, a GPU-optimized serving engine from Nvidia. Triton supports multiple frameworks, including TensorFlow, PyTorch, and XGBoost, abstracting complexity for users while maintaining performance.
GPU Resource Management
Uber manages over 5,000 GPUs across on-premise data centers and cloud providers like OCI and GCP. To maximize utilization, Michelangelo 2.0 incorporates elastic CPU and GPU resource sharing, allowing teams to use idle resources opportunistically. A job federation layer across Kubernetes clusters enhances portability and simplifies cloud migration.
With these advancements, Uber significantly increased DL adoption across its business lines, particularly for tier-1 projects. For example, the DeepETA model, trained on over one billion trips, highlights the scale and impact of DL at Uber, with deep learning now powering over 60% of Uber’s tier-1 ML projects.
MA Studio – One unified Web UI tool for everything ML @ Uber
To streamline the ML developer experience, Michelangelo (MA) Studio was developed to unify all platform capabilities into a single user journey, improving productivity through a redesigned UI and UX. It simplifies every step of the ML lifecycle—from feature preparation and model training to deployment and performance monitoring—within one platform.
MA Studio project landing page covering the end-to-end ML development life-cycle.
Key Features of MA Studio:
Version Control and Code Review: All ML code and configurations are version-controlled, with mandatory code reviews, ensuring high-quality production models.
Modern Model Deployment: Offers safe, incremental rollouts, automatic rollback, and runtime validation.
Unified ML Observability: Integrates model and feature monitoring, consistency checks, and Model Excellence Score (MES).
Lifecycle Management: Simplified management of ML entities like models, datasets, and pipelines through an intuitive UI.
Enhanced Debugging: Accelerated recovery from ML pipeline failures with advanced debugging tools.
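"Safe, incremental rollouts with automatic rollback" typically means shifting traffic to the new model in stages and reverting the moment a health signal breaches its budget. The sketch below shows that control logic in simplified form; the stage fractions, error budget, and probe function are assumptions, not MA Studio's actual mechanism.

```python
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of traffic per stage
ERROR_BUDGET = 0.02                        # max tolerated error rate

def deploy_incrementally(error_rate_at, stages=ROLLOUT_STAGES,
                         budget=ERROR_BUDGET):
    """Advance traffic stage by stage; roll back on a budget breach.

    `error_rate_at(fraction)` is a hypothetical probe returning the
    observed error rate while `fraction` of traffic hits the new model.
    """
    for fraction in stages:
        if error_rate_at(fraction) > budget:
            return {"status": "rolled_back", "at_stage": fraction}
    return {"status": "deployed", "at_stage": 1.0}

# Healthy candidate: errors stay under budget at every stage.
print(deploy_incrementally(lambda f: 0.005))
# Faulty candidate: errors spike once 25% of traffic is shifted.
print(deploy_incrementally(lambda f: 0.10 if f >= 0.25 else 0.01))
```

The staged ramp is what limits blast radius: the faulty candidate above is caught while serving only a quarter of traffic, and the platform reverts without human intervention.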
MA Studio and Canvas serve as Uber's two primary ML tools. Studio covers standard workflows, such as XGBoost model training with minimal coding, while Canvas is designed for more advanced use cases such as deep learning training. Both platforms ensure seamless execution, deployment, and monitoring of models, with all code changes subject to review, elevating the overall quality of Uber's ML applications.
Generative AI and LLM Integration at Uber
Recent advancements in generative AI, especially large language models (LLMs), are set to transform how machines interact with natural language. Teams at Uber are exploring LLMs to boost internal productivity, automate business processes, and enhance user experience, while addressing LLM-related challenges.
Three categories of generative AI use cases at Uber
For generative AI development, Uber teams use both external LLMs (via third-party APIs) and internally hosted open-source LLMs. External models excel in tasks requiring general knowledge, while open-source models are fine-tuned using Uber's proprietary data, achieving high accuracy and lower latency at reduced costs. To streamline this process, the Gen AI Gateway was developed, offering a unified interface for accessing both external and internal LLMs while ensuring privacy, security, and cost management. Key capabilities include:
Logging and Auditing: Comprehensive tracking for accountability.
Cost Guardrails and Attribution: Managing expenses and alerting for overuse.
Safety & Policy Compliance: Ensuring adherence to internal guidelines.
PII Redaction: Safeguarding user data before interacting with external LLMs.
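The four gateway capabilities above can be pictured as a thin wrapper around any LLM backend: redact, log, meter, then forward. Here is a stdlib sketch of that flow; the regex patterns, cost figures, and class names are illustrative (a production gateway would use a dedicated PII-detection service, not two regexes).

```python
import re

# Illustrative PII patterns only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}

class GenAIGateway:
    """Sketch of a unified LLM gateway: redact PII, audit, track cost."""
    def __init__(self, backend, cost_per_call=0.002):
        self.backend = backend          # external or in-house LLM client
        self.cost_per_call = cost_per_call
        self.audit_log = []
        self.total_cost = 0.0

    def redact(self, prompt):
        for label, pattern in PII_PATTERNS.items():
            prompt = pattern.sub(f"<{label}>", prompt)
        return prompt

    def complete(self, prompt):
        safe = self.redact(prompt)      # PII never leaves the gateway
        self.audit_log.append(safe)     # auditing: store redacted prompt
        self.total_cost += self.cost_per_call  # cost attribution
        return self.backend(safe)

gw = GenAIGateway(backend=lambda p: f"echo: {p}")  # stub LLM backend
print(gw.complete("Contact rider at jane@example.com or 415-555-0199"))
print(round(gw.total_cost, 3))
```

Because every call funnels through one choke point, swapping an external API for an internally hosted model is a one-line backend change, while logging, cost guardrails, and redaction stay identical.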
To accelerate generative AI development, Michelangelo now supports full LLMOps capabilities, including fine-tuning, prompt engineering, deployment, and performance monitoring. The key components are:
Model Catalog: A repository of pre-built LLMs (e.g., GPT-4, Llama 2) that can be fine-tuned or deployed for serving.
LLM Evaluation Framework: Allows comparison of different LLM approaches (e.g., in-house vs. third-party fine-tuned models) and tracks prompt/model improvements.
Prompt Engineering Toolkit: A centralized tool for creating, testing, and version-controlling prompts.
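"Version-controlling prompts" usually means treating each prompt template like a released artifact: immutable versions, content hashes, and the ability to pin a serving path to an exact revision. Below is a small stdlib sketch of such a registry; all names and the storage layout are illustrative, not the actual toolkit's design.

```python
import hashlib
from datetime import datetime, timezone

class PromptRegistry:
    """Sketch of version-controlled prompts: every publish creates an
    immutable version pinned by a content hash."""
    def __init__(self):
        self.versions = {}   # (name, version) -> entry
        self.latest = {}     # name -> latest version number

    def publish(self, name, template):
        version = self.latest.get(name, 0) + 1
        self.versions[(name, version)] = {
            "template": template,
            "sha": hashlib.sha256(template.encode()).hexdigest()[:12],
            "created": datetime.now(timezone.utc).isoformat(),
        }
        self.latest[name] = version
        return version

    def render(self, name, version, **params):
        """Fill a pinned template; callers never see silent edits."""
        return self.versions[(name, version)]["template"].format(**params)

reg = PromptRegistry()
v1 = reg.publish("support_reply", "Summarize this ticket: {ticket}")
v2 = reg.publish("support_reply", "Summarize in one line: {ticket}")
print(v1, v2)  # 1 2
print(reg.render("support_reply", v2, ticket="app crashed on login"))
```

Pinning by version is what makes the evaluation framework's A/B comparisons meaningful: two runs can reference exactly which prompt revision produced which results.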
To support cost-effective fine-tuning and low-latency serving, Michelangelo's training and serving stack includes:
Integration with Hugging Face: Uses Ray-based trainers for LLMs, with fine-tuned models stored in Uber’s repository.
Model Parallelism with Deepspeed: Enables training larger models beyond GPU memory constraints.
Elastic GPU Resource Management: Ray clusters on powerful GPUs enhance LLM training and future scalability through cloud GPUs.
These enhancements empower Uber’s teams to build LLM-powered applications, with ongoing advancements to be shared soon.
Conclusion
Uber's ML platform, Michelangelo, has been a driving force behind the company’s business transformation over the past eight years. Its journey can be divided into three phases: the foundational phase (2016-2019) focused on predictive ML, a deep learning phase (2019-2023), and the recent venture into generative AI starting in 2023.
Key lessons from building Michelangelo include the importance of a centralized ML platform for boosting development efficiency, especially in larger companies. A well-organized structure, with a central ML platform team and embedded data scientists/ML engineers, is essential for smooth operations.
Offering both UI-based and code-driven workflows is crucial for meeting diverse developer preferences while maintaining a modular platform architecture ensures flexibility for integrating new technologies. The combination of high-level abstraction layers for standard workflows and low-level access for advanced users has proven highly effective.
Uber’s experience also shows that deep learning should be applied selectively, as traditional models like XGBoost can sometimes deliver better performance and lower costs. A clear ML project tiering system helps allocate resources efficiently and maintain focus.
Michelangelo’s mission is to empower developers to rapidly build and iterate high-quality ML applications at scale, driving innovation, collaboration, and a strong ML culture at Uber.
If you liked this newsletter, check out my upcoming courses:
Master product sense and A/B testing, and learn to use statistical methods to drive product growth. I focus on instilling a problem-solving mindset and applying data-driven strategies, including A/B testing, ML, and causal inference.
AI/ML Projects for Data Professionals
Gain hands-on experience and build a portfolio of industry AI/ML projects. Scope ML projects, get stakeholder buy-in, and execute the workflow from data exploration to model deployment. You will learn to apply coding best practices to end-to-end AI/ML projects you can showcase to employers or clients.
Machine Learning Engineering Bootcamp
Learn the intricacies of designing and implementing robust machine learning systems. This course covers essential topics such as ML architecture, data pipeline engineering, model serving, and monitoring. Gain practical skills in deploying scalable ML solutions and optimizing performance, ensuring your models are production-ready and resilient in real-world environments.
Not sure which course aligns with your goals? Send me a message on LinkedIn with your background and aspirations, and I'll help you find the best fit for your journey.