Pinterest MLEnv: Empowering Machine Learning at Scale

Discover how Pinterest's MLEnv empowers engineers to deploy scalable machine learning models with ease.

Sep 04, 2024

👋 Hey! This is Manisha Arora from PrepVector. Welcome to the Tech Growth Series, a newsletter that aims to bridge the gap between academic knowledge and practical aspects of data science. My goal is to simplify complicated data concepts, share my perspectives on the latest trends, and share my learnings from building and leading data teams.

Introduction:

Pinterest’s mission is to bring everyone the inspiration to create a life they love. The company relies on an extensive suite of AI-powered products to connect over 460 million users to hundreds of billions of Pins, resulting in hundreds of millions of ML inferences per second and hundreds of thousands of ML training jobs per month, managed by just a couple of hundred ML engineers.

In 2021, machine learning at Pinterest was siloed: more than 10 different ML frameworks relied on various deep learning frameworks, versions, and boilerplate logic to connect with the ML platform. This fragmentation became a significant bottleneck for ML innovation at Pinterest, because each ML team spent immense engineering resources maintaining its own ML stack and there was limited knowledge sharing across teams.

To address these challenges, Pinterest introduced MLEnv—a standardized ML engine now leveraged by 95% of ML jobs at Pinterest, a substantial increase from less than 5% in 2021. Since the platform's launch, Pinterest has observed a 300% increase in the number of training jobs, achieved a world-class 88 Net Promoter Score (NPS) for MLEnv, and seen a 43% increase in the overall ML Platform NPS. This shift has significantly altered the landscape for ML innovations, delivering aggregate gains in Pinner engagement on the order of mid-double-digit percentages.

Motivation behind MLEnv

When the project began, Pinterest's machine learning development was highly fragmented, with each team managing its own unique ML stack. This lack of standardization created significant barriers to productivity and innovation, making it challenging for both ML engineers and platform engineers to work efficiently.

For ML engineers, maintaining their environments was a constant struggle. Each team was responsible for code quality, CI/CD pipelines, and runtime stability, which led to significant overhead. The complexity of integrating essential tools and frameworks, such as MLflow and Pinterest's internal ML platforms, added to their workload and slowed down development. Additionally, developing cutting-edge ML models required extensive, redundant efforts across teams, especially with the rise of large language models and generative AI. These tasks were often performed in isolation, with each team duplicating work without the benefits of shared knowledge or tools.

Platform engineers faced their own set of challenges. The diversity of ML stacks across the organization made it difficult to create standardized tools, limiting the platform team's ability to add value efficiently. Supporting multiple deep learning frameworks, such as TensorFlow and PyTorch, stretched platform resources thin, increasing development time and complexity. Furthermore, implementing software and hardware upgrades across teams proved challenging, leaving many stuck on outdated versions and limiting the ability to leverage the latest technologies.

This fragmented approach significantly hindered Pinterest’s ability to innovate and scale its ML capabilities effectively, highlighting the need for a more unified and standardized ML environment.

Pinterest’s MLEnv: Empowering Machine Learning at Scale

Pinterest’s MLEnv is a powerful platform designed to streamline and enhance the machine learning (ML) workflow, enabling teams to build, deploy, and manage models at scale. This environment was crafted to address the unique challenges Pinterest faces with its vast amounts of data and the need for personalized user experiences. MLEnv serves as the backbone of Pinterest’s ML infrastructure, supporting the rapid development and iteration of models that drive everything from content recommendations to ad targeting.

Seamless Integration and Scalability

At the core of MLEnv is its ability to integrate seamlessly with Pinterest’s existing data infrastructure, providing a unified platform where data scientists and engineers can collaborate efficiently. The environment offers robust tools for data preprocessing, feature engineering, and model training, all within a scalable architecture that can handle the demands of Pinterest’s diverse and growing user base. By standardizing the ML process, MLEnv not only improves productivity but also ensures consistency and reliability across the various models deployed on the platform.

Supporting Experimentation and Iteration

One of the standout features of MLEnv is its support for experimentation and iteration. Data scientists can quickly test and refine models, leveraging real-time feedback to make informed decisions. This iterative approach is crucial for Pinterest, where user preferences and behaviors are constantly evolving. MLEnv’s flexible design allows teams to adapt their models to changing data patterns, ensuring that the platform continues to deliver relevant and engaging content to its users.

Fostering Collaboration Across Teams

In addition to its technical capabilities, MLEnv fosters a culture of collaboration within Pinterest. By bringing together data scientists, engineers, and product teams on a single platform, it encourages cross-functional communication and innovation. This collaborative environment is essential for tackling the complex challenges that come with personalizing content for millions of users worldwide. MLEnv not only powers Pinterest’s current ML efforts but also lays the groundwork for future advancements in the field.

The Golden Age of ML at Pinterest

Following the general availability of MLEnv in late 2021, Pinterest entered a transformative period of rapid advancements in both machine learning (ML) modeling and platform capabilities. This era, often referred to as the "Golden Age of ML" at Pinterest, saw significant improvements in recommendation quality and the platform's ability to serve more personalized and inspiring content to its users. MLEnv played a pivotal role in this transformation, enabling Pinterest’s ML engineers to develop and deploy cutting-edge models with unprecedented speed and efficiency.

Shameless plug:

Machine Learning Engineering Bootcamp

Learn the intricacies of designing and implementing robust machine learning systems. This course covers essential topics such as ML architecture, data pipeline engineering, model serving, and monitoring. Gain practical skills in deploying scalable ML solutions and optimizing performance, ensuring your models are production-ready and resilient in real-world environments.

Accelerating ML Development Velocity

The introduction of MLEnv resulted in a dramatic increase in ML development velocity across Pinterest. By offloading much of the boilerplate engineering work and providing easy access to a comprehensive set of ML tools, MLEnv empowered engineers to focus on innovation rather than infrastructure. The platform’s intuitive interface and advanced capabilities have been game changers, allowing engineers to quickly develop, test, and deploy state-of-the-art ML models.

The impact of MLEnv on the productivity and satisfaction of Pinterest’s ML developers has been profound. The platform maintains an impressive Net Promoter Score (NPS) of 88, contributing to a 43% improvement in the overall ML Platform NPS. In some organizations within Pinterest, the NPS improved by as much as 93 points after the full rollout of MLEnv. As a result, teams can now run several times as many ML jobs and take models to online experimentation in days rather than months, significantly accelerating the pace of innovation.
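
For readers less familiar with the metric, here is a minimal sketch of how a Net Promoter Score is conventionally computed from 0 to 10 survey ratings: the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 through 6), giving a value between -100 and 100. The function name and the sample ratings are hypothetical and purely illustrative; they are not Pinterest's survey data or tooling.

```python
from typing import Iterable


def net_promoter_score(ratings: Iterable[int]) -> float:
    """Return the NPS (-100 to 100) for a collection of 0-10 survey ratings."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 0 or r > 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")

    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 through 6
    # Passives (7 or 8) count toward the total but toward neither group.
    return 100.0 * (promoters - detractors) / len(ratings)


# Illustrative survey (made-up numbers): 90 promoters, 8 passives, 2 detractors.
sample_ratings = [10] * 90 + [8] * 8 + [3] * 2
print(net_promoter_score(sample_ratings))  # prints 88.0
```

Because the score is a difference of two percentages, changes in NPS are often quoted in points, which is worth keeping in mind when comparing the figures above.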

ML Platform 2.0: A Unified Environment for Growth

MLEnv also marked a significant shift for Pinterest’s ML Platform team, allowing them to focus on building standardized tools and cutting-edge capabilities within a single, unified environment. This focus on a single ML environment enabled the team to reduce maintenance overhead and drive adoption of new functionalities at an unprecedented scale. One such example is the Training Compute Platform (TCP), Pinterest’s in-house distributed training platform. Before MLEnv, the team struggled with maintenance and adoption due to the diverse ML environments they had to support. With the unification brought by MLEnv, however, the team was able to streamline its efforts, resulting in explosive growth in the number of jobs on the platform and the rapid implementation of advanced features like distributed training and automated hyperparameter tuning.

Similarly, MLEnv revolutionized the ML serving platform at Pinterest, leading to a 100x improvement in serving efficiency through GPU serving. This achievement, which would have been nearly impossible under the previous paradigm, was made feasible by the unified environment MLEnv provided. By enabling collaboration between the Advanced Technology Group (ATG) and the ML Serving platform team, Pinterest was able to launch a groundbreaking project within six months, showcasing significant business metric improvements and enabling other major ML projects to scale rapidly.

Figure: GPU serving requires optimizations in both ML modeling and server architecture.

A Paradigm Shift in ML Development

MLEnv ushered in a new era at Pinterest where the lines between ML and ML Platform engineers blurred, fostering a unified goal of advancing Pinterest’s ML capabilities. Successful modeling architectures and innovations now propagate quickly across the platform, with teams able to experiment with and implement proven solutions in a matter of days. This rapid dissemination of successful models has had a significant impact on the platform’s ability to deliver personalized content and drive engagement.

The new development paradigm also encouraged greater collaboration among teams, leading to cross-functional efforts that target fundamental improvements in ML training and serving efficiency. Pinterest now has dedicated workgroups focused on advancing sophisticated ML model architectures, such as large embedding tables and graph convolutional neural networks. Additionally, teams contribute to the ML framework by building standardized tools like feature importance through integrated gradients and ML training orchestration frameworks, which are shared across the organization. This collaborative approach has not only enhanced Pinterest’s ML capabilities but also laid the groundwork for future innovations at scale.
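
To make one of those shared tools more concrete: integrated gradients attributes a prediction to each input feature by averaging the model's gradient along a straight path from a baseline input to the actual input, then scaling by the difference between the two. Below is a minimal sketch of the general technique on a toy logistic model. The model, its weights, and all function names are assumptions made for illustration and do not reflect Pinterest's implementation.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical toy model: a logistic score over three features.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.25


def model(x: Sequence[float]) -> float:
    """Sigmoid of a weighted sum of the features."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x: Sequence[float]) -> List[float]:
    """Analytic gradient of the model output with respect to each feature."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(
    x: Sequence[float],
    baseline: Sequence[float],
    grad_fn: Callable[[Sequence[float]], List[float]],
    steps: int = 200,
) -> List[float]:
    """Approximate integrated gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    # Average gradient along the path, scaled by (input - baseline).
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


features = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(features, baseline, model_grad)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `model(features) - model(baseline)`, with the gap shrinking as `steps` grows.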

Conclusion: The Impact of Standardization on Pinterest's ML Evolution

The standardization of machine learning at Pinterest through MLEnv has been a transformative force, driving significant improvements in ML developer velocity and sparking numerous innovations across the business. By offloading system and infrastructure challenges to a unified ML engine, Pinterest’s ML engineers can now focus solely on modeling improvements. This has led to significant enhancements in recommendation quality and personalized content, with teams running several times as many ML jobs and cutting the time to online experiments from months to days. Platform engineers have similarly benefited: by concentrating on building and iterating on a single, cutting-edge ML stack, they have delivered world-class outcomes, including a platform Net Promoter Score (NPS) of 88 and a 43% increase in overall ML Platform NPS.

The cultural shift driven by MLEnv has also fostered unprecedented collaboration across teams. Innovations in ML modeling and infrastructure, once developed by individual teams, now propagate quickly across Pinterest’s product surfaces, ensuring that advancements are widely shared and adopted. This collaborative environment has accelerated the pace of innovation, significantly improving business metrics and enhancing the platform’s ability to deliver more relevant and inspiring content to users. As Pinterest continues to evolve its ML capabilities, the future promises even greater possibilities for its Pinners, fueled by the ongoing advancements in machine learning.

If you liked this newsletter, check out my upcoming courses:

Product Data Science
Master product sense and A/B testing, and learn to use statistical methods to drive product growth. I focus on building a problem-solving mindset and applying data-driven strategies, including A/B testing, ML, and causal inference.

AI/ML Projects for Data Professionals
Gain hands-on experience and build a portfolio of industry AI/ML projects. Scope ML projects, get stakeholder buy-in, and execute the workflow from data exploration to model deployment. You will learn to use coding best practices to solve end-to-end AI/ML projects that you can showcase to employers or clients.
