Boost Python models and pipelines with rubicon-ml
Authored by Sri Ranganathan, director of machine learning engineering, and Ryan Soley, senior machine learning engineer
Python is a great language for data science and machine learning (ML). Still, it can take time to get your models and pipelines production-ready.
We wanted to share a little about how an open-source tool developed at Capital One has been helping Python developers save time and energy for years.
What is rubicon-ml, and how does it work?
Rubicon-ml is a data science tool that captures and stores information about how machine learning models are trained and executed. The name comes from the legend of Caesar crossing the Rubicon, which today is synonymous with “passing the point of no return.” It signifies that, by using the library, we commit to a repeatable and auditable model development process.
Rubicon-ml standardizes the Python model development lifecycle, allowing developers to capture and store model training and execution information, including parameters and outcomes, in a repeatable, searchable manner. Its git integration links inputs and outputs directly to the model code that produced them, giving developers and stakeholders full auditability and reproducibility. And its dashboard makes it easy to discover, filter, visualize, and share recorded work while you experiment.
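In practice, that looks something like the sketch below: a project holds experiments, and each experiment records its parameters and metrics (the names and values here are illustrative). With git integration enabled, rubicon-ml also records the active branch and commit alongside each experiment.

```python
from rubicon_ml import Rubicon

# Log to the local file system; auto_git_enabled records the active
# branch and commit hash alongside each experiment.
rubicon = Rubicon(
    persistence="filesystem",
    root_dir="./rubicon-root",
    auto_git_enabled=True,
)

project = rubicon.get_or_create_project("my classifier")
experiment = project.log_experiment()

# Illustrative parameter and metric names and values.
experiment.log_parameter(name="n_estimators", value=100)
experiment.log_metric(name="accuracy", value=0.92)
```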
Rubicon is lightweight in that much of its work relies on other open source tools. Being open source itself (and powered by integrations with familiar ML tools), it can be used by anyone and follows a structure most data scientists already know, so they don’t have to do the “grunt work” of piecing together their own solution for filtering, visualizing, and sharing recorded work from ML experiments.
Reducing the tedium and pitfalls of ML model metadata tracking
At a 10,000-foot level, rubicon appears to be a simple logging library, but it’s much more than that. Any team can throw something together to write metadata to a system, but these thrown-together solutions often cause issues down the road (for example, when a company’s model risk team needs to review ten instances of the same model that have logged ten different schemas’ worth of metadata). Rubicon’s goal is to unify that metadata in the most lightweight, straightforward, and therefore versatile way possible.
Every ML engineer has had the experience of running the same experiment (what feels like) a billion times in the model development lifecycle. It’s an extremely tedious process: defining your parameters, then logging how the model behaved (the model performed [fill in the blank] when I used X, Y, and Z parameters as inputs, and this was the state of the code at that particular point in time).
Done manually, all of these tasks are exceedingly tedious yet demand great attention to detail, and it’s very easy to make mistakes.
With rubicon, ML pros can focus on the model while the tool takes care of the rest. It lets them do the job they are good at, absorbing the engineering burden of logging, sharing, visualization, and more.
How rubicon-ml leverages existing tools for logging, sharing, and visualization
At a high level, the rubicon-ml library has three main components: logging, sharing, and visualization, each based on one core external library. While the library is structured to be approachable from the standpoint of general ML best practices, it’s very flexible: users can define their hierarchy however they want via the logging library.
Logging
The logging library is built on top of an open source library called File System Specification, or fsspec for short. Fsspec is heavily leveraged as the backbone of popular libraries like pandas and Dask. It provides familiar file system operations and a unified way to interface with many different backend types, like S3 buckets, Azure, Google Cloud, and local file systems.
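As a quick illustration of why fsspec makes such a good backbone, here’s a minimal sketch (the paths and bucket name are hypothetical) showing the same interface working across backends:

```python
import fsspec

# "file" is the local file system; the same API applies to "s3",
# "memory", and many other backends.
fs = fsspec.filesystem("file")
fs.makedirs("/tmp/rubicon-demo", exist_ok=True)

with fs.open("/tmp/rubicon-demo/example.txt", "w") as f:
    f.write("logged via fsspec")

print(fs.ls("/tmp/rubicon-demo"))

# Swapping backends is just a protocol change (requires s3fs):
# s3 = fsspec.filesystem("s3")
# s3.ls("my-bucket/rubicon-root")  # hypothetical bucket
```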
Out of the box, rubicon can log to S3 as well as to local file systems, which extends to mounted network file systems, and to in-memory storage for exploratory work.
Rubicon users will find in the fsspec docs that, beyond those more popular backend choices, there are at least a dozen others that can be plugged in. Each layer of the rubicon library was designed independently of the others so they can be mixed and matched, whether that means different front ends (like a given Python library) or different backends, as the sketch below shows.
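Switching rubicon-ml between backends, for example, is just a constructor argument. A minimal sketch (the bucket name is a placeholder):

```python
from rubicon_ml import Rubicon

# In-memory backend for quick exploratory work.
rubicon_memory = Rubicon(persistence="memory")

# Local file system backend; also covers mounted network file systems.
rubicon_local = Rubicon(persistence="filesystem", root_dir="./rubicon-root")

# S3 backend; fsspec resolves the "s3://" protocol (requires s3fs).
rubicon_s3 = Rubicon(persistence="filesystem", root_dir="s3://my-bucket/rubicon-root")
```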
Sharing
When managing machine learning projects, many professionals end up with experiments on the order of thousands, or even tens of thousands. When it comes time to share them, the team members performing reviews (Model Risk Offices and the like) generally don’t need to see all of them; only a few will have helpful or relevant information. So a customizable, concise sharing process becomes essential when communicating experiments and results with collaborators.
Rubicon’s sharing functionality leverages a library called intake, which condenses disparate data sources into a single YAML file for easier sharing.
Essentially, we leverage intake to collect the minimum amount of metadata necessary to point people back to the experiments that the experimenters deem essential.
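A minimal sketch of that workflow using rubicon-ml’s publish helper, which writes an intake catalog (the file and project names are illustrative):

```python
import intake
from rubicon_ml import Rubicon, publish

rubicon = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
project = rubicon.get_or_create_project("my classifier")

# Publish only the experiments worth reviewing to a single YAML catalog.
publish(project.experiments(), output_filepath="./my_catalog.yml")

# A collaborator reads the shared experiments back with intake.
catalog = intake.open_catalog("./my_catalog.yml")
```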
Visualization
Visualizations for rubicon-ml are all built on Dash and Plotly, which provide a Python-native way to define the UI. Rubicon’s visualizations are all intended to be run locally within the logging library, pulling data from wherever it resides and rendering it wherever the user is running the UI, so the visualization experience is identical across devices. Teams can shed the burden of hosting a UI; each user simply runs it locally, as long as everyone references the same data source.
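Launching the dashboard locally might look like this sketch (assuming experiments have already been logged to the shared backend):

```python
from rubicon_ml import Rubicon
from rubicon_ml.viz import Dashboard

rubicon = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
project = rubicon.get_or_create_project("my classifier")

# Serve the Dash-based dashboard locally, pulling experiment data
# from whatever backend the project was logged to.
Dashboard(experiments=project.experiments()).serve()
```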
Dash itself has abundant documentation on how to dockerize and host Python code written with Dash.
How rubicon-ml handles integrations
The ideal future state of rubicon is one of seamless integration with any library or tool that data scientists and ML model engineers find useful.
Currently, rubicon interacts with open source libraries in two ways. First, there are the libraries at the core of rubicon that power logging, sharing, and visualization.
But rubicon also integrates into other libraries at a higher level. Take scikit-learn, for example: while it’s not core to rubicon’s functionality, we have built-in accommodations for its basic features.
Essentially, instead of using scikit-learn to build rubicon, we’re building rubicon into scikit-learn for a native logging experience, saving users time and energy and sparing them the headache of building a large number of logging statements into their code base.
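That integration centers on RubiconPipeline, a drop-in stand-in for scikit-learn’s Pipeline that logs an experiment for each fit. A minimal sketch (the data and estimators are illustrative):

```python
from rubicon_ml import Rubicon
from rubicon_ml.sklearn import RubiconPipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rubicon = Rubicon(persistence="memory")
project = rubicon.get_or_create_project("sklearn demo")

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# A drop-in replacement for sklearn's Pipeline that logs each fit's
# parameters to a new rubicon experiment automatically.
pipeline = RubiconPipeline(
    project,
    [("scaler", StandardScaler()), ("clf", RandomForestClassifier())],
)
pipeline.fit(X, y)

print(len(project.experiments()))  # each fit logs a new experiment
```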
How does rubicon-ml fit into the broader work of the Open Source Program Office and ML at Capital One at large?
In 2015, Capital One established the Open Source Program Office to set standards around process, technical review, licensing, compliance, risk, legal, communications, and other elements. The office works to maintain and improve the quality of the company’s open source projects, collaborating with teams across the company to ensure that open source is used effectively and efficiently. One such team is the machine learning team, which uses open source tools (among other proprietary and enterprise tools) to build and train models.
The rubicon-ml project was created to help boost the performance of Python models and pipelines. It is designed to work with various machine learning frameworks, making it easy to integrate into existing workflows. The project has been heavily tested and is used in production by many of Capital One’s machine learning teams.
The Open Source Program Office is excited to continue working on rubicon-ml and other projects like it that will help make Capital One’s machine learning efforts more successful.
What’s next for rubicon-ml?
First, because we open source our general library via GitHub, rubicon-ml is always improving. Please contribute to our library and roadmap, and feel free to request features. We fold your feedback and learnings back into rubicon in a more general form, improving the product for everyone who uses it.
Second, if you’re a data scientist or a model developer, we want to hear from you! If there are any libraries you regularly use, we’d love to know. We’re always looking for more libraries to seamlessly integrate into rubicon (like we’ve done with scikit-learn) so that you can automatically get all of rubicon’s value with very little up-front intervention.
The Featured Blog Posts series highlights posts from partners and members of the All Things Open community leading up to ATO 2023.