From model-centric to data-centric MLOps
Tags: AI/ML , AIML , Kubeflow , machine learning , MLOps
MLOps (short for machine learning operations) is slowly evolving into an independent approach to the machine learning lifecycle that includes all steps – from data gathering to governance and monitoring. It will become a standard as artificial intelligence is moving towards becoming part of everyday business, rather than an innovative activity.
Over time, there have been different approaches used in MLOps. The most popular ones are model-driven and data-driven approaches. The split between them is defined by the main focus of the AI system: data or code. Which one should you choose? The decision challenges data scientists to choose which component will play a more important role in the development of a robust model. In this blog, we will evaluate both.
Model-driven development focuses, as the name suggests, on machine learning model performance. It uses different methods of experimentation in order to improve the performance of the model, without altering the data. The main goal of this approach is to work on the code and optimise it as much as possible. It includes code, model architecture and training processes as well.
If you look deeper into this development method, the model-driven approach is all about high-quality ML models. What it means, in reality, is that developers focus on using the best set of ML algorithms and AI platforms. The approach also is a basis for great advancements in the AI space, such as the development of specialised frameworks like Tensorflow or PyTorch.
Model-centric development has been around since the early days of the discipline, so it benefits from widespread adoption across a variety of AI applications. The reason for this can be traced back to the fact that AI was initially a research-focused area. Historically, this approach was designed for challenging problems and huge datasets, which ML specialists were meant to solve by optimising AI models. It has also been driven by the wide adoption of open source, which allows free access to various GitHub repositories. Model-driven development encourages developers to experiment with the latest bits of technology and try to get the best results by fine-tuning the model. From an organisational perspective, it is suited for enterprises which have enough data to train machine-learning models.
When it comes to pitfalls, the model-centric approach requires a lot of manual work at the various stages of the ML lifecycle. For example, data scientists have to spend a lot of time on data labelling, data validation or training the model. The approach may result in slower project delivery, higher costs and little return on investment. This is the main reason why practitioners considered trying to tackle this problem from a different perspective with data-centric development.
As it is often mentioned, data is the heart of any AI initiative. The data-centric approach takes this statement seriously, by systematically interacting with the datasets in order to obtain better results and increase the accuracy of machine learning applications.
When compared to the model-centric approach, in this case, the ML model is fixed, and all improvements are related to the data. These enhancements range from better data labelling to using different data samples for training or increasing the size of the data set. This approach improves data handling as well, by creating a common understanding of the datasets.
The data-centric approach has a few essential guidelines that look after:
- Data labelling
- Data augmentation
- Error analysis
- Data versioning
Data labelling for data-centric development
Data labelling assigns labels to data. The process provides information about the datasets that are then used by algorithms to learn. It emphasises both content and structure information, so it often includes various data types, measurement units, or time periods represented in the dataset. Having correct and consistent labels can define the success of an AI project.
Data-centric development often highlights the importance of correct labelling. There are various examples of how to approach it; the key goal is avoiding inconsistencies and ambiguities. Below you can find an image that Andrew Ng offers as an example of data labels in practice. In this case, the labels are used for two adjectives: inconsistency and ambiguity.
Data augmentation for data-centric development
Data augmentation is a process that consists of the generation of new data based on various means, such as interpolation or explorations. It is not always needed, but in some instances, there are models that require a larger amount of data at various stages of the ML lifecycle: training, validation, and data synthesis.
Whenever you perform this activity, checking data quality and ensuring the elimination of noise is also part of the guidelines.
Error analysis for data-centric development
Error analysis is a process performed once a model is trained. Its main goal is to identify a subset that can be used for improving the dataset. It is a task that requires diligence, as it needs to be performed repeatedly, in order to get gradual improvements in both data quality and model performance.
Data versioning for data-centric development
Data versioning tracks changes that happen within the datasets, in order to identify performance changes within the model. It enables collaboration, eases the data management process and fastens the delivery of machine learning pipelines from experimentation to production.
When it comes to pitfalls, the data-centric method struggles mostly with data. On one hand, it can be hard to manage and control. On the other hand, it can be biased if it does not represent the actual population, leading to models that underperform in real life. Lastly, because of the data requirements, it can easily be expensive or suitable only for projects which have collected data for a longer period of time.
Model-centric and data-centric development with MLOps
In reality, both of these approaches are tightly linked to MLOps. Regardless of the option that data scientists choose, they need to follow MLOps guidelines and integrate their method within the tooling that they choose. Developers can use the same tool but have different approaches across different projects. The main difference could occur at the level of the ML lifecycle where changes are happening. It’s important to note that the approach will affect how the model is optimised for the specific initiative, so choosing it with care is important to position your project for success.
Charmed Kubeflow is an end-to-end MLOps tooling, designed for scaling machine learning models to production. Because of its features and integrations, it has the ability to support both model-centric and data-centric development. It is an open-source platform, which encourages contributions and represents the foundations of a growing MLOps ecosystem that Canonical is moving towards, with various integrations at various levels: hardware, tooling and AI frameworks.