4 Ways to Expose Your ML Model to the World
The problem
Your ML team has produced a model with high-scoring results after weeks of tuning. For fast prototyping, the team used a Jupyter Notebook to train it. Now you need to deploy, serve, and integrate the model in production with proper monitoring and scalability. Let's discuss how to do that.
Options considered
There are lots of great strategies for deploying ML models, yet there is no one solution that fits all situations. The right choice really depends on:
- The kind of data the model handles (tabular, images, text)
- The size of the dataset (is it small enough to fit in memory?)
- The use case (quick analysis, research, production)
- The team's expertise (beginner vs. expert)
- The production infrastructure (dedicated servers, virtual machines, or cloud computing)
These factors should define the best MLOps architecture for your case. The training framework you use (TensorFlow, PyTorch, Keras, or Scikit-learn) will dictate your packaging and serving options. Because models evolve over time, you will need a model registry to manage versions. Finally, your production infrastructure and use case will define how you package, secure, and deploy the model, and most importantly, how you monitor its performance.
Option 1: Integrate ML Model with Existing API written in Python (joblib)
When your API is written in Python and your model is built with Scikit-learn, this is the de facto standard. The architecture is very simple but has some limitations: the model must be small enough to be loaded into the API's memory space, and since resources are shared, it should require at most moderate compute. Joblib is a real workhorse when you want to expose the model as a REST API endpoint within your existing Python API and you do not plan frequent updates.
Joblib is a Python library for serializing and deserializing Python objects. Just two commands bridge the gap between the trained model artifact created in the Jupyter notebook and your API: `dump` and `load`. The model, which is typically a Scikit-learn Pipeline object, is saved to a file (usually on an S3 bucket or a shared disk). Joblib handles in-memory structures efficiently, especially NumPy arrays, and supports compression. For very large models it also supports memory mapping to avoid loading everything into RAM.
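A minimal sketch of that bridge, assuming a hypothetical `model.joblib` file path and a small Scikit-learn pipeline trained in the notebook:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# In the notebook: train a simple pipeline and persist it with compression.
X, y = load_iris(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)
joblib.dump(pipeline, "model.joblib", compress=3)

# In the API process: load the artifact once at startup and reuse it per request.
# For very large artifacts, joblib.load("model.joblib", mmap_mode="r") can
# memory-map the underlying NumPy arrays instead of reading them into RAM.
model = joblib.load("model.joblib")
print(model.predict(X[:5]))
```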
Simplicity comes at a cost. There is no serving engine, real-time model registry, or versioning out of the box. It's very useful in the early stage of model adoption, when the cost of building a complex MLOps pipeline outweighs the benefit.
The main caveats are related to model serialization. You must use the exact same version of Scikit-learn when you train your model and when you load it. Security is also a critical concern: loading a file with joblib (which uses the pickle format) can allow arbitrary code execution if the file comes from an untrusted source. Separately, to ensure model integrity and catch accidental corruption, you should also calculate and verify a hash (e.g., SHA-256) of the model file before loading it.
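A minimal sketch of that integrity check, assuming the expected digest was recorded at training time (the file name and digest value here are placeholders):

```python
import hashlib

import joblib

EXPECTED_SHA256 = "replace-with-the-digest-recorded-at-training-time"  # placeholder

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

model_path = "model.joblib"
if sha256_of(model_path) != EXPECTED_SHA256:
    raise RuntimeError(f"Model file {model_path} failed the integrity check")

# Only load artifacts you trust: joblib/pickle can execute arbitrary code on load.
model = joblib.load(model_path)
```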
Option 2: Integrate ML Model with Existing API not written in Python (ONNX)
The approach is very similar to the joblib strategy. Instead of loading a pickle file, the API loads an ONNX file. All major tools, such as PyTorch, TensorFlow, Keras, or Scikit-learn, can export an ONNX file, and that file can be loaded in any runtime, e.g. Java, Go, Node.js, or C#, and executed with the ONNX Runtime library.
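As a sketch, here is how a Scikit-learn model might be exported with the skl2onnx converter (a separate package) and then executed with the Python ONNX Runtime bindings; the other runtimes follow the same load-and-run pattern, and the model and file names are illustrative:

```python
import numpy as np
import onnxruntime as ort
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model and convert it to ONNX with an explicit input signature.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
onnx_model = convert_sklearn(
    model,
    initial_types=[("float_input", FloatTensorType([None, 4]))],
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Any ONNX Runtime (here the Python one) loads the file and runs inference.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
predictions = session.run(None, {input_name: X[:5].astype(np.float32)})[0]
print(predictions)
```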
ONNX (Open Neural Network Exchange) is a file format used to decouple your model from the training framework and from your existing non-Python API. It is not a self-contained application, just exported mathematical operators, the so-called opset (Operator Set), without pre-processing logic. Every runtime must implement its own tokenization, deserialization, and inference logic. If the model requires post-processing, every runtime has to implement its own version of that as well. This represents serious overhead in developing and maintaining the model.
The main friction point is the opset version. The opset version used during export must be supported by the runtime library used for inference. Problems will also occur if the model's architecture uses a new operator that isn't defined in the target opset specification; in that case the export process itself fails, because the converter won't know how to translate that custom or brand-new operation. Besides that, the tensor interface is not standardized, so separate documentation is needed to keep track of the expected shapes and data types.
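One way to reduce that friction is to inspect the exported file programmatically. A sketch using the `onnx` and `onnxruntime` packages, assuming the `model.onnx` file from the export example above:

```python
import onnx
import onnxruntime as ort

# Check which opset version(s) the exported model declares.
model = onnx.load("model.onnx")
for opset in model.opset_import:
    print(f"domain={opset.domain or 'ai.onnx'} opset_version={opset.version}")

# Inspect the (otherwise undocumented) tensor interface: names, shapes, dtypes.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for tensor in session.get_inputs():
    print("input:", tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape, tensor.type)
```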
Option 3: Deploy ML Model with BentoML
This approach is one of the simplest ways to deploy any kind of ML model with the least amount of boilerplate code. Thanks to BentoML's very opinionated architecture, you can easily bridge the gap between a model from a notebook and a scalable, production-ready API with high inference speed, even for heavy models. BentoML is packed with production-ready features, the most notable being adaptive micro-batching: it intelligently groups incoming requests to maximize GPU/CPU throughput, essentially for free.
BentoML is a Python-native framework for serving almost any kind of machine learning model in production. It provides an out-of-the-box solution for packaging, serving, generating an API, and deploying it. It has lots of built-in tools like a CLI, a model registry for versioning, and an exposed metrics endpoint for monitoring. Two main components are the Service and the Runner. A Service is the front-end API that handles web requests, while a Runner is a separate process that executes the actual model inference. For example, a TensorFlow model would have a TensorFlow Runner. It produces immutable and versioned artifacts called Bentos.
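A minimal sketch of that Service/Runner split, based on BentoML's 1.x Python API; the model tag `iris_clf:latest` and the service name are assumptions:

```python
import bentoml
from bentoml.io import NumpyNdarray

# The Runner wraps a model previously saved with bentoml.sklearn.save_model("iris_clf", model)
# and executes inference in its own worker process, where micro-batching happens.
iris_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

# The Service is the front-end API that handles web requests and delegates to the runner.
svc = bentoml.Service("iris_classifier", runners=[iris_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def classify(input_array):
    return await iris_runner.predict.async_run(input_array)
```

Assuming this lives in a file named `service.py`, `bentoml serve service:svc` starts a local server, and `bentoml build` (driven by a `bentofile.yaml`) packages the code, model, and dependencies into a versioned Bento.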
BentoML is not an all-in-one MLOps platform. It is purpose-built for model serving and deployment, and it excels at this. You should not use it if you are looking for a tool to handle experiment tracking, data versioning, or complex training pipelines. And last but not least, BentoML has its own cloud service (BentoCloud) for hosting and managing your models, but you can also deploy Bentos as you would any other Docker image.
Option 4: Deploy ML Model with MLflow
MLflow is an excellent end-to-end ML lifecycle platform. It is the right choice when governance, auditability, and formal lifecycle stages (e.g., staging, production) are required. MLflow's model format and registry are ideal for standardizing large-scale production deployments onto robust platforms such as Kubernetes, cloud-native services, or other high-performance serving tools; in practice that usually means Amazon SageMaker, Azure ML, Databricks, or even a pairing with BentoML. For local development it provides a built-in server out of the box.
MLflow provides a framework-agnostic model packaging system that bundles models with all dependencies into a standardized format. It excels at tracking experiments, logging parameters and metrics, and model management. The Model Registry provides a centralized hub for version control, allowing applications to programmatically pull the latest production model without code changes and enabling quick rollbacks if needed. Its unified API abstraction also helps prevent vendor lock-in by allowing the same deployment commands to work across different infrastructure targets, from local testing to different cloud platforms.
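A sketch of that workflow with the MLflow Python API; the tracking server URI, experiment name, and registered model name are assumptions:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumed: a tracking server with a registry-capable backend (see the setup note below).
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("iris-demo")

# Log parameters, metrics, and the model itself; register it in the Model Registry.
with mlflow.start_run():
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model", registered_model_name="iris-classifier")

# Elsewhere, the serving application pulls a registered version by name,
# without hard-coding a file path to a specific artifact.
loaded = mlflow.pyfunc.load_model("models:/iris-classifier/latest")
print(loaded.predict(X[:5]))
```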
MLflow configuration and setup are frequent hurdles. Self-hosting a production-ready MLflow server is a significant DevOps task, requiring a database backend (like PostgreSQL) and an artifact store (like S3).
Recommended solution
As you can see, there are many options for deploying ML models. The decision should be based on your use case, team size, and infrastructure needs. When you are just starting out, ONNX or joblib will probably do the trick, but BentoML is a great choice for most use cases and it is very easy to get started with.