Most deployments follow the same pattern: you build and train on GPUs using a de facto standard environment based on CUDA for R&D. Then, almost by habit, you assume that deployment also has to be CUDA-centric if you want high performance.
That assumption is expensive. It nudges you into architecting your entire production stack around a single proprietary vendor. Hardware choice shrinks, complexity grows, and you end up feeling locked in.
Here’s the reality: your trained models are already portable. A high-performance, efficient, open-standard deployment path not only exists – it’s simpler than you think.
Your Trained Model is a Blueprint, Not a Binary
The first mindset shift is understanding the difference between the environment you trained in and the artifact you produced. You may have trained with CUDA, but the model file you save is not a CUDA program.
Inside your model file is a graph of operations—convolutions, matrix multiplications, ReLUs, and more—along with the learned weights or parameters associated with those operations. In essence, it is a mathematical recipe: a precise description of what to compute, in what order, and with which parameters.
What you won’t find in the model file is any CUDA code or any hardwired notion of GPU kernels or drivers. A convolution operation remains a convolution operation, regardless of whether it is executed on a GPU, a CPU, or an NPU like Cervell. The model itself doesn’t care about the hardware; it only requires that the computation be carried out correctly.
That decoupling between the mathematical blueprint and the execution environment is what makes portability possible. It frees your model from vendor lock-in and opens the door to deploying across diverse hardware without compromise.
The Deployment Path: A Tale of Two Compilers
Once you have a portable model, you need to run it efficiently. This is where you face a choice in deployment strategy.
The Proprietary Path: TensorRT
It is a common "best practice" to use TensorRT for deployment in a CUDA environment.
But let's be clear about what TensorRT is: it's a powerful, vendor-specific compiler and optimizer. You feed it your portable model (often ONNX), and it produces a .plan file. This file is a "black box" binary, a highly optimized, but opaque, execution plan.
The catch? This plan is tightly coupled to a specific proprietary GPU architecture and, often, a specific driver version.
That’s the origin of the “lock-in” feeling. It works well on that one platform, but it’s a dead end for everything else.
The Open Path: ONNX Runtime
The alternative is ONNX Runtime (ONNX-RT), a universal, high-performance inference engine.
ONNX-RT doesn’t spit out a vendor-locked binary. Instead, it runs your standard .onnx model directly, using a flexible Execution Provider (EP) system: the CPU EP for generic CPUs, the CUDA EP for NVIDIA GPUs, and, in our case, the Semidynamics EP for the Cervell NPU.
You still get advanced optimizations – graph rewrites, operator fusion, quantization to INT8/INT4 – but all within an open, standard framework, not a closed compiler. You don’t need a proprietary tool to get serious performance.
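To make the EP idea concrete, here is a minimal sketch using stock ONNX Runtime on a plain CPU; a hardware EP would simply go first in the same priority list. The one-node model is built in memory purely so the example is self-contained (the pinned opset and IR versions are just for broad runtime compatibility):

```python
import numpy as np
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort

# Every ONNX Runtime build ships at least the generic CPU provider
print(ort.get_available_providers())

# A throwaway one-node model (y = Relu(x)), built in memory
inp = helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 3])
out = helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 3])
g = helper.make_graph([helper.make_node("Relu", ["x"], ["y"])],
                      "demo", [inp], [out])
model = helper.make_model(g, opset_imports=[helper.make_opsetid("", 13)])
model.ir_version = 8  # pin for compatibility with older runtimes

# Graph-level optimizations live in the open runtime itself
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Providers are a priority list; unavailable ones are skipped
sess = ort.InferenceSession(model.SerializeToString(), sess_options=opts,
                            providers=["CPUExecutionProvider"])
y = sess.run(None, {"x": np.array([[-1.0, 0.0, 2.0]], dtype=np.float32)})[0]
print(y)  # negative values clamped to zero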
How We Plug Into the Open Ecosystem
We designed our stack to fit into this open world.
We don't force you to learn a new, proprietary AI framework. Instead, we plug directly into the open standard to deliver peak performance.
The Hardware: Cervell NPU IP
Our foundation is the Cervell NPU IP, a state-of-the-art AI accelerator built on the open RISC-V architecture. It features highly configurable vector and tensor units, allowing our IP to be tailored for any application—from the 8-TOPS C8 for low-power edge devices to the 128-TOPS (INT4 @ 2 GHz) C32 for high-performance computing.
The Software Bridge: AKL and the ONNX-RT Execution Provider
This powerful hardware is unlocked by a simple, two-part software stack:
- The Aliado Kernel Library (AKL): This is our foundational library. Our engineers have developed meticulously hand-optimized, low-level kernels for every mathematical operation that can appear in an ONNX graph. The AKL is tuned to extract every cycle of performance from Cervell's tensor and vector units, far exceeding what a generic compiler could achieve.
- The ONNX Runtime Execution Provider: This is the simple, public-facing interface. When you ask ONNX-RT to run your model, our EP acts as the "bridge." It translates the standard operations in your ONNX graph into high-speed calls to our internal AKL.
This two-part approach provides the ultimate combination: a simple, open, and standard API for you, backed by the guaranteed bare-metal performance of our custom-tuned kernels.
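As a purely illustrative sketch (the real EP and AKL are proprietary, and every name below is hypothetical), the division of labor is essentially a dispatch table: the EP walks the standard ONNX graph and routes each op to a tuned kernel. Plain NumPy stands in for hand-optimized hardware kernels here:

```python
# Hypothetical sketch only -- not the real Semidynamics EP or AKL API.
import numpy as np

# Stand-ins for hand-tuned AKL kernels (here: plain NumPy)
def akl_relu(x):
    return np.maximum(x, 0.0)

def akl_matmul(a, b):
    return a @ b

# The EP's job, conceptually: map standard ONNX op names to kernels
KERNELS = {"Relu": akl_relu, "MatMul": akl_matmul}

def run_node(op_type, *inputs):
    # An op without a tuned kernel would fall back to another
    # provider (e.g. the CPU EP) rather than fail
    return KERNELS[op_type](*inputs)

y = run_node("Relu", np.array([-2.0, 3.0]))
print(y)  # [0. 3.]
```

The point of the sketch is that nothing in the application-facing layer is vendor-specific: the keys of the table are standard ONNX operator names.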
A Cleaner, Simpler Workflow
For your team, this changes the deployment story completely. You are no longer managing device-locked binary files, driver-version matrices, or separate compilation steps for every hardware target.
Your application, whether written in Python or C++, remains straightforward and hardware-agnostic. You simply load your .onnx model—which, as a reminder, is just a math blueprint and contains no CUDA code—and run it through a standard API.
Typical Deployment Snippet (Python):
```python
import onnxruntime as ort

# Define provider options (if any) for Semidynamics hardware
provider_options = [...]

# Create an inference session using our Execution Provider.
# Note: we are loading the portable .onnx file directly.
session = ort.InferenceSession(
    "your_model.onnx",
    providers=["SmdAccelerator"],
    provider_options=provider_options,
)

# Run the model using standard ONNX-RT calls
results = session.run(None, {"input_name": input_data})
```
Typical Deployment Snippet (C++):
```cpp
#include <onnxruntime_cxx_api.h>
#include <string>
#include <unordered_map>

// Set up the environment and session options, and append our EP
Ort::Env env;
Ort::SessionOptions session_options;
std::unordered_map<std::string, std::string> provider_options;  // EP options, if any
session_options.AppendExecutionProvider("SmdAccelerator", provider_options);

// Create the session from the same portable .onnx file
Ort::Session session(env, "your_model.onnx", session_options);

// Run inference using standard ONNX-RT API
auto results = session.Run(
    Ort::RunOptions{nullptr},
    input_names, &input_tensor, 1,
    output_names, 1);
```
This is the entire integration. Your application only speaks to the ONNX Runtime API. Our stack handles all the complex kernel optimization in the background.
The Future of AI Deployment
The direction of travel is clear: open standards for model interchange (ONNX), standard, extensible runtimes (ONNX-RT), and pluggable accelerators that you can adopt without rewriting your stack.
Your AI models are not welded to CUDA. They were never truly locked in. Once you treat the model as a portable blueprint instead of a vendor binary, you gain the freedom to choose the best hardware for each workload, to mix and match CPUs, GPUs, and NPUs, and to keep both performance and flexibility.
Your AI model is already free. It’s time your deployment strategy caught up.