
Deploying ML Models with Spring Boot and TensorFlow/ONNX

22 min read • MLOps, Spring Boot, ONNX

While Python rules the training phase of Machine Learning, Java and Spring Boot often rule the production backend. Deploying models directly in your Java application can reduce latency and simplify infrastructure. In this guide, we'll use the ONNX (Open Neural Network Exchange) format to bridge the gap.

Why Java for Inference?

Python services (Flask/FastAPI) are common for ML, but if your core business logic is in Spring Boot, adding a Python sidecar adds network latency and operational complexity. Running the model in-process (within the JVM) offers:

  • Zero Network Latency: No HTTP hop between app and model.
  • Simplified Ops: Single deployment artifact (JAR/Docker image).
  • Robust Concurrency: Leverage Java's mature threading model.

Exporting Models to ONNX

ONNX is an open standard for machine learning interoperability. Most frameworks (PyTorch, TensorFlow, Scikit-Learn) support exporting models to the `.onnx` format.

# Python: exporting a trained Scikit-Learn model to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# 'clf' is a fitted scikit-learn estimator trained on 4 features,
# which is why the input shape below is [None, 4] (any batch size, 4 columns).
initial_type = [('float_input', FloatTensorType([None, 4]))]
onx = convert_sklearn(clf, initial_types=initial_type)

with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())

Setting up ONNX Runtime in Java

Microsoft maintains a high-performance Java API for ONNX Runtime. Add the dependency:

<dependency>
    <groupId>com.microsoft.onnxruntime</groupId>
    <artifactId>onnxruntime</artifactId>
    <version>1.17.0</version>
</dependency>
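
Before wiring anything into Spring, it can help to confirm that the native runtime loads and that the exported graph's input name matches what you expect (it must be the name chosen at export time, 'float_input' above). Here is a minimal standalone check, assuming the exported file sits at model.onnx in the working directory:

import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

public class ModelInspector {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession("model.onnx", new OrtSession.SessionOptions())) {
            // Print the graph's declared input and output names.
            System.out.println("Inputs:  " + session.getInputNames());
            System.out.println("Outputs: " + session.getOutputNames());
        }
    }
}

If the printed input name differs from what your export script used, adjust the name passed to session.run() in the service below.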

Building the Inference Service

We wrap the ONNX session in a Spring @Service and load the model only once, when the bean is constructed at startup, so every request reuses the same session.

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

import java.nio.FloatBuffer;
import java.util.Map;

@Service
public class PredictionService implements AutoCloseable {

    private final OrtEnvironment env;
    private final OrtSession session;

    public PredictionService(@Value("${model.path}") String modelPath) throws OrtException {
        this.env = OrtEnvironment.getEnvironment();
        this.session = env.createSession(modelPath, new OrtSession.SessionOptions());
    }

    public float predict(float[] inputData) throws OrtException {
        // 1. Create the input tensor (shape [1, n], a single-row batch) and make sure
        //    it is closed after the run, since it holds native (off-heap) memory.
        try (OnnxTensor tensor = OnnxTensor.createTensor(
                env, FloatBuffer.wrap(inputData), new long[]{1, inputData.length});
             // 2. Run inference; the input name must match the one used at export time.
             OrtSession.Result result = session.run(Map.of("float_input", tensor))) {
            // The cast depends on your model's output type; a float tensor is assumed here.
            float[][] output = (float[][]) result.get(0).getValue();
            return output[0][0]; // Return the first prediction
        }
    }

    @Override
    public void close() throws OrtException {
        session.close();
        env.close();
    }
}
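
To expose the service over HTTP, a thin REST controller can delegate to it. The /predict route and the request/response records below are illustrative choices for this article, not something dictated by ONNX Runtime:

import ai.onnxruntime.OrtException;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PredictionController {

    private final PredictionService predictionService;

    public PredictionController(PredictionService predictionService) {
        this.predictionService = predictionService;
    }

    // Accepts a JSON body like {"features": [5.1, 3.5, 1.4, 0.2]} and returns the raw score.
    public record PredictionRequest(float[] features) {}
    public record PredictionResponse(float score) {}

    @PostMapping("/predict")
    public PredictionResponse predict(@RequestBody PredictionRequest request) throws OrtException {
        return new PredictionResponse(predictionService.predict(request.features()));
    }
}

Constructor injection hands the controller the single shared PredictionService bean, so the model is loaded exactly once per application instance.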

Security Considerations

Loading external binary files (models) into your application memory poses unique security risks. "Pickle bombs" are famous in Python, but ONNX files are generally safer as they are protobuf-based graphs. However, risks remain.

ML Model Security Checklist

  • Model Provenance: Only load models from trusted sources. Verify the SHA-256 checksum of the `.onnx` file before loading.
  • Input Validation: ML models are sensitive to malformed inputs (NaNs, infinities, extreme outliers). Validate all numerical inputs before tensor creation to prevent crashes or undefined behavior; a sketch covering this and the checksum check follows this list.
  • Resource Exhaustion: Models can be memory-intensive. Set strict heap limits and monitor off-heap memory usage (ONNX Runtime uses native memory).
  • Adversarial Attacks: Be aware that models can be tricked by subtly perturbed inputs. Implement rate limiting and anomaly detection on the input distribution.
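
As a rough illustration of the first two items, here is a minimal sketch of a hypothetical ModelIntegrity helper. The class name, the expected-checksum value, and the maxAbsValue bound are illustrative assumptions, not part of the ONNX Runtime API: it compares the model file's SHA-256 against a checksum you ship alongside the application, and rejects non-finite or implausibly large feature values before a tensor is created.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public final class ModelIntegrity {

    // Compare the model file's SHA-256 against a checksum shipped with the app
    // (e.g. a 'model.sha256' property); refuse to load the model on mismatch.
    public static void verifyChecksum(Path modelFile, String expectedSha256) throws Exception {
        byte[] bytes = Files.readAllBytes(modelFile);
        String actual = HexFormat.of().formatHex(
                MessageDigest.getInstance("SHA-256").digest(bytes));
        if (!actual.equalsIgnoreCase(expectedSha256)) {
            throw new IllegalStateException("Model checksum mismatch for " + modelFile);
        }
    }

    // Reject NaN, infinities and implausible magnitudes before tensor creation.
    public static void validateInput(float[] features, float maxAbsValue) {
        for (float v : features) {
            if (Float.isNaN(v) || Float.isInfinite(v) || Math.abs(v) > maxAbsValue) {
                throw new IllegalArgumentException("Invalid feature value: " + v);
            }
        }
    }
}

PredictionService could call verifyChecksum in its constructor and validateInput at the top of predict; where the expected checksum comes from (a config property, a signed manifest) is a deployment decision.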

Conclusion

Deploying ML models in Spring Boot using ONNX Runtime is a powerful pattern for high-performance, low-latency AI applications. It simplifies your architecture by removing the need for separate Python microservices for inference.

Written by the DevMetrix Team • Published December 11, 2025
