TensorRT-LLM
Overview
Jan offers TensorRT-LLM as an optional engine for faster inference on NVIDIA GPUs. The engine is powered by Cortex-TensorRT-LLM, an efficient C++ server that runs the TRT-LLM C++ runtime natively. It also adds features and performance improvements such as OpenAI API compatibility, tokenizer improvements, and request queues.
The TensorRT-LLM engine is currently available only on Windows; Linux support is coming soon!
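Because the engine exposes an OpenAI-compatible API through Jan's local server, you can also query a TensorRT-LLM model programmatically. The sketch below assumes the local API server is enabled and listening on `http://localhost:1337`, and uses a placeholder model ID; check Jan's Local API Server settings and the Hub for the actual values on your machine.

```bash
# Minimal sketch: chat completion against Jan's OpenAI-compatible local server.
# Assumptions: the local API server is enabled and listens on port 1337, and
# "llama3-8b-trt" is a placeholder for whichever TensorRT-LLM model you downloaded.
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b-trt",
    "messages": [{"role": "user", "content": "Hello from TensorRT-LLM!"}]
  }'
```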
Requirements
- NVIDIA GPU with Compute Capability 7.0 or higher (RTX 20xx series and above)
- Minimum 8GB VRAM (16GB+ recommended for larger models)
- Updated NVIDIA drivers
- CUDA Toolkit 11.8 or newer
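To confirm your system meets these requirements, you can check the GPU's compute capability, driver version, and CUDA toolkit version from a terminal. This is a rough sketch; note that the `compute_cap` query field is only available on reasonably recent NVIDIA drivers.

```bash
# Report GPU name, compute capability, and driver version.
# (The compute_cap field requires a recent driver; older drivers may not support it.)
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv

# Report the installed CUDA toolkit version (should be 11.8 or newer).
nvcc --version
```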
For a detailed setup guide, please see the Windows installation guide.
Enable TensorRT-LLM
Step 1: Install Additional Dependencies
- Navigate to Settings > Local Engine > TensorRT-LLM
- At Additional Dependencies, click Install
- Verify that files are correctly downloaded:
```bash
ls ~/jan/data/extensions/@janhq/tensorrt-llm-extension/dist/bin
```
Your extension folder should now include `cortex.exe`, among other artifacts needed to run TRT-LLM.
- Restart Jan
Step 2: Download Compatible Models
TensorRT-LLM can only run models in TensorRT format. These models, also known as "TensorRT engines", are prebuilt specifically for each operating system and GPU architecture.
We currently offer a selection of precompiled models optimized for NVIDIA Ampere and Ada GPUs that you can use right away:
- Go to the Hub
- Look for models with the TensorRT-LLM label and make sure they're compatible with your hardware
- Click Download
This download might take some time as TensorRT models are typically large files.
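If you want to double-check what was downloaded, the model files should land in Jan's data folder. The sketch below assumes the default data folder layout (`~/jan/data`, matching the extension path used above); the exact location can differ if you've moved your Jan data folder.

```bash
# List downloaded models. A TensorRT-LLM model ("engine") is typically a folder of
# prebuilt engine files plus tokenizer/config artifacts rather than a single file.
# Assumes the default Jan data folder (~/jan/data); adjust the path if you moved it.
ls ~/jan/data/models
```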
Step 3: Start Threads
Once your model is downloaded, you can start using it in Threads.