10xEngineers


AI INFRASTRUCTURE SOFTWARE · 10XENGINEERS

QuantX

Hardware-Aware Quantization for AI Inference IP Architects

Designing AI inference silicon requires critical decisions on numeric formats and architecture. QuantX brings hardware evaluation into the design loop, enabling early validation of correctness, accuracy, and performance before tape-out.

1B–14B+

Parameters — model range currently supported
*New models on the active roadmap

8+

Outlier reduction and rounding techniques supported
*More techniques on the roadmap

4

Custom numeric formats: MXFP · BFP · NVFP · FP16+INT
*Any arbitrary combination of numerics supported

PLATFORM OVERVIEW

What QuantX Does

QuantX is a hardware-aware design-space exploration platform. Unlike general-purpose post-training quantization tools designed for ML engineers, QuantX is designed for the upstream decision: choosing the number representation hard-wired into your datapath.

Design-Space Exploration

Sweep custom numeric formats (MXFP, BFP, NVFP, FP16+INT) against real LLM and VLM workloads. Identify the accuracy-efficiency Pareto frontier before RTL freeze.
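
As a rough illustration of what identifying a Pareto frontier means here, the sketch below keeps only the sweep candidates that no other candidate beats on both storage cost and perplexity. The `Candidate` type, the format labels, and every number are hypothetical placeholders, not QuantX output or measured results.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str               # hypothetical format label
    bits_per_weight: float  # storage cost, lower is better
    perplexity: float       # accuracy proxy, lower is better

def pareto_frontier(cands):
    """Return the candidates not dominated on (bits_per_weight, perplexity)."""
    frontier = []
    best_ppl = float("inf")
    # Sort by cost (ties broken by perplexity), then keep strict improvements.
    for c in sorted(cands, key=lambda c: (c.bits_per_weight, c.perplexity)):
        if c.perplexity < best_ppl:
            frontier.append(c)
            best_ppl = c.perplexity
    return frontier

# Illustrative values only.
candidates = [
    Candidate("fmt_a", 4.25, 8.9),
    Candidate("fmt_b", 4.50, 8.4),
    Candidate("fmt_c", 4.50, 9.1),  # dominated by fmt_b
    Candidate("fmt_d", 6.00, 8.5),  # dominated by fmt_b
    Candidate("fmt_e", 8.00, 7.9),
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
```

Only the frontier candidates are worth carrying into RTL discussions; everything else costs more storage for no accuracy gain.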

Hardware Validation Closure

Use QuantX as your golden reference model. Its software-simulated quantization is parameterised to match your hardware's numeric format and datapath behaviour exactly.
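
One way a golden reference is typically used is a bit-exact comparison of integer codes from the software model against a dump from RTL simulation. The sketch below shows that idea only; the function name and the sample data are assumptions for illustration and do not describe QuantX's actual interface.

```python
def first_mismatch(reference, device):
    """Return (index, ref_value, dev_value) for the first differing element,
    or None when the two dumps agree exactly."""
    if len(reference) != len(device):
        raise ValueError("length mismatch between reference and device dumps")
    for i, (r, d) in enumerate(zip(reference, device)):
        if r != d:
            return (i, r, d)
    return None

# Hypothetical quantized codes from the reference model and an RTL dump.
ref_codes = [3, -7, 0, 5]
rtl_codes = [3, -7, 1, 5]
result = first_mismatch(ref_codes, rtl_codes)
```

Reporting the first mismatch, rather than a pass/fail flag, makes it easier to trace a divergence back to a specific datapath stage.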

Inference SDK Deployment

Deploy QuantX-compressed models into your existing inference SDK — or engage 10xEngineers to build your inference SDK from the ground up, purpose-fitted to your hardware.

Automation That Replaces Manual Sweeps

QuantX's meta-optimization engine automatically allocates bit widths and selects the active algorithm combination based on your memory constraints. What previously took weeks now runs as a feed-forward pipeline.

NUMERIC FORMATS

Custom Numeric Format Support

QuantX supports four mainstream numeric format families (and customized combinations) being considered by major industry players for next-generation AI inference silicon.

Format: Description
FP16+INT: 16-bit FP scaling factors with integer-quantized data elements
MXFP: OCP Microscaling format, with power-of-two (PoT) shared-exponent blocks and FP data elements
NVFP: FP8 scale factors with FP4 data elements (NVIDIA's narrow-precision inference format)
BFP: Block Floating Point, with a PoT shared exponent and integer mantissas
Custom: any arbitrary combination of numerics
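
To make the BFP row concrete, here is a minimal sketch of block floating point: one power-of-two exponent shared by the block, plus a signed integer mantissa per element. The function names, the default mantissa width, and the rule for choosing the exponent are assumptions made for this example; real hardware may choose differently.

```python
import math

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block to signed integer mantissas with one shared exponent."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0] * len(block), 0
    qmax = 2 ** (mantissa_bits - 1) - 1  # e.g. 127 for 8 bits
    # Smallest power-of-two scale such that amax / scale fits in qmax.
    shared_exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** shared_exp
    mantissas = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return mantissas, shared_exp

def bfp_dequantize(mantissas, shared_exp):
    """Reconstruct real values from mantissas and the shared exponent."""
    scale = 2.0 ** shared_exp
    return [m * scale for m in mantissas]

block = [0.5, -1.0, 0.25, 0.75]
mantissas, shared_exp = bfp_quantize(block, mantissa_bits=4)
restored = bfp_dequantize(mantissas, shared_exp)
```

Because the scale is a power of two, dequantization in hardware reduces to a shift, which is the main appeal of the format.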

AUTOMATION ENGINE

Meta-Optimization: Automated Multi-Level Bit Allocation

Given user-defined constraints on model weight storage and peak inference memory, meta-optimization allocates bit widths and selects the algorithm combination that best satisfies those constraints — operating feed-forward without iterative search.

Level 1

Block-Level Importance Scoring

Level 2

Tensor-Level Refinement

Level 3

Format-Specific Algorithm Selection
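
The page does not disclose how the three levels are computed, so the following is only a sketch of what a feed-forward, budget-constrained allocation could look like: each block is visited once in order of importance and upgraded if the budget allows. The importance scores, the two bit widths, and the function name are all hypothetical.

```python
def allocate_bits(importance, params, budget_bits, low=4, high=8):
    """Single-pass allocation: every block starts at `low` bits, and the most
    important blocks are upgraded to `high` bits while the budget allows."""
    n = len(importance)
    bits = [low] * n
    spent = sum(p * low for p in params)
    if spent > budget_bits:
        raise ValueError("budget too small even at the lowest bit width")
    # Visit each block once, most important first; no search or backtracking.
    for i in sorted(range(n), key=lambda i: importance[i], reverse=True):
        extra = params[i] * (high - low)
        if spent + extra <= budget_bits:
            bits[i] = high
            spent += extra
    return bits

# Hypothetical importance scores and parameter counts for four blocks.
block_bits = allocate_bits(
    importance=[0.9, 0.1, 0.5, 0.3],
    params=[100, 100, 100, 100],
    budget_bits=2400,
)
```

A tensor-level refinement could reuse the same routine inside each block, with per-tensor scores and the block's own share of the budget.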


MODEL SUPPORT

Supported Models

QuantX supports a curated and expanding set of open-weight LLMs and VLMs in the 1B–14B parameter range — the most commercially relevant deployment targets for custom AI inference silicon.

Language Models (LLMs)
Llama 2, Llama 3.1, Llama 3.2, Qwen 2, Qwen 2.5, Qwen 3
Vision Language Models (VLMs)
Qwen 3 VL, Llava-next 1.6 7B, SmolVLM, SmolVLM2, Qwen 2.5 VL
Legacy & Generative Models
CLIP, OPT, Stable Diffusion 1.5, Stable Diffusion 3.5, Stable Diffusion XL

ARCHITECTURE

Modular Design: Built to Scale With Your Roadmap

QuantX is architected as a modular pipeline that decouples each stage of the quantization workflow. Adding a new numeric format, model, or evaluation metric requires extending a single module rather than refactoring the full pipeline.
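
A common way to get this kind of decoupling is a registry that pipeline stages look up by name, so a new format is one new function rather than a pipeline change. The sketch below illustrates the pattern only; `FORMAT_REGISTRY`, `register_format`, and `run_pipeline` are hypothetical names, not QuantX's API.

```python
FORMAT_REGISTRY = {}

def register_format(name):
    """Decorator that adds a quantizer to the registry under `name`."""
    def decorator(fn):
        FORMAT_REGISTRY[name] = fn
        return fn
    return decorator

@register_format("int8_per_tensor")
def quantize_int8(values):
    """Simulated symmetric INT8 quantization with one scale per tensor."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / 127 if amax else 1.0
    return [round(v / scale) * scale for v in values]

def run_pipeline(values, format_name):
    """The pipeline only knows the registry, not individual formats."""
    quantize = FORMAT_REGISTRY[format_name]
    return quantize(values)

simulated = run_pipeline([0.0, 127.0, -64.0], "int8_per_tensor")
```

Models and evaluation metrics can be registered the same way, which keeps each extension local to one module.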

The QuantX flow runs from HuggingFace model loading through meta-optimization, bit-width allocation, transformation selection, quantization (RTN / GPTQ), and evaluation.
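
For readers unfamiliar with RTN (round-to-nearest), the simplest quantizer in that flow, here is a minimal sketch in the FP16+INT style: one scale per group, rounded to half precision, and integer codes obtained by plain rounding. The group contents and helper names are assumptions for illustration; GPTQ, which adjusts the remaining weights to compensate for rounding error, is not shown.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value.
    struct raises OverflowError if x is outside the half-precision range."""
    return struct.unpack("e", struct.pack("e", x))[0]

def rtn_quantize(group, bits=4):
    """Symmetric round-to-nearest with a single FP16 scale for the group."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in group)
    # Fall back to 1.0 if the group is all zeros or the scale underflows.
    scale = to_fp16(amax / qmax) or 1.0
    codes = [max(-qmax - 1, min(qmax, round(x / scale))) for x in group]
    return codes, scale

def rtn_dequantize(codes, scale):
    return [c * scale for c in codes]

codes, scale = rtn_quantize([1.75, -0.5, 0.25, 0.8], bits=4)
approx = rtn_dequantize(codes, scale)
```

In this example only the last element (0.8) lands off the grid and is reconstructed as 0.75, which is the kind of per-element error that more elaborate algorithms try to reduce.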

Both language model evaluation (perplexity, ARC, GSM8K, etc.) and multimodal evaluation (TextVQA, MMBench, ChartQA, etc.) run natively, producing a comprehensive accuracy report in a single pipeline execution.
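
Perplexity, the first metric listed, is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming the per-token natural-log probabilities have already been collected from the model (the function name and the input are illustrative):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
```

Lower is better: a quantized model whose perplexity stays close to the unquantized value has kept most of its predictive quality.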

BENCHMARK RESULTS

Measured Accuracy Across Models & Formats

All results compare four quantized configurations (vanilla RTN, 10x Quant Strategy, QuantX Runtime, and QuantX Performance) against the full-precision unquantized baseline. Vanilla RTN is the accuracy floor; QuantX Performance is the best-effort configuration. No fine-tuning is applied in any result.

INT + FP Format — Wikitext-2 Perplexity (lower is better)

Unquantized (baseline): 9.757
Vanilla RTN: 11.405
10x Quant Strategy*: 10.844
QuantX Runtime: 10.595
QuantX Performance ★: 10.421

BFP Format — Wikitext-2 Perplexity (lower is better)

Unquantized (baseline): 7.213
Vanilla RTN: 8.515
10x Quant Strategy*: 7.577
QuantX Runtime: 7.730
QuantX Performance ★: 7.510

ARC Challenge (higher is better)

Unquantized: 55.80
Vanilla RTN: 51.62
10x Quant Strategy*: 53.8
QuantX Runtime: 52.9
QuantX Performance ★: 54.6

GSM8K (higher is better)

Unquantized: 75.5
Vanilla RTN: 66.0
10x Quant Strategy*: 72.0
QuantX Runtime: 72.5
QuantX Performance ★: 76.0

INT + FP Format — TextVQA (higher is better)

Unquantized (baseline): 80.82
Vanilla RTN: 73.88
10x Quant Strategy*: 75.90
QuantX Runtime: 75.98
QuantX Performance ★: 77.70

BFP Format — MMBench (higher is better)

Unquantized: 86.08
Vanilla RTN: 82.22
10x Quant Strategy*: 84.88
QuantX Runtime: 84.79
QuantX Performance ★: 83.50

ChartQA (higher is better)

Unquantized: 76.20
Vanilla RTN: 55.53
10x Quant Strategy*: 73.53
QuantX Runtime: 76.06
QuantX Performance ★: 74.01

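Key takeaway: On every benchmark shown, all three 10x/QuantX configurations outperform vanilla RTN. On six of the seven benchmarks, QuantX Performance recovers more than 50% of the accuracy gap between vanilla RTN and the unquantized baseline; MMBench, at roughly 33%, is the exception, and on GSM8K QuantX Performance slightly exceeds the baseline.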
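
The gap recovery referred to here can be computed directly from the figures listed above: the share of the distance between vanilla RTN and the unquantized baseline that a configuration closes. The sketch below applies that definition to the QuantX Performance numbers; the function and dictionary names are ours, and the definition itself is an assumption about how the takeaway was derived.

```python
def gap_recovered(baseline, rtn, quantx):
    """Fraction of the RTN-to-baseline gap closed by a configuration.
    The sign cancels, so it works for lower-is-better and higher-is-better."""
    return (rtn - quantx) / (rtn - baseline)

# (unquantized baseline, vanilla RTN, QuantX Performance), copied from above.
results = {
    "Wikitext-2 (INT + FP)": (9.757, 11.405, 10.421),
    "Wikitext-2 (BFP)": (7.213, 8.515, 7.510),
    "ARC Challenge": (55.80, 51.62, 54.6),
    "GSM8K": (75.5, 66.0, 76.0),
    "TextVQA (INT + FP)": (80.82, 73.88, 77.70),
    "MMBench (BFP)": (86.08, 82.22, 83.50),
    "ChartQA": (76.20, 55.53, 74.01),
}
recovery = {name: gap_recovered(*vals) for name, vals in results.items()}
above_half = sorted(name for name, r in recovery.items() if r > 0.5)
```

A value above 1.0 means the quantized configuration scored better than the unquantized baseline on that benchmark.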


CASE STUDY

QuantX on Tenstorrent Hardware

QuantX-generated quantization settings deployed on Tenstorrent N300 silicon — measured against TT default published model settings across accuracy, storage, and throughput.

Llama 3.1 8B
Top-1 accuracy (%), TT default: 85.2
Top-1 accuracy (%), QuantX Runtime: 93.8
Top-1 accuracy (%), QuantX Performance: 95.6
Storage (Weight+KV) in GB, QuantX Runtime / QuantX Performance: 8.45 / 7.1
Tokens/sec, TT default / QuantX Runtime / QuantX Performance: 13 / 12.4 / -
QuantX Runtime: 8+ percentage-point top-1 accuracy improvement, at the cost of a 34% DRAM storage increase over TT default.
QuantX Performance: 10+ percentage-point accuracy improvement at only 12% more DRAM storage than TT default.

Qwen 2.5 14B
Top-1 accuracy (%), TT default: 89.6
Top-1 accuracy (%), QuantX Runtime: 94.4
Top-1 accuracy (%), QuantX Performance: 93.6
Storage (Weight+KV) in GB, QuantX Runtime / QuantX Performance: 15 / 10.8
Tokens/sec, TT default / QuantX Runtime / QuantX Performance: 8.5 / 8.14 / -
QuantX Runtime: roughly 5 percentage-point top-1 accuracy improvement, at the cost of a 31% DRAM storage increase over TT default.
QuantX Performance: 4 percentage-point accuracy improvement and 5% less storage than TT default.

TT default: the 'performance' published model settings.
QuantX Runtime: a QuantX-generated setting that runs without changes in Tenstorrent software.
QuantX Performance: a QuantX-generated setting that requires a custom dequant kernel in TT-Metalium.

** Results were obtained with the TT inference test published online (prefill 512 tokens, generation 511 tokens), run on an N300 in a single-user setting. TT default refers to the 'performance' published model settings in v0.63.0 of tt-metal.

QuantX + Baltoro: The Complete AI Inference Stack

QuantX does not operate in isolation. When combined with Baltoro — 10xEngineers’ RISC-V-first AI compiler stack built on MLIR — QuantX-compressed models can be lowered all the way to optimised machine code for custom silicon.

FAQ

Frequently Asked Questions

What is QuantX?

QuantX is a hardware-aware quantization design-space exploration platform developed by 10xEngineers. It is purpose-built for AI inference IP architects who need to evaluate custom numeric formats (including MXFP, BFP, INT8, and NVFP) against real LLMs and VLMs before committing to silicon tape-out. It also serves as a golden reference model for hardware validation and supports inference SDK deployment.

How is QuantX different from tools like GPTQ, AutoAWQ, and bitsandbytes?

Tools like GPTQ, AutoAWQ, and bitsandbytes are designed for ML engineers deploying models on existing hardware with fixed numeric formats. QuantX is designed for an earlier and distinct decision: which numeric format to implement in new silicon. It natively supports custom and non-standard formats (BFP, NVFP) and provides hardware validation closure capabilities that deployment tools do not offer.

Which numeric formats does QuantX support?

QuantX supports FP16 scale with INT elements, MXFP (power-of-two scale with FP elements per the OCP Microscaling standard), NVFP (FP8 scale + FP4 elements), and BFP (Block Floating Point with INT elements). Additional formats can be integrated through QuantX’s modular pipeline architecture.
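
To illustrate the MXFP description above (a power-of-two scale shared by a block of FP elements), here is a minimal sketch using FP4 (E2M1) elements. The scale rule follows our reading of the OCP Microscaling approach, aligning the block maximum with the element format's largest exponent. The function names, the tie-breaking, and the free block size are simplifications for this example, not a conformant implementation.

```python
import math

# Magnitudes representable in FP4 (E2M1).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest binade (4.0 to 6.0)

def to_fp4(v):
    """Round to the nearest E2M1 value, saturating at +/-6.
    Ties go to the smaller magnitude here, a simplification."""
    mag = min(abs(v), 6.0)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, v)

def mx_quantize(block):
    """Return FP4 elements plus the shared power-of-two exponent."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 0
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    return [to_fp4(x / scale) for x in block], shared_exp

elements, shared_exp = mx_quantize([2.0, -12.0, 4.4, 24.0])
restored = [e * 2.0 ** shared_exp for e in elements]
```

In this example 4.4 is reconstructed as 4.0: small elements in a block with a large maximum lose precision, which is why block size and outlier handling matter for these formats.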

Which models does QuantX support?

QuantX supports Llama 2, Llama 3.1, and Llama 3.2; Qwen 2, Qwen 2.5, Qwen 3, Qwen 2.5 VL, and Qwen 3 VL; Llava-next 1.6 7B; SmolVLM and SmolVLM2; and legacy models including CLIP, OPT, and Stable Diffusion 1.5 / 3.5 / XL. 10xEngineers adds approximately one new model per month.

What is meta-optimization?

Meta-optimization is QuantX’s automated multi-level bit allocation and algorithm selection engine. Given user constraints on model weight storage and peak inference memory, it allocates bit widths first at the transformer block level using importance scores, then refines them at the per-tensor level within attention mechanisms. At the same time, it selects which quantization algorithms to activate based on the target numeric format.

Can QuantX serve as a golden reference model for hardware validation?

Yes. QuantX’s software-simulated quantization model is parameterised to match a specific hardware numeric format and datapath behaviour, and serves as the reference against which RTL simulations and physical silicon measurements are compared. 10xEngineers provides full-stack support for this workflow.

Does 10xEngineers offer inference SDK deployment services?

Yes. Beyond the QuantX platform, 10xEngineers offers two deployment services: integration of QuantX-compressed models into an existing inference SDK, and full inference SDK development from scratch tailored to the target hardware. Both draw on full-stack expertise across ML compilers, runtime systems, and hardware-software co-design.

Talk to the QuantX Team

If you are architecting custom AI inference silicon and want to validate your numeric format decisions before tape-out — or if you need a golden reference model, a hardware validation partner, or an inference SDK — let’s talk.
