AI INFRASTRUCTURE SOFTWARE · 10XENGINEERS

QuantX

Hardware-Aware Quantization for AI Inference IP Architects

Designing AI inference silicon requires critical decisions on numeric formats and architecture. QuantX brings hardware evaluation into the design loop, enabling early validation of correctness, accuracy, and performance before tape-out.

1B–14B+

Parameters — model range currently supported
*New models on the active roadmap

8+

Outlier reduction and rounding techniques supported
*More techniques on the roadmap

4

Custom numeric formats: MXFP · BFP · NVFP · FP16+INT
*Any arbitrary combination of numerics supported

PLATFORM OVERVIEW

What QuantX Does

QuantX is a hardware-aware design-space exploration platform. Unlike general-purpose post-training quantization tools designed for ML engineers, QuantX is designed for the upstream decision: choosing the number representation hard-wired into your datapath.

Design-Space Exploration

Sweep custom numeric formats (MXFP, BFP, NVFP, FP16+INT) against real LLM and VLM workloads. Identify the accuracy-efficiency Pareto frontier before RTL freeze.
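As a rough illustration of what identifying a Pareto frontier means in a format sweep, the sketch below keeps only the candidates that no other candidate beats on both cost and accuracy. The `Candidate` type, the format labels, and every number are made up for the example; this is not QuantX's API or measured data.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str               # hypothetical format label
    bits_per_weight: float  # storage cost, lower is cheaper
    perplexity: float       # accuracy proxy, lower is better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that are not dominated on (cost, perplexity)."""
    frontier = []
    best_ppl = float("inf")
    # Walk from cheapest to most expensive; a candidate survives only if it
    # is strictly more accurate than everything cheaper (or equally cheap).
    for c in sorted(candidates, key=lambda c: (c.bits_per_weight, c.perplexity)):
        if c.perplexity < best_ppl:
            frontier.append(c)
            best_ppl = c.perplexity
    return frontier


# Illustrative sweep results (invented numbers).
sweep = [
    Candidate("fmt_a", 4.25, 10.9),
    Candidate("fmt_b", 4.50, 10.4),
    Candidate("fmt_c", 4.50, 10.8),
    Candidate("fmt_d", 6.00, 10.5),
    Candidate("fmt_e", 8.50, 9.9),
]
frontier_names = [c.name for c in pareto_frontier(sweep)]
```

Here `fmt_c` and `fmt_d` drop out because `fmt_b` is at least as cheap and more accurate than both.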

Hardware Validation Closure

Use QuantX as your golden reference model. Its software-simulated quantization is parameterised to match your hardware's numeric format and datapath behaviour exactly.

Inference SDK Deployment

Deploy QuantX-compressed models into your existing inference SDK — or engage 10xEngineers to build your inference SDK from the ground up, purpose-fitted to your hardware.

Automation That Replaces Manual Sweeps

QuantX's meta-optimization engine automatically allocates bit widths and selects the active algorithm combination based on your memory constraints. What previously took weeks now runs as a feed-forward pipeline.

NUMERIC FORMATS

Custom Numeric Format Support

QuantX supports four mainstream numeric format families (and customized combinations) being considered by major industry players for next-generation AI inference silicon.

Format	Description
FP16+INT	16-bit FP scaling factors and integer-quantized data elements
MXFP	OCP Microscaling format — Power-of-Two shared exponent blocks with FP data elements
NVFP	FP8 scale factors with FP4 data elements — NVIDIA's narrow-precision inference format
BFP	Block Floating Point — PoT shared exponent with integer mantissas
Custom	Any arbitrary combination of numerics supported
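To make the BFP row concrete, here is a minimal sketch of block floating point quantization: one shared power-of-two exponent per block, with signed integer mantissas. The function names, the block contents, and the 4-bit mantissa width are assumptions chosen for illustration, not QuantX's implementation or any specific hardware's rounding behaviour.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block to (shared power-of-two exponent, integer mantissas)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    qmax = 2 ** (mantissa_bits - 1) - 1
    # Smallest power-of-two scale for which the largest element fits in qmax.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    # Python's round() is round-half-to-even; real hardware may round differently.
    mantissas = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return exponent, mantissas


def bfp_dequantize(exponent, mantissas):
    """Reconstruct real values from the shared exponent and mantissas."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]


block = [0.5, -1.0, 0.3, 0.75]
exp, mant = bfp_quantize(block, mantissa_bits=4)
restored = bfp_dequantize(exp, mant)
```

With a 4-bit mantissa the shared scale is 2^-2, so 0.3 is stored as mantissa 1 and comes back as 0.25; the other three values are representable exactly.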

AUTOMATION ENGINE

Meta-Optimization: Automated Multi-Level Bit Allocation

Given user-defined constraints on model weight storage and peak inference memory, meta-optimization allocates bit widths and selects the algorithm combination that best satisfies those constraints — operating feed-forward without iterative search.

Level 1: Block-Level Importance Scoring

Level 2: Tensor-Level Refinement

Level 3: Format-Specific Algorithm Selection
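One way to picture a feed-forward, budget-driven allocation at the block level is a single pass over blocks in descending importance, upgrading each to a wider bit width while the storage budget still allows it. The sketch below is only an assumption-laden illustration: the scoring values, the two-width choice (4 and 8 bits), and the function name are hypothetical, and QuantX's actual algorithm is not described here.

```python
def allocate_bits(blocks, budget_bits, low=4, high=8):
    """blocks: list of (name, importance, num_params). Returns {name: bits}."""
    alloc = {name: low for name, _, _ in blocks}
    used = sum(n * low for _, _, n in blocks)
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest bit width")
    # Single pass, most important block first; no accuracy re-evaluation loop.
    for name, _, n in sorted(blocks, key=lambda b: b[1], reverse=True):
        extra = n * (high - low)
        if used + extra <= budget_bits:
            alloc[name] = high
            used += extra
    return alloc


# Hypothetical importance scores and parameter counts.
blocks = [
    ("block0", 0.9, 1000),
    ("block1", 0.2, 1000),
    ("block2", 0.6, 1000),
    ("block3", 0.4, 1000),
]
alloc = allocate_bits(blocks, budget_bits=24000)
```

With a 24,000-bit budget, the two highest-scoring blocks are stored at 8 bits and the rest stay at 4 bits.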

MODEL SUPPORT

Supported Models

QuantX supports a curated and expanding set of open-weight LLMs and VLMs in the 1B–14B parameter range — the most commercially relevant deployment targets for custom AI inference silicon.

Language Models (LLMs)

Llama 2 · Llama 3.1 · Llama 3.2 · Qwen 2 · Qwen 2.5 · Qwen 3

Vision Language Models (VLMs)

Qwen 3 VL · Llava-next 1.6 7B · SmolVLM · SmolVLM2 · Qwen 2.5 VL

Legacy & Generative Models

CLIP · OPT · Stable Diffusion 1.5 · Stable Diffusion 3.5 · Stable Diffusion XL

ARCHITECTURE

Modular Design: Built to Scale With Your Roadmap

QuantX is architected as a modular pipeline that decouples each stage of the quantization workflow. Adding a new numeric format, model, or evaluation metric requires extending a single module rather than refactoring the full pipeline.
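A common way to get this kind of single-module extensibility is a registry: each format registers itself under a name, and the pipeline only looks the name up. The sketch below shows the pattern in general terms; `FORMATS`, `register_format`, `run_pipeline`, and the toy `int4_toy` quantizer are hypothetical names, not QuantX internals.

```python
from typing import Callable, Dict, List

Quantizer = Callable[[List[float]], List[float]]
FORMATS: Dict[str, Quantizer] = {}


def register_format(name: str):
    """Decorator that adds a quantizer to the registry under `name`."""
    def wrap(fn: Quantizer) -> Quantizer:
        FORMATS[name] = fn
        return fn
    return wrap


@register_format("int4_toy")
def int4_toy(values: List[float]) -> List[float]:
    # Toy symmetric 4-bit rounding with one scale per list; illustrative only.
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 or 1.0
    return [round(v / scale) * scale for v in values]


def run_pipeline(format_name: str, values: List[float]) -> List[float]:
    # The pipeline resolves the format by name and needs no per-format changes.
    return FORMATS[format_name](values)


out = run_pipeline("int4_toy", [0.7, -0.3, 0.0])
```

Adding another format in this pattern means writing one more decorated function; `run_pipeline` stays untouched.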

The QuantX flow runs from HuggingFace model loading through meta-optimization, bit-width allocation, transformation selection, quantization (RTN / GPTQ), and evaluation.
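For readers unfamiliar with RTN (round-to-nearest), the sketch below shows it for the FP16+INT family: one FP16 scale per group and signed integer codes for the elements. The group contents, the symmetric scheme, and the function names are assumptions for illustration; GPTQ and QuantX's own transformations are not shown.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Raises OverflowError for magnitudes beyond the FP16 range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rtn_quantize_group(group, bits=4):
    """Symmetric round-to-nearest: one FP16 scale plus integer codes per group."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for x in group)
    scale = to_fp16(max_abs / qmax)
    if scale == 0.0:  # all-zero group, or a scale that underflows in FP16
        scale = 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in group]
    return scale, codes


def dequantize_group(scale, codes):
    """Map integer codes back to real values using the group scale."""
    return [c * scale for c in codes]


group = [1.75, -0.6, 0.25, 1.0]
scale, codes = rtn_quantize_group(group, bits=4)
restored = dequantize_group(scale, codes)
```

In this example the scale is 0.25, so -0.6 rounds to code -2 and is restored as -0.5; that rounding error is what the more advanced techniques aim to reduce.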

Both language model evaluation (perplexity, ARC, GSM8K, etc.) and multimodal evaluation (TextVQA, MMBench, ChartQA, etc.) run natively, producing a combined accuracy report in a single pipeline execution.

BENCHMARK RESULTS

Measured Accuracy Across Models & Formats

Each benchmark compares four quantized configurations against the full-precision, unquantized baseline. Vanilla RTN is the accuracy floor; QuantX Performance is the best-effort configuration. No result involves fine-tuning.

INT + FP Format — Wikitext-2 Perplexity (lower is better)

Configuration	Wikitext-2 perplexity
Unquantized (baseline)	9.757
Vanilla RTN	11.405
10x Quant Strategy*	10.844
QuantX Runtime	10.595
QuantX Performance ★	10.421

BFP Format — Wikitext-2 Perplexity (lower is better)

Configuration	Wikitext-2 perplexity
Unquantized (baseline)	7.213
Vanilla RTN	8.515
10x Quant Strategy*	7.577
QuantX Runtime	7.730
QuantX Performance ★	7.510

ARC Challenge (higher is better)

Configuration	ARC Challenge score
Unquantized	55.80
Vanilla RTN	51.62
10x Quant Strategy*	53.8
QuantX Runtime	52.9
QuantX Performance ★	54.6

GSM8K (higher is better)

Configuration	GSM8K score
Unquantized	75.5
Vanilla RTN	66.0
10x Quant Strategy*	72.0
QuantX Runtime	72.5
QuantX Performance ★	76.0

INT + FP Format — TextVQA (higher is better)

Configuration	TextVQA score
Unquantized (baseline)	80.82
Vanilla RTN	73.88
10x Quant Strategy*	75.90
QuantX Runtime	75.98
QuantX Performance ★	77.70

BFP Format — MMBench (higher is better)

Configuration	MMBench score
Unquantized	86.08
Vanilla RTN	82.22
10x Quant Strategy*	84.88
QuantX Runtime	84.79
QuantX Performance ★	83.50

ChartQA (higher is better)

Configuration	ChartQA score
Unquantized	76.20
Vanilla RTN	55.53
10x Quant Strategy*	73.53
QuantX Runtime	76.06
QuantX Performance ★	74.01

Key takeaway: Across all four model-format combinations, QuantX consistently outperforms vanilla RTN. On six of the seven benchmarks listed above, QuantX Performance recovers more than 50% of the accuracy that vanilla RTN loses relative to the baseline; the exception is MMBench, at roughly 33%.
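The gap-recovery figure can be reproduced from the numbers listed above. The sketch below assumes "recovered gap" means the share of the vanilla RTN loss (relative to the unquantized baseline) that QuantX Performance wins back; the function name and dictionary keys are ours.

```python
def gap_recovered(baseline, rtn, quantx):
    """Fraction of the RTN-to-baseline gap closed by the QuantX result.

    The sign cancels, so the same formula works for lower-is-better
    metrics (perplexity) and higher-is-better metrics (accuracy).
    """
    return (rtn - quantx) / (rtn - baseline)


# (baseline, vanilla RTN, QuantX Performance), copied from the results above.
results = {
    "INT+FP Wikitext-2": (9.757, 11.405, 10.421),
    "BFP Wikitext-2": (7.213, 8.515, 7.510),
    "ARC Challenge": (55.80, 51.62, 54.6),
    "GSM8K": (75.5, 66.0, 76.0),
    "INT+FP TextVQA": (80.82, 73.88, 77.70),
    "BFP MMBench": (86.08, 82.22, 83.50),
    "ChartQA": (76.20, 55.53, 74.01),
}
recovered_pct = {
    name: round(100 * gap_recovered(*values), 1)
    for name, values in results.items()
}
```

A value above 100, as on GSM8K, means the quantized configuration scored higher than the unquantized baseline on that benchmark.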

THE CONSOLE BEHIND THE NUMBERS

Dashboard: Monitor all jobs at a glance (total, running, completed, and failed), with per-job chain status across optimization and evaluation stages.

Step 1: Model Selection. Choose from a curated library of LLMs and VLMs. Browse by model family and select the exact checkpoint to optimize.

Step 2: Optimization Mode. Select Single Precision or Meta-Optimization. Configure weight and activation formats independently: INT4 through INT8, MXFP variants, NVFP4, and BFP families.

Evaluation Results: Completed jobs surface evaluation metrics (Wikitext-2 perplexity, ARC, GSM8K, TextVQA) alongside downloadable artifact packages for downstream integration.

CASE STUDY

QuantX on Tenstorrent Hardware

QuantX-generated quantization settings deployed on Tenstorrent N300 silicon — measured against TT default published model settings across accuracy, storage, and throughput.

Llama 3.1 8B

Metric	TT default	QuantX Runtime	QuantX Performance
Accuracy top 1%	85.2	93.8	95.6
Storage (Weight+KV), GB	(not listed)	8.45	7.1
Tokens/sec	13	12.4	-

→ 8%+ accuracy improvement (top 1%) with QuantX Runtime at the cost of 34% DRAM storage increase over TT default

→ QuantX Performance achieves 10%+ accuracy improvement at only 12% more DRAM storage over TT default

Qwen 2.5 14B

Metric	TT default	QuantX Runtime	QuantX Performance
Accuracy top 1%	89.6	94.4	93.6
Storage (Weight+KV), GB	(not listed)	15	10.8
Tokens/sec	8.5	8.14	-

→ ~5% accuracy improvement (top 1%) with QuantX Runtime at the cost of 31% DRAM storage increase over TT default

→ QuantX Performance achieves 4% accuracy and 5% storage improvement over TT default

TT default: the 'performance' published model settings, v0.63.0 of tt-metal.

QuantX Runtime: QuantX-generated setting that runs without changes in Tenstorrent software.

QuantX Performance: QuantX-generated setting that requires a custom dequantization kernel in TT-Metalium.

** Results use the TT inference test published online (prefill 512 tokens, generation 511 tokens), run on an N300 in a single-user setting.
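The throughput cost of the QuantX Runtime settings can be read off the tokens/sec figures above with a simple relative-change calculation. The helper name is ours; the inputs are the case-study numbers, taking the first tokens/sec value as TT default and the second as QuantX Runtime.

```python
def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old


# Tokens/sec: (TT default, QuantX Runtime) from the case study above.
llama_31_8b = round(pct_change(12.4, 13.0), 1)
qwen_25_14b = round(pct_change(8.14, 8.5), 1)
```

Both values come out negative (about -4.6 and -4.2), that is, a throughput reduction of under 5% on each model in this test.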

QuantX + Baltoro: The Complete AI Inference Stack

QuantX does not operate in isolation. When combined with Baltoro — 10xEngineers’ RISC-V-first AI compiler stack built on MLIR — QuantX-compressed models can be lowered all the way to optimised machine code for custom silicon.

FAQ

Frequently Asked Questions

What is QuantX?

QuantX is a hardware-aware quantization design-space exploration platform developed by 10xEngineers. It is purpose-built for AI inference IP architects who need to evaluate custom numeric formats — including MXFP, BFP, INT8, and NVFP — against real LLMs and VLMs before committing to silicon tape-out. It also serves as a golden reference model for hardware validation and supports inference SDK deployment.

How is QuantX different from GPTQ, AutoAWQ, or bitsandbytes?

Tools like GPTQ, AutoAWQ, and bitsandbytes are designed for ML engineers deploying models on existing hardware with fixed numeric formats. QuantX is designed for an earlier and distinct decision: which numeric format to implement in new silicon. It natively supports custom and non-standard formats (BFP, NVFP) and provides hardware validation closure capabilities that deployment tools do not offer.

What numeric formats does QuantX support?

QuantX supports FP16 scale with INT elements, MXFP (Power-of-Two scale with FP elements per the OCP Microscaling standard), NVFP (FP8 scale + FP4 elements), and BFP (Block Floating Point with INT elements). Additional formats can be integrated through QuantX’s modular pipeline architecture.

What AI models does QuantX support?

QuantX supports Llama 2, Llama 3.1, and Llama 3.2; Qwen 2, Qwen 2.5, Qwen 2.5 VL, Qwen 3, and Qwen 3 VL; Llava-next 1.6 7B; SmolVLM and SmolVLM2; and legacy models including CLIP, OPT, and Stable Diffusion 1.5 / 3.5 / XL. 10xEngineers adds approximately one new model per month.

What is meta-optimization and how does it work?

Meta-optimization is QuantX’s automated multi-level bit allocation and algorithm selection engine. Given user constraints on model weight storage and peak inference memory, it allocates bit widths first at the transformer block level using importance scores, then refines at the per-tensor level within attention mechanisms — simultaneously selecting which quantization algorithms to activate based on the target numeric format.

Can QuantX serve as a golden reference model for hardware validation?

Yes. QuantX’s software-simulated quantization model is parameterised to match a specific hardware numeric format and datapath behaviour, and serves as the reference against which RTL simulations and physical silicon measurements are compared. 10xEngineers provides full-stack support for this workflow.
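In practice, a golden-reference check often boils down to comparing the software model's outputs with what the RTL simulation or silicon produced, element by element. The sketch below is a generic illustration of that comparison; the function name, the integer codes, and the tolerance parameter are assumptions, not part of QuantX or any Tenstorrent tooling.

```python
def compare_to_reference(reference, dut, tolerance=0):
    """Return indices where device-under-test output deviates from the reference."""
    if len(reference) != len(dut):
        raise ValueError("length mismatch between reference and DUT outputs")
    return [
        i
        for i, (r, d) in enumerate(zip(reference, dut))
        if abs(r - d) > tolerance
    ]


golden = [12, -3, 7, 0, 5]   # integer codes from the software model (made up)
rtl_sim = [12, -3, 6, 0, 5]  # codes dumped from an RTL simulation (made up)

mismatches = compare_to_reference(golden, rtl_sim)
within_one_lsb = compare_to_reference(golden, rtl_sim, tolerance=1)
```

A tolerance of 0 asks for a bit-exact match, which flags index 2 here; allowing one least-significant bit of slack reports no mismatches.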

Does 10xEngineers build inference SDKs?

Yes. Beyond the QuantX platform, 10xEngineers offers two deployment services: integration of QuantX-compressed models into an existing inference SDK, and full inference SDK development from scratch tailored to the target hardware — drawing on full-stack expertise across ML compilers, runtime systems, and hardware-software co-design.

Talk to the QuantX Team

If you are architecting custom AI inference silicon and want to validate your numeric format decisions before tape-out — or if you need a golden reference model, a hardware validation partner, or an inference SDK — let’s talk.
