How to Compile a Hugging Face Model to Use It in Foundry Local

From Hugging Face to Foundry Local: A Step-by-Step Guide with Microsoft Olive

With modern developer machines increasingly equipped with dedicated GPUs or NPUs (Neural Processing Units), running advanced AI models locally is no longer experimental — it is practical.

Microsoft recently introduced Azure AI Foundry Local, a public preview CLI designed to enable on-device AI inferencing. Combined with Olive, Microsoft’s hardware-aware model optimization toolkit, we now have a streamlined and production-aligned workflow for preparing, optimizing, and deploying large language models locally.

In this article, I will demonstrate how to optimize and deploy Phi-3-mini-4k-instruct for local inference using Olive and Foundry Local.

Why Hardware-Aware Optimization Matters

Hardware-aware model optimization ensures that machine learning models are tuned specifically for the target compute environment — CPU, GPU, or NPU — while respecting constraints such as:

  • Latency
  • Throughput
  • Memory footprint
  • Accuracy retention

However, this is non-trivial:

  • Each hardware vendor exposes different toolchains.
  • Aggressive compression (e.g., INT4 quantization) can degrade model quality.
  • Hardware ecosystems evolve rapidly, requiring adaptable optimization workflows.

This is precisely where Olive provides value.

Olive: Hardware-Aware Optimization for ONNX Runtime

Olive is Microsoft’s end-to-end model optimization framework designed to work seamlessly with ONNX Runtime. It composes multiple optimization techniques — compression, graph transformation, quantization, and compilation — into an automated workflow.

Instead of manually experimenting with dozens of techniques, Olive uses search strategies and evaluation passes to generate the most efficient model for a given hardware target.

The Olive Optimization Workflow

Olive operates through a structured pipeline of optimization passes, including:

  • Graph capture
  • Graph-level optimization
  • Transformer-specific optimizations
  • Hardware-dependent tuning
  • Quantization (GPTQ, AWQ, etc.)
  • Runtime artifact generation

Each pass exposes tunable parameters. Olive evaluates model quality and performance and selects the best configuration based on defined constraints.

Step 1 — Environment Setup

For this walkthrough, I recommend Python 3.10. Although newer versions exist, Python 3.10 remains the most stable in practice for Olive-based optimization workflows.

py -3.10 -m venv .venv_phi3
.venv_phi3\Scripts\activate
python -m pip install --upgrade pip
pip install -r requirements.txt
# requirements.txt
transformers==5.1.0
onnx==1.20.1
onnx-ir==0.1.15
onnxruntime==1.23.2
onnxruntime-genai==0.11.4
onnxscript==0.6.0
optimum==2.1.0
olive==0.2.11
olive-ai==0.11.0
torch==2.10.0
torchmetrics==1.8.2
tabulate==0.9.0
tokenizers==0.22.2
requests==2.32.5
urllib3==2.6.3
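Once the packages are installed, a quick import check confirms that the ONNX Runtime stack is usable from the new virtual environment. This is a minimal sketch; the check_env.py file name is mine:

# check_env.py -- quick sanity check of the optimization environment
import onnx
import onnxruntime as ort
import torch

import onnxruntime_genai  # noqa: F401 -- import check only

print("onnx        :", onnx.__version__)
print("onnxruntime :", ort.__version__)
print("torch       :", torch.__version__)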

Step 2 — Authenticate and Download the Model

We will use the instruct-tuned version of Microsoft’s 3.8B-parameter Phi model: Phi-3-mini-4k-instruct

Begin by authenticating with your Hugging Face account:

hf auth login

Then download the model locally before proceeding. Although Olive can retrieve the model directly from Hugging Face during execution, explicitly downloading it in advance is recommended to ensure reproducibility, version control, and greater transparency within the optimization workflow.

hf download microsoft/Phi-3-mini-4k-instruct --local-dir hf-models/Phi-3-mini-4k-instruct
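If you prefer to script the download, the huggingface_hub Python API (installed as a dependency of transformers) achieves the same result as the CLI call above. A minimal sketch; the download_model.py file name is mine:

# download_model.py -- scripted alternative to `hf download`
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct",
    local_dir="hf-models/Phi-3-mini-4k-instruct",
)
print("Model downloaded to:", local_dir)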

In the following section, I outline three equivalent approaches for preparing a Hugging Face model for deployment with Microsoft Foundry Local:

  1. Automatic optimization using olive auto-opt
  2. Configuration-driven execution using olive run --config
  3. Step-by-step execution using dedicated CLI commands (olive capture-onnx-graph, olive optimize, olive quantize)

All three approaches ultimately produce the same result: an optimized ONNX Runtime artifact that can be deployed with Foundry Local.

In production-oriented scenarios, however, I strongly favor the configuration-driven workflow using olive run --config (see the section Alternative Ways to Achieve the Same Optimization Workflow below). While olive auto-opt is the most convenient entry point, it does not work reliably across all models, configurations, and library combinations. The configuration-based approach offers greater transparency, reproducibility, and fine-grained control over optimization passes, hardware targets, and quantization parameters—making it significantly more robust than relying exclusively on the higher-level auto-opt abstraction.

Step 3 — Automatic Optimization with olive auto-opt

The fastest way to optimize a Hugging Face model is:

# You can also use the model directly from Hugging Face by passing
# --model_name_or_path microsoft/Phi-3-mini-4k-instruct
olive auto-opt \
  --model_name_or_path hf-models/Phi-3-mini-4k-instruct \
  --trust_remote_code \
  --output_path models/phi3-mini-ort \
  --device cpu \
  --provider CPUExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

This single command performs:

  1. ONNX graph capture
  2. Transformer graph optimizations
  3. Hardware-aware graph tuning
  4. INT4 weight quantization (blockwise RTN in this run)
  5. Conversion to ORT runtime format

If targeting GPU:

--device gpu
--provider CUDAExecutionProvider
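Before switching to the GPU target, it is worth confirming that the locally installed onnxruntime build actually exposes the CUDA execution provider. A minimal Python check:

# check_providers.py -- verify that CUDAExecutionProvider is available locally
import onnxruntime as ort

providers = ort.get_available_providers()
print("Available execution providers:", providers)

if "CUDAExecutionProvider" not in providers:
    print("CUDA EP not found: install a GPU-enabled ONNX Runtime build "
          "or keep --provider CPUExecutionProvider")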

Review the output generated by the olive auto-opt command:

+------------+-------------------+------------------------------+----------------+-----------+
| model_id   | parent_model_id   | from_pass                    |   duration_sec | metrics   |
+============+===================+==============================+================+===========+
| 479b58b6   |                   |                              |                |           |
+------------+-------------------+------------------------------+----------------+-----------+
| 15d39b94   | 479b58b6          | onnxconversion               |      0.0249999 |           |
+------------+-------------------+------------------------------+----------------+-----------+
| acaa1d8e   | 15d39b94          | modelbuilder                 |      0.0715292 |           |
+------------+-------------------+------------------------------+----------------+-----------+
| 2884cb1e   | acaa1d8e          | onnxpeepholeoptimizer        |      0.0270009 |           |
+------------+-------------------+------------------------------+----------------+-----------+
| 52f45d02   | 2884cb1e          | orttransformersoptimization  |      0.0275121 |           |
+------------+-------------------+------------------------------+----------------+-----------+
| b8019b2b   | 52f45d02          | onnxblockwisertnquantization |     37.6093    |           |
+------------+-------------------+------------------------------+----------------+-----------+
| ed7ce3c3   | b8019b2b          | extractadapters              |      0.218572  |           |
+------------+-------------------+------------------------------+----------------+-----------+

Below is an overview of the generated model artifacts. The directory contains the optimized ONNX model, runtime configuration files, tokenizer assets, and supporting Python modules required for successful deployment and execution with Foundry Local.

📁 phi3-mini-ort
├── 📄 chat_template.jinja
├── 📄 config.json
├── 🐍 configuration_phi3.py
├── 📄 genai_config.json
├── 📄 generation_config.json
├── 📦 model.onnx
├── 📦 model.onnx.data
├── 🐍 modeling_phi3.py
├── 📄 tokenizer.json
└── 📄 tokenizer_config.json

Note: CUDA workflows are generally more stable under WSL environments.
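Before moving on to deployment (or to the alternative workflows below), you can smoke-test the generated artifacts directly with ONNX Runtime GenAI. The sketch below assumes the output directory produced above and the generator loop of recent onnxruntime-genai releases; exact method names can vary between versions.

# smoke_test.py -- minimal generation loop against the optimized artifacts
import onnxruntime_genai as og

model_dir = "models/phi3-mini-ort"  # output of olive auto-opt
model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)

# Phi-3 chat template: user turn followed by the assistant tag
prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))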

Alternative Ways to Achieve the Same Optimization Workflow

In practice, the automatic optimization workflow does not succeed consistently across all models and environments. Library version mismatches, execution provider constraints, and CUDA dependency alignment can introduce instability. From hands-on experience, CUDA-based optimization tends to execute more reliably within WSL2—particularly when GPU drivers and runtime libraries are version-aligned.

For this reason, I frequently prefer using olive run with an explicit configuration file rather than relying solely on auto-opt. While auto-opt is convenient, it abstracts the optimization pipeline and can fail silently or provide limited transparency when troubleshooting complex models. The configuration-driven approach offers deterministic control over:

  • Optimization passes
  • Quantization strategy (e.g., GPTQ parameters)
  • Execution providers
  • Target hardware definitions
  • Output artifact handling

To follow this approach, create a pass_config.json file that specifies your model, target system, and optimization passes, with the following content:

{
  "input_model": {
    "type": "HfModel",
    "model_path": "C:\\onnx_olv\\phi3.5\\hf-models\\Phi-3-mini-4k-instruct"
  },
  "systems": {
    "local_system": {
      "type": "LocalSystem",
      "accelerators": [
        { "execution_providers": ["CPUExecutionProvider"] }
      ]
    }
  },
  "passes": {
    "quant": {
      "type": "gptq",
      "bits": 4,
      "sym": false,
      "group_size": 32,
      "lm_head": true
    },
    "modbuild": {
      "type": "ModelBuilder",
      "precision": "int4"
    }
  },
  "target": "local_system",
  "output_dir": "models/phi3-mini-ort"
}

and execute it:

olive run --config ./pass_config.json
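The same configuration can also be executed from Python through Olive's workflow entry point, which is convenient when the optimization step is embedded in a build or CI pipeline. A minimal sketch using the olive.workflows interface:

# run_workflow.py -- execute the Olive workflow programmatically
from olive.workflows import run as olive_run

# Equivalent to: olive run --config ./pass_config.json
olive_run("pass_config.json")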


Another alternative approach is to invoke the individual Olive CLI commands directly.

If you prefer greater control over each optimization stage rather than relying on the auto-opt abstraction, you can execute the workflow step by step. While olive auto-opt bundles ONNX export, graph optimization, and quantization into a single command, decomposing the process into discrete CLI operations allows you to validate outputs at each stage and fine-tune parameters as needed.

1. Capture ONNX Graph

Export the Hugging Face model into an ONNX graph suitable for ONNX Runtime GenAI:

olive capture-onnx-graph \
  --model_name_or_path hf-models/Phi-3-mini-4k-instruct \
  --task text-generation \
  --use_ort_genai \
  --trust_remote_code \
  --output_path models/phi3-mini-onnx \
  --device cpu \
  --provider CPUExecutionProvider

2. Optimize Graph

Apply ONNX graph optimizations (including transformer-specific optimizations when applicable):

olive optimize \
  --model_name_or_path models/phi3-mini-onnx \
  --output_path models/phi3-mini-instruct \
  --device cpu \
  --provider CPUExecutionProvider

3. Quantize

Quantize the optimized model to INT4 using GPTQ (or another supported algorithm):

olive quantize \
  --model_name_or_path models/phi3-mini-instruct \
  --algorithm gptq \
  --precision int4 \
  --output_path models/phi3-mini-ort \
  --device cpu \
  --provider CPUExecutionProvider
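A simple way to verify the effect of INT4 quantization is to compare the on-disk footprint of the original Hugging Face checkpoint with the quantized output. A stdlib-only sketch; the directory names match the steps above:

# compare_sizes.py -- on-disk size before and after quantization
from pathlib import Path

def dir_size_gb(path: str) -> float:
    """Sum the size of all files below `path`, in gigabytes."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1024**3

for name in ("hf-models/Phi-3-mini-4k-instruct", "models/phi3-mini-ort"):
    print(f"{name}: {dir_size_gb(name):.2f} GB")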

Olive supports multiple quantization algorithms, some of which require GPU acceleration. An overview is shown below:

| Implementation | Description | Model format(s) | Algorithm |
|----------------|-------------|-----------------|-----------|
| AWQ | Activation-aware Weight Quantization (AWQ) creates 4-bit quantized models; it speeds up models by about 3x and reduces memory requirements by about 3x compared to FP16. | PyTorch, ONNX | AWQ |
| GPTQ | Generative Pre-trained Transformer Quantization (GPTQ) is a one-shot weight quantization method. You can quantize your favorite language model to 8, 4, 3 or even 2 bits. | PyTorch, ONNX | GPTQ |
| QuaRot | QuaRot enables full 4-bit quantization of LLMs, including weights, activations, and KV cache, by using Hadamard rotations to remove outliers. | PyTorch | QuaRot |
| SpinQuant | SpinQuant learns rotation matrices to eliminate outliers in weights and activations, improving low-bit quantization without altering the model architecture. | PyTorch | SpinQuant |
| BitsAndBytes | MatMul with weights quantized to N bits (e.g., 2, 3, 4, 5, 6, 7). | ONNX | RTN |
| ORT | Static and dynamic quantization via ONNX Runtime. | ONNX | RTN |
| INC | Intel® Neural Compressor model compression tool. | ONNX | GPTQ |
| NVMO | NVIDIA TensorRT Model Optimizer, a library of state-of-the-art model optimization techniques including quantization, sparsity, distillation, and pruning. | ONNX | AWQ |
| Olive | Olive's own implementation of Half-Quadratic Quantization (HQQ) for MatMul to 4 bits. | ONNX | HQQ |

Below is also an overview of the available Olive CLI commands:

| Command | Description |
|---------|-------------|
| auto-opt | Automatically optimize a PyTorch model into ONNX, with optional quantization. |
| finetune | Fine-tune a model on a dataset using techniques such as LoRA and QLoRA. |
| capture-onnx-graph | Capture the ONNX graph from a Hugging Face or PyTorch model. |
| optimize | Optimize an ONNX model using various optimization passes. |
| quantize | Quantize a PyTorch or ONNX model using algorithms such as AWQ, QuaRot, GPTQ, RTN, and more. |
| extract-adapters | Extract LoRA adapters from a PyTorch model into separate files. |
| generate-adapter | Generate an ONNX model with adapters as inputs (accepts ONNX models only). |
| convert-adapters | Convert LoRA adapter weights into a file consumable by ONNX models generated by Olive's ExtractedAdapters pass. |
| run | Run the Olive workflow defined in a configuration file. |

Step 4 — Deployment with Foundry Local

The installation and base configuration of Microsoft Foundry Local are covered in detail in my dedicated blog post. In this section, we focus specifically on preparing the optimized model artifacts for deployment.

Before deployment, make the following configuration adjustments to ensure consistent and deterministic inference behavior:

1. In genai_config.json update the following keys:

"do_sample": false,
"temperature": 0.5,
"top_k": 50,
"top_p": 1.0

2. In tokenizer_config.json, set the tokenizer implementation explicitly to ensure compatibility with ONNX Runtime GenAI:

 "tokenizer_class": "PreTrainedTokenizerFast"

You can now configure Microsoft Foundry Local to register and serve the newly generated model.

First, locate the Foundry Local cache directory. This cache folder is the designated location where Microsoft Foundry Local scans for and registers locally available models for deployment:

foundry cache location
foundry cache list

Next, create a dedicated directory that will contain the optimized model artifacts:

$modelPath = Join-Path $HOME "MyModel"
mkdir $modelPath

Move the entire phi3-mini-ort directory (including all artifacts) into this location and set it as the local cache location:

foundry cache cd $modelPath

The service restarts automatically:

Restarting service...
🔴 Service is stopped.
🟢 Service is Started on http://127.0.0.1:52486/, PID 33880!

Verify that the model has been successfully registered and is recognized by Foundry Local:

foundry cache list

You should now see your newly registered model listed in the output, confirming that it has been successfully recognized by Foundry Local.

Models cached on device:
   Alias                                             Model ID
💾 phi3-mini-ort                                phi3-mini-ort

Once the Foundry Local service has successfully started, you can launch the optimized model using:

foundry run phi3-mini-ort 
🟢 Service is Started on http://127.0.0.1:64726/, PID 11912!
🕖 Downloading complete!...
Successfully downloaded and registered the following EPs: OpenVINOExecutionProvider, NvTensorRTRTXExecutionProvider, CUDAExecutionProvider.
Valid EPs: CPUExecutionProvider, WebGpuExecutionProvider, OpenVINOExecutionProvider, NvTensorRTRTXExecutionProvider, CUDAExecutionProvider
🕘 Loading model...
🟢 Model phi3.5-mini-ort loaded successfully

Interactive Chat. Enter /? or /help for help.
Press Ctrl+C to cancel generation. Type /exit to leave the chat.

Interactive mode, please enter your prompt
> 

And now, inference time:

> What is Microsoft Foundry?
🧠 Thinking...
🤖  Microsoft Foundry is a platform developed by Microsoft that provides a suite of tools and services designed to support the digital transformation of organizations. It aims to facilitate the creation, deployment, and management of AI-powered applications and services. Foundry offers a collaborative environment for developers, data scientists, and engineers to build and scale AI models, automate tasks, and integrate AI capabilities into existing systems.

The model responds interactively using ONNX Runtime GenAI.
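Beyond the interactive chat, the running service can also be queried programmatically. Foundry Local exposes an OpenAI-compatible endpoint on the port printed at startup; the sketch below uses the requests package pinned earlier and assumes the chat-completions route and the port from the run above.

# chat_request.py -- query the local endpoint over the OpenAI-compatible API
import requests

BASE_URL = "http://127.0.0.1:64726"  # port printed when the service started

payload = {
    "model": "phi3-mini-ort",  # model ID registered in the Foundry cache
    "messages": [{"role": "user", "content": "What is Microsoft Foundry?"}],
    "max_tokens": 256,
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])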

What This Workflow Delivers

By combining Olive and Azure AI Foundry Local, you achieve:

  • Automated, constraint-aware model optimization
  • GPTQ-based INT4 compression for reduced memory footprint
  • ORT runtime artifact generation
  • Local inference without cloud dependency
  • Support for advanced deployment patterns such as Multi-LoRA

Final Thoughts

Local AI inference is no longer a research experiment — it is a production capability.

Using Microsoft’s Olive CLI, you can transform a 7GB transformer model into an optimized, hardware-aware ONNX Runtime artifact ready for deployment on commodity hardware. Combined with Foundry Local, this creates a reproducible, enterprise-aligned workflow for secure, offline AI deployments.

As AI hardware continues to evolve — from CPUs to GPUs to dedicated NPUs — hardware-aware optimization will increasingly become a core competency for AI engineers.

And with Olive, Microsoft has provided a robust foundation for that future.
