Running AI models on your own device is one of the biggest tech trends right now.
Tools like Foundry Local, Ollama, LM Studio, and llama.cpp make it possible to run advanced language models directly on your laptop — fully offline and without sending data to the cloud.
At Microsoft Ignite 2025, Microsoft also introduced major updates to Foundry and Foundry Local, making local inference even easier.
Until now, Ollama and LM Studio have been my main tools for running local models, thanks to their simplicity and great UIs. Lately, I’ve been exploring Foundry Local, and I’m really impressed with its potential — I expect it to become a much bigger part of my workflow going forward.
Why Run an AI Model Locally?
Here’s a practical example from my workflow:
I use Obsidian for note-taking, and my notes often include many screenshots.
Naming each image manually during note-taking is slow and annoying, so I end up with a folder full of files like:

These names provide no hint about what’s inside the image.
Since I already use Tesseract OCR in another part of my workflow, I decided to reuse it here as well. So I created a small script that sends three pieces of information to a local Phi-3.5 model:
- the latest note text
- the file path
- OCR text extracted with Tesseract, which helps the model understand what’s visible inside the screenshot
Using this combined context, the model generates a clean, meaningful, human-readable filename.
Here’s the simple prompt I use:
You are an expert at creating file names from text.
You output ONLY a very short, filename-safe slug from the given context.
OCR Text:
Nearest heading or line:
${(contextObj.heading || "").slice(0, 300)}
Folder path context (ignore generic segments like attachments/images):
${(contextObj.pathCtx || "").slice(0, 300)}
Note context excerpt (unstructured; may include headings/paragraphs):
${(contextObj.noteCtx || "").slice(0, 300)}
Rules:
- Output ONLY 2–4 simple words, lowercase, joined by single hyphens.
- Do NOT include file extensions (png, jpg, pdf), timestamps, or generic words (attachments, pasted, image, screenshot).
- Prefer specific domain terms from OCR; otherwise from heading; otherwise from note context; otherwise from path context.
- No punctuation except hyphens. No quotes. No extra commentary.
Slug:
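
For completeness, here is a minimal sketch of how a script like this can call the local model. It assumes an OpenAI-compatible chat-completions endpoint, which both Foundry Local and Ollama can expose on localhost; the URL, port, and model id are placeholders for whatever your runtime reports, the `ocr` field name is my own, and the rules in the prompt are condensed to keep the sketch short:

```typescript
// Shape of the context gathered for each screenshot. The "ocr" field name is
// a placeholder; the other field names match the prompt template above.
interface ImageContext {
  ocr: string;      // text Tesseract extracted from the screenshot
  heading: string;  // nearest heading or line in the note
  pathCtx: string;  // folder path context
  noteCtx: string;  // excerpt of the latest note text
}

// Fills the prompt template shown above (abbreviated here to the key lines).
function buildPrompt(ctx: ImageContext): string {
  return [
    "You are an expert at creating file names from text.",
    "You output ONLY a very short, filename-safe slug from the given context.",
    `OCR Text:\n${ctx.ocr.slice(0, 300)}`,
    `Nearest heading or line:\n${ctx.heading.slice(0, 300)}`,
    `Folder path context (ignore generic segments like attachments/images):\n${ctx.pathCtx.slice(0, 300)}`,
    `Note context excerpt (unstructured; may include headings/paragraphs):\n${ctx.noteCtx.slice(0, 300)}`,
    "Rules: output ONLY 2-4 lowercase words joined by single hyphens. Slug:",
  ].join("\n\n");
}

async function suggestSlug(contextObj: ImageContext): Promise<string> {
  // Placeholder endpoint: Ollama's default OpenAI-compatible URL. Foundry Local
  // exposes its own local endpoint, so use whatever your runtime reports.
  const res = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "phi3.5",                                 // placeholder model id
      messages: [{ role: "user", content: buildPrompt(contextObj) }],
      temperature: 0.2,                                // keep the output stable
      max_tokens: 20,                                  // a slug is only a few tokens
    }),
  });

  const data = (await res.json()) as any;
  // Normalize whatever comes back into a filename-safe slug.
  return String(data.choices[0].message.content)
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-")
    .replace(/^-+|-+$/g, "");
}
```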
See the filenames created by the locally running Phi-3.5 model below:

It works surprisingly well — and this is exactly the kind of task where local AI shines: fast, private, automated workflows without relying on the cloud.
How to Choose the Right Model Size for Local Use
When deciding which AI model to run locally, the two most important factors are:
- Available memory (VRAM or RAM)
- Precision / quantization level
From these, you can estimate the maximum model size you can run:
Max model size (in billions of params) ≈ usable_memory / (bytes_per_param × overhead)
Where Will Your Model Run? GPU or CPU?
If you have a dedicated GPU, the model will usually try to run in VRAM. If the GPU is too weak or its VRAM is too small, the runtime falls back to system RAM. NPUs exist on many CPUs, but most LLM tools still don’t fully support them, so we usually ignore the NPU when sizing a model.
Also remember:
Your entire memory isn’t available for the model. The OS and apps need memory too.
Example:
A 16 GB RAM laptop realistically has 10–12 GB free for the model.
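If you want an actual number instead of a rule of thumb, it’s easy to check from a script. Here is a minimal sketch using Node.js’s built-in os module; it only sees system RAM (not VRAM), and the 5 GB reserve is just an assumption:

```typescript
import * as os from "node:os";

// Rough look at how much RAM is around right now (system RAM only, not VRAM).
const totalGB = os.totalmem() / 1e9;
const freeGB = os.freemem() / 1e9;

// Keep a healthy reserve for the OS and other apps; ~5 GB is an assumption,
// not a rule — adjust it to what your machine actually needs.
const reserveGB = 5;
const usableGB = Math.max(0, Math.min(freeGB, totalGB - reserveGB));

console.log(`Total: ${totalGB.toFixed(1)} GB, free: ${freeGB.toFixed(1)} GB`);
console.log(`Plan for roughly ${usableGB.toFixed(1)} GB for the model.`);
```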
Bytes Per Parameter: Understanding Precision / Quantization
Every model has parameters (weights), and each parameter takes up space depending on its precision:
| Precision | Bits | Bytes per parameter |
|---|---|---|
| Float32 (FP32) | 32 | 4 bytes |
| Float16 / BF16 | 16 | 2 bytes |
| Int8 (8-bit) | 8 | 1 byte |
| Int4 (4-bit quant) | 4 | 0.5 bytes |
| Int2 (2-bit quant) | 2 | 0.25 bytes |
Most local setups use 4-bit or 8-bit quantization because it drastically reduces memory usage. To account for activation buffers, KV cache, and temporary tensors, we add an overhead factor (usually 1.3–1.5).
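Put together, a model’s memory footprint is roughly parameters × bytes per parameter × overhead. Here is a minimal sketch of that estimate; the 7B example and the 1.3 overhead are illustrative values, not measurements:

```typescript
// Rough memory estimate: parameters × bytes per parameter × overhead factor.
function estimateMemoryGB(
  paramsBillions: number,
  bytesPerParam: number,
  overhead = 1.3
): number {
  return (paramsBillions * 1e9 * bytesPerParam * overhead) / 1e9;
}

// Example: a 7B-parameter model at different precisions (illustrative only).
for (const [label, bytes] of [["FP16", 2], ["Int8", 1], ["Int4", 0.5]] as const) {
  console.log(`${label}: ~${estimateMemoryGB(7, bytes).toFixed(1)} GB`);
}
// FP16: ~18.2 GB, Int8: ~9.1 GB, Int4: ~4.6 GB
```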
Simple Formula for Maximum Model Size
Let:
- M = usable memory in GB
- B = bytes per parameter
- O = overhead factor (e.g. 1.3)
1. Convert memory to bytes
M_bytes = M × 1,000,000,000
2. Compute max parameters
max_params = M_bytes / (B × O)
3. Convert to billions of parameters
max_params / 1,000,000,000
Example: My Machine
- Total RAM: 64 GB
- Reserve for OS/apps: 20 GB
- Available for model: 44 GB
- Precision: 4-bit (0.5 bytes)
- Overhead: 1.3
M_bytes = 44 × 1,000,000,000 = 44,000,000,000
Effective bytes per parameter = 0.5 × 1.3 = 0.65
max_params = 44,000,000,000 / 0.65 ≈ 67.7B
In theory, I can load a ~67B parameter model.
In practice, I prefer models under 16B because they run noticeably faster and feel more responsive on my system.
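The same arithmetic as a small helper, reproducing the numbers above (a sketch only; the reserve, precision, and overhead are the assumptions listed in the example):

```typescript
// Maximum model size (in billions of parameters) that fits in usable memory.
function maxModelSizeB(
  usableGB: number,
  bytesPerParam: number,
  overhead = 1.3
): number {
  const usableBytes = usableGB * 1e9;
  return usableBytes / (bytesPerParam * overhead) / 1e9;
}

// My machine: 44 GB usable, 4-bit weights (0.5 bytes), 1.3 overhead.
console.log(maxModelSizeB(44, 0.5).toFixed(1)); // ≈ 67.7
```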
What Exactly Is 4-Bit Quantization?
Models are originally trained at high precision (usually FP32 or BF16).
Quantization compresses these parameters to a lower precision format.
4-bit quantization means:
- Each weight uses only 4 bits instead of 32
- Model size drops by up to 8×
- Inference becomes much faster
- Some accuracy is lost — but usually not enough to matter for everyday tasks
Quantization does not change:
- the number of parameters
- the architecture
- the model’s general capabilities
It only changes how efficiently the model can run on your machine.
Conclusion
Running AI models locally is becoming easier, faster, and more privacy-friendly.
Tools like Foundry Local, Ollama, and LM Studio let anyone experiment with powerful models offline.
To choose the right local model, focus on:
- your available RAM/VRAM
- the quantization level
- real-world performance, not theoretical maximums
And once you start automating your workflows — like I did with automatic screenshot naming — you’ll quickly see the benefits of running AI locally.
Resources Mentioned
Here are the tools and references mentioned throughout the post:
- Learn more about note-taking with Obsidian: https://obsidian.md
- Microsoft’s Phi models: https://azure.microsoft.com/en-us/products/phi
- Run AI locally with Foundry Local: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/what-is-foundry-local?view=foundry-classic
- Tesseract: Open-source OCR for image-to-text: https://github.com/tesseract-ocr/tesseract
- Simple local model runner (Ollama): https://ollama.com
- Desktop LLM interface (LM Studio): https://lmstudio.ai/