
Own Your AI - Hardware, Models, Tuning: How to Actually Make Local LLMs Productive

Hosted by Andreas Burner & Andreas Petersson
About Event

** We have a few places left. Apply to the waiting list with your motivation to get a seat **

AI is becoming infrastructure. The question is no longer whether you use it — it's who controls it, who pays for it, and who sees your data.

Own Your AI is a new engineering group for practitioners who want to own AI, not rent it. Sovereignty over your models, your data, and your costs. Independence from a handful of hyperscalers. Mastery from the token up to the hardware. A community of people who actively build systems and want to exchange notes with peers who do the same.

For Event #1, we start exactly where most local-AI projects fail: at hardware decisions and inference tuning. Two talks from practitioners who have done this in real client engagements, plus lightning sessions and open discussion.

Who is this for?

  • LLM Engineers & Builders — you work with prompts, agents, and models every day and want to get more out of your local stack than the defaults deliver

  • Platform & Infrastructure Engineers — you run inference servers, fight with VRAM, KV cache, and tokens per second, and have to hit SLAs

  • Architects — you decide whether a new system runs on-prem, hybrid, or in the cloud, and you need solid numbers instead of vendor benchmarks

  • Engineering Leaders & Decision Makers — you plan budget and roadmap for AI workloads and want to know for which use cases local actually pays off

  • Compliance, Legal & Security — you have to reconcile data residency, EU AI Act, and audit requirements with what the engineering teams want to deploy

Agenda

Talk 1 — Andreas Petersson

The Decision Matrix: Which LLM on Which Hardware — and When the Cloud Is the More Honest Answer

A practitioner's guide through the current inference hardware market — from 2,000-euro mini PCs to the H100. Which models actually run well on which system, where are the real limits, and at what workload does the math tip in favour of private or public cloud?

We walk through the relevant options one by one: Mac Studio as the unified-memory workhorse, Mac Mini as the cheapest entry point, AMD Ryzen AI Max+ 395 (e.g. as the Zotac ZBOX Magnus) as a new x86 contender with a large unified-memory pool, Nvidia consumer GPUs (RTX 4090 / 5090) for maximum tokens per second in single-user mode, Nvidia enterprise GPUs (H100 / A100) for multi-tenant inference, and private vs. public cloud services (dedicated EU providers, Bedrock & Co.) as the comparison anchor. For each platform: which model sizes and quantisations are realistic, which context windows work without swap, and what the total cost of ownership actually adds up to.

What we'll look at:

  • A concrete Decision Matrix: use case → model class → minimal viable hardware

  • Realistic tokens-per-second numbers for 7B, 14B, 30B, and 70B+ models across platforms

  • Memory-bandwidth and VRAM limits where most setups fail in production

  • A direct cost comparison: CapEx (hardware) vs. OpEx (cloud)

  • When unified memory (Apple, AMD) beats the GPU — and when it doesn't

  • Hybrid architectures: local vs. cloud for burst workloads
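
As a rough illustration of the CapEx vs. OpEx comparison listed above, the break-even point can be sketched in a few lines of Python. The function name and all figures are hypothetical placeholders, not numbers from the talk, and the model deliberately ignores depreciation, resale value, and staff time.

```python
def break_even_months(hardware_cost, local_monthly_cost, cloud_monthly_cost):
    """Months until buying hardware becomes cheaper than renting.

    Returns None when the cloud is cheaper per month, i.e. the
    hardware purchase never pays for itself.
    """
    monthly_saving = cloud_monthly_cost - local_monthly_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical inputs in euros, chosen only to show the arithmetic.
months = break_even_months(
    hardware_cost=6000.0,      # one-off purchase (CapEx)
    local_monthly_cost=100.0,  # power, hosting, maintenance (OpEx, local)
    cloud_monthly_cost=600.0,  # equivalent cloud spend (OpEx, cloud)
)
print(months)
```

If the workload is small enough that the cloud bill stays below the local running cost, the function returns None, which is the case where the cloud is the more honest answer.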

Talk 2 — Andreas Burner

3× Faster, Same Hardware: 10 Tuning Knobs That Turn Your Local LLM Stack From a Toy Into a Tool

Most local LLM setups run on default parameters — and deliver a fraction of what the hardware can do. This talk shows how systematic tuning of llama.cpp, Ollama, and LM Studio took throughput from 0.3 tok/s to 6 tok/s on unchanged hardware — and why the same levers apply to vLLM on OpenShift AI.

Inference parameters do not act in isolation: temperature interacts with sampling, context size with KV cache and parallel slots, thread count with CPU topology. Get two parameters right and leave one bad default in place, and you still lose most of the performance. We walk through the ten most important knobs one by one (measured against real coding benchmarks, not synthetic perplexity scores) and look at what becomes possible with an agent harness when the inference layer is finally configured correctly.

What we'll look at:

  • Batch and context sizing — significantly faster prompt processing without swapping hardware

  • Parallel slots & KV cache — the most common reason local setups end up in swap

  • CPU thread pinning on Intel hybrid cores — up to 3× throughput from a single setting

  • KV cache quantisation — when it wins on CPU and silently costs quality on GPU

  • Reasoning budget for thinking models — why the wrong cap is worse than no thinking at all

  • Context window sync between agent harness and inference server — the invisible bug behind aborted long-context tasks

  • Translation to vLLM / OpenShift AI — what carries over, what changes, what becomes more important in multi-tenant setups
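
To make the interaction between context size, KV cache, and parallel slots concrete, here is a minimal sizing sketch in Python. It uses the standard estimate for a transformer KV cache (keys plus values, per layer, per KV head, per token). The model dimensions are hypothetical, and the even split of one shared context budget across slots is an assumption about the server's behaviour, so check your inference server's documentation.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimated KV cache size in bytes for n_ctx cached tokens.

    The leading factor 2 accounts for keys and values; bytes_per_elem
    is 2 for f16 and 1 for an 8-bit quantised cache (block overhead ignored).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def ctx_per_slot(n_ctx_total, n_parallel):
    """Context available to each slot, assuming an even split (assumption)."""
    return n_ctx_total // n_parallel


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
GIB = 1024 ** 3
f16_gib = kv_cache_bytes(32, 8, 128, n_ctx=32768) / GIB
q8_gib = kv_cache_bytes(32, 8, 128, n_ctx=32768, bytes_per_elem=1) / GIB
print(f"f16 cache: {f16_gib} GiB, 8-bit cache: {q8_gib} GiB")
print(f"context per slot with 4 slots: {ctx_per_slot(32768, 4)} tokens")
```

Under these assumptions a 32k context costs 4 GiB of cache on top of the model weights, and four parallel slots leave each request only 8192 tokens, which is one way an agent harness that expects the full window ends up with truncated or aborted tasks.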

Format

  • Two focus talks, lightning sessions, open discussion, and networking

  • Language: talks in English or German — the audience decides

  • Location: Vienna, in person only. No online attendance.

  • Cadence: every 1–2 months

Why in person only? The value is in the hallway conversations, the whiteboard sketches, and the people you meet.

Hosts

Andreas Burner — Management Advisor for AI & Cloud Strategy, Board Advisor, and Court-Certified Expert. 25+ years in global enterprise technology. BurnerNet.com

Andreas Petersson — Technology Advisor and Founder of Capacity (capacity.at). Two decades of experience building and auditing decentralised, security-critical systems for enterprise and the public sector.

Registration

Seats are limited — we deliberately keep the session small so that discussion and networking actually work. Register here via Luma. You'll receive the exact location, the final speaker line-up, and the agenda in good time before the event.

Own the stack. Own the data. Own the future.

More Information: https://OwnYourAI.eu

Location
4future
Graben 17/10, 1010 Wien, Austria