
What Folding Laundry Taught Me About Working With AI

Published:

Yesterday evening I was folding laundry. It was one of those pesky loads: the basket was filled with socks. We’re a four-person household; in theory, that should make it easier to distinguish all the socks.

I made some space on the table to accommodate the individual socks; laying them out flat helps find pairs. After folding one-third of the basket, I realized that the space I had assigned was way too small and already overflowing. Since the rest of the table was full, there was no space left to allocate for more socks. This seemingly simple and mundane task suddenly induced stress in me. Where would I put all these socks now?

Granted, the solution was quite easy in this case: I made some space by putting away some of the folded laundry, and then I had enough room for the socks. What’s the connection to working with AI, you ask?

When AI became publicly available with the launch of ChatGPT, many people immediately recognized this technology’s potential. Recognizing that it’s a new technology with many unknowns, they created companies and planned generously, allowing the companies ample time to find product-market fit and generate revenue.

Stress occurs when plans and reality diverge. It’s the same mechanism, whether it’s about space for your socks or your company’s runway. Right now, we see many companies entering a stressful phase, especially the big ones. OpenAI, for example, issued a Code Red in an internal memo. Apple abruptly fired their AI chief, John Giannandrea.

Delivering value with AI is a lot harder than everyone thought; we underestimated its complexity. This has led investors to attempt crazy things. This TechCrunch article provides an absurd example: pumping $90 million into a business with an annual recurring revenue of around $400,000, valuing it at $415 million. This strategy is called king making: declaring a winner in a market and hoping to convince customers to choose the “market leader.” It’s another symptom of the stress we’re seeing in the system right now.

This great article by Paul Ford brings it all together. He wishes for the bubble to burst, because then the frenzy for return on investment ends and we can focus on letting nerds do their best work.

Happy hacking!

Why You Should Buy an AMD machine for Local LLM Inference in 2025

Published:

We’ve covered why NVIDIA consumer cards hit a 32GB wall and why Apple’s RAM pricing is prohibitive. Now let’s talk about the actual solution: AMD Ryzen AI Max+ 395 with 128GB unified memory.

This is the hardware I chose for my home LLM inference server. Here’s why.

It’s Open, Baby!

In contrast to its two big competitors, NVIDIA and Apple, AMD keeps a huge amount of its stack open source. What CUDA is for NVIDIA and MLX is for Apple, ROCm is for AMD. It’s fully open source, available on GitHub, and sees a huge amount of activity. This not only gives me a warm and fuzzy feeling, but also a lot of confidence that this stack will continue to go in the right direction.

The Hardware That Changes the Game

AMD Ryzen AI Max+ 395 offers something unique in the prosumer market:

  • 128GB of fast unified memory (96GB available to GPU)
  • Integrated GPU with discrete-class performance
  • Complete system cost: 2000-2500 Euro
  • Less than half the cost of the equivalent Mac Studio!

To make this more concrete: you can run a 70B model quantized to 4-bit (~38GB) and still have 50GB+ for context. That’s enough for 250K+ token contexts, legitimately long-document processing, extensive conversation history, and complex RAG workflows.
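As a rough sanity check, here is the back-of-the-envelope math behind those numbers, as a minimal Python sketch. The ~0.2 MB of KV cache per token is my assumption; the real figure depends on the model’s attention architecture and whether the KV cache itself is quantized.

GPU_MEMORY_GB = 96            # GPU-accessible share of the 128GB unified memory
MODEL_WEIGHTS_GB = 38         # 70B model, 4-bit quantization
KV_CACHE_MB_PER_TOKEN = 0.2   # assumed; varies by model and KV-cache precision

context_budget_gb = GPU_MEMORY_GB - MODEL_WEIGHTS_GB
max_context_tokens = context_budget_gb * 1024 / KV_CACHE_MB_PER_TOKEN

print(f"Left for context: {context_budget_gb} GB")                        # 58 GB
print(f"Approximate context capacity: {max_context_tokens:,.0f} tokens")  # ~297,000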

Looking a bit into the future, it’s not hard to imagine AMD shipping the system with 256 gigabytes of RAM for a reasonable price. It’s very hard to imagine Apple shipping a 256-gigabyte machine for a reasonable price. It’s just how they make their money.

Comparison to the DGX Spark

The recently released DGX Spark is a valid competitor to AMD’s AI Max series. It also features 128GB of fast unified memory. From a pure hardware value perspective, the NVIDIA DGX Spark is the most compelling alternative on the market in October 2025. Street price is around 4500 Euro right now, almost double the AMD system. You get a beautiful box with very comparable hardware and better driver support. You even get a good starting point for your first experiments, like downloading LLMs and training your own model. But everything you build on is closed source. You’re 100% dependent on NVIDIA staying on top of the game, on a machine that doesn’t make a lot of money for NVIDIA. I’m not that optimistic.

With the recent explosion of speed in software, helped along by coding agents, I’m not confident any company can stay on top of all of that. Especially not a company that earns its biggest profits in this sector.

The NVIDIA DGX Spark is also Arm-based, which isn’t a problem for inference and training, but it is for another use case that’s becoming important.

Running Apps and LLMs Side by Side

If you are doing LLM inference on a local machine, the easiest setup is to also run the apps that need the inference on the same machine. Running two machines is possible, but it opens a huge can of worms. Even though it might not seem so intuitively, such distributed systems are complex. Not twice as complex, more like exponentially complex. Here’s a golden question from 10 years ago on Stack Overflow that tries to explain it.

So running everything on one machine is much simpler. With AMD you stay on x86-64, the most common CPU architecture available. With the DGX Spark, you’re in Arm land. That architecture is gaining traction, but it’s still a long way from universal support. If you’re planning to experiment with a lot of small open-source, dockerized apps like I do, this is a big plus for the AMD route.
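To make the “one machine” point concrete, here is a minimal sketch of an app talking to a local inference server over the loopback interface. It assumes an OpenAI-compatible server (for example llama.cpp’s llama-server or Ollama) listening on port 8080 with a model already loaded; the URL, port, and model name are placeholders for whatever your setup uses.

import requests

# App and inference server share one host, so there is no network topology to manage.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "qwen3-32b",  # whatever model the local server has loaded
        "messages": [{"role": "user", "content": "Summarize this log line: ..."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])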

The Driver Reality

This is the real trade-off: AMD’s software support lags behind NVIDIA and Apple by 1-3 months for bleeding-edge models.

As we discussed in our Qwen3-Next case study:

  • vLLM doesn’t officially support gfx1151 (the Ryzen AI 395’s GPU architecture) yet
  • For architecturally novel models, you’re waiting on llama.cpp implementations
  • ROCm 7.0 works well for established models, but cutting-edge architectures take longer

Important context: This is about bleeding-edge model support, not general capability. I run Qwen3 32B, Llama 3.1 70B, DeepSeek, and multimodal models without issues. The hardware is capable; the ecosystem just needs time to catch up. Whether and when AMD fully catches up is unknown. I just want to make clear that it’s a bet.

Why Not Regular AMD GPUs?

Before we conclude, let’s address another obvious question: what about regular AMD GPUs?

AMD Radeon AI PRO R9700 (32GB) or similar:

  • Consumer price point (1400 Euro)
  • 32GB VRAM
  • Same problem as NVIDIA consumer cards, but cheaper

These cards face the same memory ceiling as NVIDIA consumer cards. Yes, driver support has improved significantly with ROCm 6.x and 7.0. But you’re still dealing with the fundamental limitation. They’re cheaper, so you can stack them together, like Level1Techs does.

Two reasons speak against this: First, you’re building a highly custom machine, with all sorts of compatibility issues. Second, at 300W each, the power draw is huge.

Conclusion

The Ryzen AI Max+ 395 is special because it’s the only prosumer-priced hardware offering 128GB of unified memory accessible to the GPU, coming in a standardized package with decent energy efficiency.

Previously: Why you shouldn’t buy an NVIDIA GPU and Why you shouldn’t buy into the Apple ecosystem.

This concludes our three-part hardware series. The message is simple: 128GB unified memory at a reasonable price changes everything for local LLM inference, and right now, AMD is the only one delivering that.

Why you shouldn't buy into the Apple ecosystem for local LLM inference

Published:

Apple Silicon Macs are engineering marvels. The unified memory architecture works beautifully for AI workloads, MLX provides excellent framework support, and the hardware delivers impressive performance. As we saw in our deep dive on running Qwen3-Next-80B, Macs can run large models with excellent inference speed.

But here’s the hard truth: Apple’s RAM pricing model makes their hardware prohibitively expensive for local LLM inference.

This is the second in a three-part hardware series. Part one covered why NVIDIA consumer cards fall short. Now let’s talk about why Apple’s pricing ruins what would otherwise be a very good solution.

What Apple Gets Right

Let’s start with what makes Apple Silicon genuinely impressive for LLM inference:

Unified Memory Architecture

This is where Apple’s engineering truly shines. Unlike traditional systems where CPU and GPU have separate memory pools requiring constant data copying, Apple Silicon uses one unified memory pool accessible to everything.

Here’s why this matters for LLM inference:

No data copying overhead: When you load a model, it sits in memory once. The GPU doesn’t need to copy data from CPU RAM. The Neural Engine can access it directly. There’s no PCIe bottleneck, no memory duplication, just direct access.

Memory bandwidth: This is where Apple Silicon separates itself from the competition. The memory controllers are integrated directly into the SoC, enabling extremely high bandwidth:

M4 Pro (Mac Mini M4 Pro):

  • Up to 273 GB/s memory bandwidth
  • 64GB max configuration
  • 2 TB SSD to hold some LLM models
  • No option for a second SSD
  • Price: 3200 Euro
  • Excellent value for most LLM use cases

M4 Max:

  • Up to 546 GB/s memory bandwidth
  • 128GB max configuration
  • 2 TB SSD
  • Price: 4800 Euro
  • Highest bandwidth in the consumer space

Why bandwidth matters: LLM inference is memory-bound. You’re constantly reading model weights and KV cache from memory. With 546 GB/s on the M4 Max, you can feed the GPU fast enough to maintain high token generation speeds even with massive models. This is 2-3x the bandwidth of typical DDR5 systems and far exceeds what you get with discrete GPUs (which are limited by PCIe bandwidth for system RAM access).

For comparison, a typical high-end DDR5 system might deliver 100-150 GB/s. AMD’s Ryzen AI Max+ 395 delivers M4 Pro levels, around 250 GB/s. Apple’s M4 Max at 546 GB/s is in a league of its own.

This bandwidth advantage is why Macs can achieve 50+ tokens/sec on 70B-80B models despite having less raw compute than some alternatives. You’re not waiting on memory access—the bottleneck is genuinely compute, not memory bandwidth.
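A rough way to see why, sketched in Python: during decoding, every generated token has to stream at least the active weights from memory, so memory bandwidth divided by bytes read per token gives an upper bound on tokens per second. The numbers below are illustrative assumptions for an MoE model like Qwen3-Next-80B, not benchmarks.

BANDWIDTH_GB_S = 546       # M4 Max memory bandwidth
ACTIVE_PARAMS_B = 3        # Qwen3-Next-80B activates roughly 3B parameters per token
BYTES_PER_PARAM = 0.5      # 4-bit quantization

gb_streamed_per_token = ACTIVE_PARAMS_B * BYTES_PER_PARAM   # ~1.5 GB per token
ceiling = BANDWIDTH_GB_S / gb_streamed_per_token
print(f"Theoretical ceiling: ~{ceiling:.0f} tokens/sec")    # ~364; real-world throughput is far lower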

For LLM inference, this unified memory architecture with massive bandwidth is exactly what you want. It’s genuinely impressive engineering.

Excellent Driver Support

The MLX framework provides day-one support for novel model architectures. When Qwen3-Next-80B dropped with its new architecture, MLX had it running immediately. No waiting for driver updates, no compatibility hacks.

The Problems: Pricing and macOS

Here’s where it all falls apart: Apple’s RAM pricing is absurd. It has been like this for a long time, and as Apple is a highly profitable company, I see zero chance of this changing in the near future. Actually, it’s an understandable strategy to keep their high margins.

Even if you consider paying the premium to run on Apple hardware, I don’t recommend running Macs as servers. Apple’s macOS is an operating system built for personal computers. While headless usage has improved, it’s still an afterthought for Apple. Running a secure node is just not that easy and forces you to work against the OS. Running Linux is somewhat possible, but it’s also an edge case for those distributions. The last thing you want in a quickly developing hardware ecosystem is to be an edge case, as that usually leads to driver problems and other obscure issues.

The Takeaway

Apple Silicon is legitimately impressive hardware. The engineering is excellent, the performance is strong, and the software ecosystem is mature.

But Apple’s RAM pricing and macOS ruin what would otherwise be the perfect solution for local LLM inference.

For price-conscious builders who want maximum memory for local LLM inference, Apple simply doesn’t make financial sense.

Previously: Why you shouldn’t buy an NVIDIA GPU - the 32GB limitation problem.

Next: Why you should buy an AMD machine - the actual solution that gives you fast 128GB of RAM for half the price.

Why you shouldn't buy an NVIDIA GPU or the DGX Spark for local LLM inference in 2025

Published:

When you’re shopping for hardware to run LLMs locally, the conversation typically starts with NVIDIA GPUs. They have the best driver support, the most mature ecosystem, and work with everything. But here’s the problem: consumer NVIDIA cards are hitting a hard ceiling that makes them unsuitable for modern local LLM inference.

This is the first in a three-part series on hardware options in 2025. We’ll cover why NVIDIA consumer cards fall short, why Apple’s ecosystem pricing is prohibitive, and ultimately, what you should actually buy.

The 32GB Wall and custom builds

Let’s start with the most established path: NVIDIA GPUs.

NVIDIA has the best driver support, the most mature ecosystem, and the widest software compatibility. If you’re running vLLM, Transformers, or any PyTorch-based inference stack, NVIDIA just works.

The problem? Most consumer and prosumer NVIDIA cards top out at 32GB of VRAM. On top of that, there’s a reason NVIDIA’s stock price is soaring: they demand a huge premium for their products:

  • RTX 5090: 32GB, 2500+ Euro
  • RTX 3090: 24GB, 1300+ Euro
  • (Professional cards like the RTX 6000 Pro Blackwell: 96GB, 8,000+ Euro)

With one of these in the shopping cart, you just have the graphics card; the remaining components like CPU, RAM, SSDs, power supply and a case to house it all are still missing. There is no standard around building this; everything is possible. While that may sound cool, it’s a maintenance nightmare. Do I see this odd behavior on this machine because I have an Intel CPU and an ASUS motherboard? You can never rule it out, and usually you don’t have a second system to compare against easily. Stability requires uniformity; there’s no way around it.

Quantization Explained

Before we talk about memory constraints, let’s understand how model size changes with quantization.

Modern LLMs store parameters as numbers. Quantization reduces the precision of these numbers to save memory. Here’s what that looks like for two common model sizes:

32B Model (like Qwen3 32B):

  • FP16 (16-bit): ~64GB
  • 8-bit quantization: ~32GB
  • 4-bit quantization: ~16-18GB
  • 3-bit quantization: ~12-14GB

70B Model (like Llama 3.1 70B):

  • FP16 (16-bit): ~140GB
  • 8-bit quantization: ~70GB
  • 4-bit quantization: ~35-38GB
  • 3-bit quantization: ~26-28GB

The rule of thumb: multiply the parameter count by the bits per parameter and divide by eight to get the approximate memory usage in bytes. A 32B model at 4-bit quantization needs roughly 32B × 4 / 8 = 16GB, plus some overhead for model structure.

Lower quantization (3-bit, 4-bit) saves memory but reduces model accuracy slightly. For most local inference use cases, 4-bit quantization offers the best balance: you keep ~95% of model quality while cutting memory usage to ~25% of the original.
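Here is that rule of thumb as a tiny Python helper; it ignores the extra overhead for embeddings, norms, and file structure, which is why the lists above show slightly larger ranges.

def approx_model_size_gb(params_billion: float, bits: int) -> float:
    # parameters * bits per parameter / 8 gives bytes; billions of parameters map directly to GB
    return params_billion * bits / 8

for params in (32, 70):
    for bits in (16, 8, 4, 3):
        print(f"{params}B @ {bits}-bit: ~{approx_model_size_gb(params, bits):.0f} GB")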

But here’s the catch: the model weights are only part of the story.

Why 32GB Isn’t Enough Anymore

Here’s the thing people miss: GPU memory doesn’t just hold the model weights. It also holds the KV cache, which stores the attention keys and values for your entire context window. Avid agentic coders know this as their context, and it’s always running out too fast!

Let’s do the math on a practical example:

Qwen3 32B (4-bit quantization):

  • Model weights: ~18GB
  • You have 14GB left for context on a 32GB card
  • The KV cache size per token depends on the model’s architecture. A typical 32B model consumes around 0.5 to 1.0 MB of VRAM per token in the context window
  • With 14GB remaining: 14,000 MB / 0.5 MB per token = ~28,000 tokens of context (or as low as ~14,000 tokens with FP16 KV cache)
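The same calculation as a reusable sketch; the KV-cache cost per token is the big unknown here, so both figures from the list above are shown.

def max_context_tokens(vram_gb: float, model_gb: float, kv_mb_per_token: float) -> int:
    free_mb = (vram_gb - model_gb) * 1024
    return int(free_mb / kv_mb_per_token)

print(max_context_tokens(32, 18, 0.5))   # ~28,672 tokens with a quantized (8-bit) KV cache
print(max_context_tokens(32, 18, 1.0))   # ~14,336 tokens with an FP16 KV cache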

That sounds like a lot—until you understand what actually consumes those tokens in real-world usage.

What Eats Your Context Window

Here’s what fills up those 28,000 tokens in practice:

System prompts and instructions: 500-2,000 tokens

  • Base system prompt defining the agent’s behavior
  • Task-specific instructions
  • Safety guidelines and constraints

MCP (Model Context Protocol) plugins and tools: 2,000-5,000+ tokens

  • Tool definitions for each MCP server
  • Function schemas and examples
  • Return value specifications
  • Multiple plugins stack up quickly

Conversation history: Variable, but grows fast

  • Your messages to the agent
  • Agent’s responses
  • Multi-turn back-and-forth
  • A 20-message conversation can easily hit 10,000+ tokens

Retrieved context: 10,000-50,000+ tokens

  • Document chunks pulled from vector databases
  • Code files for context
  • API documentation
  • Knowledge base articles

Working memory for long-running tasks: The killer use case

  • Agent exploring a codebase
  • Multi-step research tasks
  • Complex debugging sessions
  • Building features across multiple files

This last point is crucial: long context = agents can work independently for longer before needing a reset. If your context fills up after 20,000 tokens, your agent might need to restart after 10-15 tool calls. With significantly more tokens available (100K+ on systems with more memory), the agent can run through dozens of tool calls, maintain full context of what it’s tried, and make better decisions.
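Adding up numbers from the middle of those ranges shows how quickly the window fills; these are illustrative figures, not measurements.

WINDOW = 28_000
usage = {
    "system prompt + instructions": 1_500,
    "MCP tool definitions": 4_000,
    "conversation history": 10_000,
    "retrieved documents and code": 15_000,
}
used = sum(usage.values())    # 30,500 tokens
print(f"{used:,} of {WINDOW:,} tokens used"
      + (" -> window already exhausted" if used > WINDOW else ""))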

And if you want to run a larger model—say, a 70B quantized to 4-bit (~38GB)—you can’t even fit it on a 32GB card, let alone leave room for meaningful context.

The verdict: NVIDIA GPUs are excellent for established models and production workloads where you control the infrastructure. But consumer cards hit a hard ceiling when you want both large models AND extensive context.

What About NVIDIA’s DGX Spark?

NVIDIA’s Grace Blackwell DGX Spark has an interesting spec:

  • 128GB unified LPDDR5x memory
  • 20-core Arm processor
  • 1 petaflop AI performance
  • 10 GbE, ConnectX-7 Smart NIC
  • Price point between 3000 and 4000 Euro
  • new NVFP4 format

On paper, this is a game-changer for local LLM inference. The most interesting thing is the new NVFP4 format, allowing for more aggressive quantization. Research has shown that models can be trained to fit into this quantization with almost the same performance. As this is a hardware supported feature, it can’t be copied quickly from Apple or AMD.
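To give a feel for what such a format buys you, here is a simplified sketch of block-scaled 4-bit quantization, the general idea behind formats like NVFP4. This is not the actual NVFP4 encoding; it only illustrates why small per-block scales keep 4-bit quantization accurate.

import numpy as np

def quantize_4bit_blocks(weights: np.ndarray, block_size: int = 16):
    # One scale per small block, so a single outlier only hurts its own block.
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)  # 4-bit integer range
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_4bit_blocks(w)
print("max reconstruction error:", float(np.abs(dequantize(q, s) - w).max()))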

But the machine is far from widely available: it was announced a long time ago, and while we have now seen the first videos, so far the shops are mostly sold out.

Even if it becomes available, this is a subsidized offer from NVIDIA. They expect very high margins on their products, and by the nature of its price, the Spark cannot maintain typical NVIDIA margins. They want you to buy into this ecosystem and then, once it works locally, take the most convenient path of running it on NVIDIA hardware in the data center as well. This machine is a mechanism to lock you into their ecosystem.

Most of the things NVIDIA does are closed source: CUDA is closed source, the operating system running the DGX Spark (DGX OS 7) is closed source, and the list goes on. We’ve seen this play before, and we’re not eager to run into the trap again.

The Memory Wall Is Real

Here’s what’s become clear: local LLM inference is increasingly memory-bound, not compute-bound.

Modern MoE architectures like Qwen3-Next demonstrate this perfectly:

  • 80B total parameters, only 3B active per token
  • Inference speed is excellent—if you have the memory to hold it
  • Context window length matters as much as model size
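A quick sketch of the capacity-versus-bandwidth split for such a model; the numbers are rough and assume 4-bit quantization.

TOTAL_PARAMS_B = 80      # parameters you must hold in memory
ACTIVE_PARAMS_B = 3      # parameters actually used per token
BYTES_PER_PARAM = 0.5    # 4-bit quantization

print(f"Memory to hold the model: ~{TOTAL_PARAMS_B * BYTES_PER_PARAM:.0f} GB")     # ~40 GB
print(f"Weights streamed per token: ~{ACTIVE_PARAMS_B * BYTES_PER_PARAM:.1f} GB")  # ~1.5 GB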

The bottleneck isn’t “can my GPU compute fast enough?” It’s “can I fit the model AND enough context to do meaningful work?”

This is why 32GB cards are increasingly limiting. They work for smaller models or shorter contexts, but can’t handle the frontier of what’s possible with local inference.

This post is part of a three-part series. Next week: Why you shouldn’t buy into the Apple ecosystem.

Custom Slash Commands, Part 3: The Installer

Published:

In part one, custom slash commands automated repetitive tasks but gave inconsistent results. In part two, we found the fix: separate the conversational prompt from a deterministic script.

Now, the finale. After weeks of hardening this approach, I’ve distilled it into three patterns that transform scripts into powerful self-fixing installers. Here’s what I learned building the TreeOS production setup.

The Three Patterns

1. The Self-Fixing Loop

The old way: run the script, watch it fail, open the file, find line 52, guess a fix, save, run again. High-friction context switching.

The new way: I let Claude execute the script, it fails on an edge case, and it comes back to me with what happened and multiple options for how to fix it. Claude has the full context: the command, the code, and the failed output. It updates the script immediately. The script hardens with each real-world failure. This tight feedback loop is the fastest way to build robust automation. TreeOS will be open source; users can run the install script and contribute a pull request if they encounter an edge case. Can’t wait to see this in real life.

2. Soft Front Door + Hard Engine

Every installer consists of two side-by-side files:

  • Soft Front Door (.md): treeos-setup-production.md

  • Hard Engine (.sh): treeos-setup-production-noconfirm.sh

The treeos prefix separates my custom commands from others. The markdown contains the Claude Code prompt, the conversational layer. It explains what’s about to happen, checks prerequisites, and asks for confirmation. It’s flexible and human-friendly.

The shell script is the deterministic engine. It takes inputs and executes precise commands. No ambiguity, no improvisation, 100% repeatable.

This separation is crucial. Claude can safely modify the conversation in the front door without breaking the logic in the engine. The naming convention makes the relationship obvious.

3. The Graceful Handoff

Depending on the machine and trust level of the user, sometimes Claude Code has access to sudo, sometimes not. The pattern: check if sudo is available without a password prompt.


# -n makes sudo non-interactive: the check succeeds only if sudo works without a password prompt
sudo -n true 2>/dev/null && echo "SUDO_AVAILABLE" || echo "SUDO_REQUIRED"

If sudo requires a password, the front door hands off cleanly:


⚠️ This script requires sudo privileges.

Claude Code cannot provide passwords for security reasons.

I've prepared everything. Run this one command:

cd ~/repositories/ontree/treeos

sudo ./.claude/commands/treeos-setup-production-noconfirm.sh

Paste the output back here, and I'll verify success.

Claude does 95% of the work, then asks me to handle the one step it can’t. Perfect collaboration.

The Real-World Result

These three patterns and a lot of iterations produced my TreeOS production installer. It’s now 600+ lines and handles:

  • OS detection (Linux/macOS) and architecture

  • Downloading the correct binary from GitHub releases

  • Creating system users with proper permissions

  • Optional AMD ROCm installation if a fitting GPU is detected

  • Service setup (systemd/launchd) and verification

When something breaks on a new platform, the self-fixing loop makes improvements trivial. I’ve hardened this across dozens of edge cases without dreading the work.

Why This Changes Everything

Traditional README files demand a lot. They push the cognitive load onto the user: identify your platform, map generic instructions to your setup, debug when it breaks.

This flips the script. Instead of static documentation describing a process, we have executable automation that performs it.

But this isn’t just about installers. Apply these patterns to any complex developer task:

  • /setup-dev-environment clones repos, installs tools, and seeds databases

  • /run-migration backs up production, runs the migration, and rolls back on failure

  • /deploy-staging builds containers, pushes to registries, and updates Kubernetes

We’re moving from documentation that describes to automation that executes, with AI as the safety net and co-pilot. This is the future of developer experience: reducing friction by automating complex workflows around code.

With the explosion of AI tools, setup complexity is a real barrier. These patterns are one step towards changing that.