This is fantastic. I haven't got any local inference as I can't afford it right now, but tool calling has been a concern for me with these smaller models through OpenRouter.
I've been working on a pytest-first acceptance testing framework called Dokimasia (do-kee-ma-see-ah) that I'd love to get your thoughts on: https://github.com/deevus/dokimasia
Acceptance testing might not be what you need for Forge, but since you're deep in AI tool building I thought you may have opinions.
Oh, interesting idea. Formalizing an abstraction layer for testing all the integration types out there in the AI ether, essentially? MCP, skills, etc.
I think this sits a level higher than Forge - maybe testing the workflow proper and integration points that it might surface (if some tools are giving access to an MCP or something).
Could likely layer both together without much trouble.
Only thing I'd be curious about is how you handle the non-deterministic nature of these models. Sometimes they get the tool call right, sometimes they barf bad JSON. Does the suite run multiple trials?
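To make the multiple-trials question concrete, here is a rough sketch of what I mean: run the same prompt N times and assert on the pass rate instead of a single result. None of this is Dokimasia's or Forge's API. `is_valid_call`, `pass_rate`, `scripted_model`, and the `{"tool": ..., "args": ...}` shape are all made up for illustration, and the scripted model stands in for a real call through OpenRouter or similar.

```python
import itertools
import json


def is_valid_call(raw, expected_tool):
    """True if raw parses as JSON and names the expected tool with dict args."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and call.get("tool") == expected_tool
        and isinstance(call.get("args"), dict)
    )


def pass_rate(call_model, expected_tool, trials):
    """Call the model `trials` times and return the fraction of valid tool calls."""
    passed = sum(is_valid_call(call_model(), expected_tool) for _ in range(trials))
    return passed / trials


def scripted_model():
    """Hypothetical stand-in for a real model: three good replies, then one truncated."""
    good = '{"tool": "search", "args": {"query": "zig"}}'
    bad = '{"tool": "search", "args": '  # malformed: cut off mid-object
    responses = itertools.cycle([good, good, good, bad])
    return lambda: next(responses)


def test_search_tool_call_is_reliable_enough():
    # Assert on a threshold rather than on any single trial.
    rate = pass_rate(scripted_model(), expected_tool="search", trials=20)
    assert rate >= 0.7
```

The threshold and trial count are arbitrary here. In practice you'd tune them per model and accept that more trials means more tokens spent.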
I've been using DeepSeek V4 a lot in the last week and I am very happy with it. If you have a really gnarly bug, you might need a SOTA model like Opus. For most things it is very good, and it costs significantly less (even without the discount).
I've been using it as part of a complex DOS game decompilation project[0]. I'm working on refactoring the software rendering pipeline so that we can add GPU rendering. The hardest part of this so far is converting the '90s polygon rendering from screen space to world space.
It spun its wheels a few times on a large, mostly mechanical change. After resetting and improving my prompts, it was able to get through it. I'm using Matt Pocock's skills[1] for this work, which has been quite nice.
I run a solo consultancy out of Wollongong. Recent work: built a zero-trust key management system in Zig with a Python middleware layer for Byterix Labs (Singapore, scam prevention), spent twelve months on SwitchDin's platform team optimising Python ETL and Django (Newcastle, distributed energy resources), and ship a cross-platform Rust + React desktop disk cleaner called Reclaimr. Before that, eight years at Bonjoro: first senior engineering hire through to CTO, running a team of six on AWS Lambda + MySQL Aurora + Redis. Co-founder of gdzig (Zig bindings for Godot, 180 stars) and co-founding maintainer of Scoop (24k+ stars). Two front-page HN posts this year on Apple Silicon vs Framework laptops and applying Andrew Kelley's "Programming Without Pointers" technique to a real Zig project.
I’ve been working on a sandboxing tool that uses Incus. Originally it was only to run LLMs inside a sandbox, but recently I added MCP so that an agent could spin one up and do work that way.
It currently only exposes a rudimentary set of tools which I’d like to expand. The sandboxes created by MCP are generally ephemeral. The daemon will clean them up after an hour of no usage.
But it’s so cool that they get their own IP and you can SSH straight in. I can see that being very useful when you want to share with a colleague and then close your laptop (assuming it’s running on a remote instance).
It supports running on a TrueNAS SCALE server, or via Incus (local or remote). I'm still working on tightening the security posture, but for many types of AI workflows it will be more than sufficient.
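For anyone curious, the idle cleanup mentioned above boils down to something like the sketch below. This is not the tool's actual code: the `Sandbox` record, `idle_sandboxes`, and the way last-use time is tracked are assumptions made for illustration. Only the one-hour window comes from what I described.

```python
from dataclasses import dataclass

IDLE_TTL_SECONDS = 60 * 60  # one hour of no usage


@dataclass
class Sandbox:
    name: str
    last_used: float  # Unix timestamp of the most recent activity (assumed field)


def idle_sandboxes(sandboxes, now, ttl=IDLE_TTL_SECONDS):
    """Return the names of sandboxes that have been idle for at least `ttl` seconds."""
    return [s.name for s in sandboxes if now - s.last_used >= ttl]
```

A daemon loop would call this periodically with the current time and delete whatever comes back.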
Location: Wollongong, Australia (UTC+10/+11)
Remote: Yes
Willing to relocate: No
Technologies: Zig, C, Terraform, Docker, Linux, Go, Python, Laravel, Node, React, TypeScript, Flutter/iOS, Godot, PostgreSQL, AWS
Resume/CV: https://github.com/deevus
Email: See GitHub profile
LinkedIn: https://linkedin.com/in/deevus
Solo consultant (Simon Hartcher Services) and former startup CTO. Co-founding maintainer of Scoop (23k+ stars). True full-stack developer comfortable at every level of the stack.
Looking for: contracts (short or long-term) or full-time remote. Interested in infrastructure tooling, dev tools, systems programming, or anything Zig-adjacent. 4-day workweek is a plus.