The chat box flattens design into a transaction: write a specification, receive a result, write another. It works for text. It struggles with the part of design that is spatial and felt, where you push something, see what it does, and adjust. A sentence is a narrow channel for intent that is really about shape, proportion, and how a building sits on its land.
So the question I keep returning to is what a creative tool for designing with AI should feel like once you stop typing at it. Language still has a place. The aim is to stop treating it as the only handle on the work. Several interface shapes could do that, each keeping a person in the loop rather than waiting at the end of it.
Sketch as intent
A drawing carries footprint, height, roof and orientation at once, so the model reads shape directly, without a paragraph standing in for it.
Push and react
Change the form and watch it answer back, so the loop stays spatial and close to sketching against a site.
One move per surface
The model joins each step on its own screen, one move at a time, so a single prompt never has to carry everything.
The flow has three moves. You sculpt the massing by hand with a few brushes: drawing a footprint, painting height, carving into the slope. When you are ready, the interface quietly captures several angles of what you made and hands them over. The result then takes over the screen as a 3D model you can move around, with everything else hidden except a way back. The controls stay deliberately minimal, so attention sits on shaping and reacting, not on filling in fields.
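The "several angles" capture step can be pictured as a ring of cameras placed around the massing, all looking at its centre. What follows is a minimal sketch of that geometry only. The function name, the Y-up convention, and the radius and elevation parameters are assumptions for illustration, not the project's actual capture code.

```python
import math


def orbit_camera_positions(count, radius, elevation_deg, target=(0.0, 0.0, 0.0)):
    """Return `count` camera positions evenly spaced on a ring around `target`.

    Assumes a Y-up scene. Every camera sits `radius` away from `target`
    and is meant to look straight at it.
    """
    if count < 1:
        raise ValueError("count must be at least 1")

    elevation = math.radians(elevation_deg)
    ring_radius = radius * math.cos(elevation)  # horizontal distance from target
    height = radius * math.sin(elevation)       # height above target

    tx, ty, tz = target
    positions = []
    for i in range(count):
        azimuth = 2.0 * math.pi * i / count     # even spacing around the ring
        positions.append((
            tx + ring_radius * math.cos(azimuth),
            ty + height,
            tz + ring_radius * math.sin(azimuth),
        ))
    return positions
```

Calling `orbit_camera_positions(4, 10.0, 30.0)` gives four viewpoints a quarter turn apart, each slightly above the form. A renderer would then take one screenshot per position.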
This sits inside a field with a name: mixed-initiative co-creativity. The framing from the human-computer interaction research is that a creative tool should put person and machine in a tight loop, where each one suggests, produces, evaluates, modifies and selects in response to the other. That loop is what makes the work feel co-authored.
The more useful finding for a builder is where these tools fail. Recent work points out that the main barrier to adoption is not the model's capability but cognitive friction: people quietly abandon powerful AI assistants when the interface makes them hold too much in their head or configure too much before they see anything. A parallel line of design research frames AI as a co-creator and a design material, something you shape with, not just a service you call.
The barrier is rarely the model. It is the friction of telling it what you mean. A sketch is a low-friction way to hand over intent, so the interface should let you shape, not specify.
That reframed the whole experiment. Sketching the massing gives the model a lot of context (footprint, height, roof and how the building sits) without typing any of it. The staged screens exist to keep the loop legible: shape, react, shape again, with as little to manage at once as possible.
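One way to keep staged screens legible in code is a small state machine with a single allowed move per stage. This is a sketch under assumptions: the stage names and event names (`ready`, `result`, `back`) are hypothetical, chosen to mirror the sculpt, capture and view steps, and are not taken from the project.

```python
from enum import Enum


class Stage(Enum):
    SCULPT = "sculpt"    # shaping the massing by hand
    CAPTURE = "capture"  # angle shots taken and handed to the models
    VIEW = "view"        # the 3D result fills the screen


# Hypothetical events: each stage has exactly one way forward,
# and the viewer's only exit is the way back to sculpting.
TRANSITIONS = {
    (Stage.SCULPT, "ready"): Stage.CAPTURE,
    (Stage.CAPTURE, "result"): Stage.VIEW,
    (Stage.VIEW, "back"): Stage.SCULPT,
}


def next_stage(stage, event):
    """Return the stage that follows `event`, or stay put if it is not valid here."""
    return TRANSITIONS.get((stage, event), stage)
```

Ignoring invalid events, instead of raising, keeps the interface forgiving: a stray click in the viewer cannot drop someone into a half-finished state.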
Two models do the work, each kept to one job. The first is Google's Gemini image model, used in edit mode: it takes the angle shots of the massing as reference images and returns a single concept render that follows the form. The second is TRELLIS, Microsoft's image-to-3D model, which lifts that render into a mesh I load straight into the viewer. Neither is asked to do the whole thing. The massing comes from the person, the look from the image model, the geometry from the 3D model.
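The division of labour can be sketched as two injected steps, so each model stays behind a narrow function boundary. This is an illustration under assumptions: `render_concept` and `lift_to_mesh` are hypothetical callables standing in for whatever client code talks to the image model and the image-to-3D model. No real Gemini or TRELLIS API is shown here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PipelineResult:
    render: bytes  # concept render returned by the image model
    mesh: bytes    # mesh file bytes returned by the image-to-3D model


def run_pipeline(
    angle_shots: Sequence[bytes],
    brief: str,
    render_concept: Callable[[Sequence[bytes], str], bytes],
    lift_to_mesh: Callable[[bytes], bytes],
) -> PipelineResult:
    """Run the two-model flow: angle shots -> one render -> one mesh.

    Both callables are hypothetical stand-ins supplied by the caller.
    """
    if not angle_shots:
        raise ValueError("need at least one angle shot of the massing")

    # Step 1: the image model reads every angle shot and returns one render.
    render = render_concept(angle_shots, brief)

    # Step 2: the 3D model sees only that render, never the original sketch.
    mesh = lift_to_mesh(render)

    return PipelineResult(render=render, mesh=mesh)
```

Passing the two steps in as functions makes it easy to swap either model, or to replace both with stubs while working on the interface.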
Image-to-3D moved fast over the last two years, and the choice of model is now a real trade-off between fidelity, clean geometry, cost and whether you can run it yourself. The honest state of it in 2026:
| Model | Source | Best at | Notes |
|---|---|---|---|
| TRELLIS (in use here) | Open, Microsoft | Visual fidelity, PBR materials | TRELLIS.2 is this year's baseline; runs locally on one high-end consumer GPU; free |
| Hunyuan3D | Open, Tencent | High-fidelity geometry and texture | Strongest open alternative; self-hostable for researchers |
| Tripo | Hosted | Clean, game-ready topology | Fast; tidy edge flow and low polygon counts, little retopology needed |
| Rodin (Hyper3D) | Hosted | Top-end quality, 4K textures | ~10B params, quad topology, quality tiers; the premium option |
| Meshy | Hosted | All-round, 3D printing | Friendly, credit-based; good general default |
Under those products sit a few different techniques, worth knowing because they fail in different ways:
Mesh diffusion
A single image becomes a watertight mesh. This is what TRELLIS, Hunyuan3D, Tripo and Rodin do. Best for clean, single objects.
Gaussian splatting
The scene is millions of fuzzy points. Photoreal and great for captured scenes, but not tidy, editable geometry.
Multi-view diffusion
Generate several consistent views first, then reconstruct. Cuts the two-faced "Janus" errors single-image methods make.
For this experiment I want a clean single object I can drop into a viewer, so a mesh-diffusion model like TRELLIS fits. If the goal were a real captured site rather than a generated house, splatting or photogrammetry would be the lineage to reach for instead.
Most of the work is in the prompt, and it took a few passes. The first version simply said "turn this sketch into a house", which gave generic results that ignored what I had drawn. The version running now reads as an architect's brief: it tells the model the images are several angles of the same massing, asks it to read them together for footprint, height, roof and orientation, and then asks it to design one coherent concept home with named materials and a covered deck on its land.
A few lines exist only to help the next step. They ask for a plain white background, the whole building centred, bright even light and no sky, because an isolated, evenly lit subject lifts into a much cleaner 3D model. Earlier results came out dark, which turned out to be the viewer rather than the image, so that was a lighting fix in the 3D scene as much as a prompt change. The prompt is hidden in the interface now, doing the heavy lifting out of sight.
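The shape of such a brief can be sketched as a small builder that keeps the design instructions apart from the lines that exist only for the 3D step. The wording below is illustrative and assumed, not the prompt the project actually uses. The function name and parameters are hypothetical too.

```python
def build_brief(view_count, materials, include_deck=True):
    """Assemble an architect-style brief for the image model.

    All wording here is an illustrative assumption, not the real prompt.
    """
    if view_count < 1:
        raise ValueError("view_count must be at least 1")

    # Design intent: how to read the angle shots and what to produce.
    lines = [
        f"You are given {view_count} images showing several angles of the same building massing.",
        "Read them together to infer footprint, height, roof and orientation.",
        "Design one coherent concept home that follows that form on its land.",
    ]
    if materials:
        lines.append("Use these materials: " + ", ".join(materials) + ".")
    if include_deck:
        lines.append("Include a covered deck.")

    # Lines that exist only to help the image-to-3D step.
    lines += [
        "Plain white background, no sky.",
        "Keep the whole building centred and fully in frame.",
        "Use bright, even lighting.",
    ]
    return "\n".join(lines)
```

Keeping the 3D-oriented lines in their own group makes them easy to adjust if the image-to-3D model changes, without touching the design intent.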
Feeding the sketch in as the reference is what makes the result feel like yours. The house lands where you put it, at roughly the shape you drew, instead of arriving from nowhere. Capturing several angles rather than one gave the image model enough to reason about the form. And the staged interface, sketch first then model, kept each part of the process clear instead of crowding everything onto one screen.
The 3D is a visual mesh, not measured geometry: no real walls, rooms or dimensions, and the ground under it is generated, not a real site. Fidelity to the sketch depends on how clearly the massing reads, and the standard image model only follows it so far. It is good for feeling a house on its land and turning it over, not for anything that needs to be accurate.
The open question is which interface helps the creative process most: sketching like this, or something I have not tried yet. The follow-on questions are whether the sketch can carry more intent into the result, and whether the form can sit on real terrain while staying honest about what is concept and what is measured.
- Mixed-Initiative Creative Interfaces. Deterding et al. · CHI 2017
- Boosting Mixed-Initiative Co-Creativity in Game Design. ACM Computing Surveys
- AI as a co-creator and a design material. Design Studies, 2025
- Best 3D model generation APIs in 2026. 3DAI Studio
- TRELLIS.2: native and compact structured latents for 3D generation. Microsoft Research · GitHub