My AI Workflow Journey: From Claude Pro to a Multi-Model Stack

I've been meaning to write this for a while. Not because it's a tutorial or because I have some groundbreaking take — but because I went through a genuinely messy, trial-and-error process figuring out how to use AI effectively in my day-to-day work, and I think the messy version is more useful to read than the polished "here's my perfect setup" version.

So here's the honest story.

It started with Claude Pro

My first real AI subscription was Claude Pro. $20/month. And honestly? The quality blew me away immediately. The reasoning felt different from what I'd tried before — more thoughtful, less pattern-matchy. I remember the first time I threw a genuinely tricky architecture problem at it and it actually helped me think through tradeoffs instead of just throwing code at me.

But I hit two walls pretty fast.

The first one was rate limits. I'm the kind of person who works in bursts — I'll spend an entire afternoon deep in a problem, asking question after question, iterating constantly. Claude Pro's usage limits weren't built for that kind of intensity. I'd burn through most of my allowance in a week and spend the rest of the month being stingy with prompts, which completely kills the flow.

The second wall was something I call the spinning problem. When I got stuck on a hard bug and kept pushing Claude to fix it, it would sometimes just... keep rewriting large chunks of code without actually finding the root cause. Like watching someone repaint a wall instead of fixing the crack behind it. It was frustrating because the model was clearly capable — it just needed better direction from me, which I didn't always have the patience for in the middle of a debugging session.

In hindsight, part of the problem was me using Opus for literally everything. That was overkill. But I didn't know better yet.

Then I tried Codex for a while

After bumping against Claude's limits one too many times, I switched to OpenAI's Codex for a stretch. It had a genuinely different feel for code — tracing logic and finding bugs felt natural, almost like it understood the shape of a problem differently.

But I ran into the exact same issue. Quota limits. Different company, same ceiling. I'd hit the wall mid-debugging session, mid-feature, mid-thought — always at the worst possible moment. It started to feel like no matter which tool I picked, I was being metered at the exact moment I needed headroom.

I needed to think differently about this.

Finding the Chinese models through nano-gpt

The question I started asking wasn't "which model is best?" — it was "which model lets me work without watching a timer?" That reframe led me somewhere unexpected.

I found nano-gpt while looking for alternatives, and through it I discovered two models I'd honestly never heard much about: GLM-5 and Minimax 2.7. The pricing stopped me cold: 60 million tokens per week for $8/month. That's not a typo. I double-checked.
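Part of what made switching painless is that nano-gpt, like most aggregators, speaks an OpenAI-compatible chat API, so trying a new model is mostly a matter of changing one string. A minimal sketch of the request shape, using only the standard library; the endpoint URL and model identifier here are assumptions on my part, so check the provider's docs:

```python
import json

# Assumption: nano-gpt exposes an OpenAI-compatible chat completions
# endpoint. The URL and model name below are placeholders, not verified.
BASE_URL = "https://nano-gpt.com/api/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 1024) -> str:
    """Serialize an OpenAI-style chat completion body for any model name."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = build_chat_request("glm-5", "Trace this bug for me.")
```

The payload shape stays identical whether the string is a GLM, Minimax, or anything else the aggregator carries, which is exactly what makes model-hopping cheap.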

I went in skeptical. I came out genuinely impressed — though with caveats.

GLM-5

GLM-5 surprised me the most. I expected a model that kind of worked. What I got was a model that handled frontend, backend, and agentic tool use with real competence. It wasn't perfect, but it was the kind of capable-enough that actually moves work forward.

The killer feature was having room to breathe. I could iterate without rationing prompts. I could go back and forth on a problem until it was actually solved. That changes how you work more than any single quality improvement would.

But there were two real problems.

The first was stability. GLM-5 timed out. A lot. Disconnected mid-session more times than I can count. When you're deep in a multi-step agentic task and the model drops, you lose more than just the connection — you lose context, momentum, sometimes work. It was genuinely disruptive. This isn't just a GLM issue — it's a provider issue. Whether you're accessing it through z.ai or through a third party reselling the API, the infrastructure instability follows you.

The second was the context window. 80,000 tokens sounds like a lot until you're working on a real codebase with a long session history. The compaction kicks in constantly. For smaller, well-scoped tasks it's fine. But for anything involving a large codebase or a long agentic run — it's a real ceiling. You feel it.

Minimax 2.7

Minimax is a different beast entirely. It's not trying to be autonomous or creative. It works best when you give it clear, well-scoped instructions and let it execute. Frontend work was a genuine weak spot — don't reach for it when you're building UI. But for backend tasks and well-defined commands, it was rock solid.

What Minimax had that GLM-5 lacked was reliability. It stayed connected. It finished what it started. There's something underrated about that — a model that consistently does what you ask, even if it doesn't always do it brilliantly, is often more valuable day-to-day than a brilliant model that drops unpredictably.

Where each model actually stands

After spending real time with all of these, here's my honest read:

Claude — still the best I've used for frontend, backend, DevOps, and tooling work. The reasoning quality and code output are genuinely ahead. The problem is that it's far more expensive than the alternatives. If money weren't a factor, I'd probably just use Claude for everything.

GLM-5 — a very good competitor to Claude. Capable across most tasks, handles agentic work well, and the token generosity makes it practical to actually use at scale. But the tight context window means you're constantly fighting compaction on larger tasks, and the provider instability — whether through z.ai or third-party API resellers — is a real day-to-day problem, not a minor footnote.

Minimax 2.7 — best for small, clearly-defined tasks. Don't use it for frontend or UI work. Seriously. But for well-scoped backend tasks or scripting jobs where you just need something reliable to execute instructions, it delivers. It's the workhorse you reach for when you know exactly what you want done.

How I actually evaluate models now

After all this trial and error, I've landed on three things I actually care about when judging a model:

  1. Benchmark standing — I use artificialintelligence.ai as an external reference point. Not perfect, but useful for sanity-checking my own impressions.

  2. Agentic capability — Can it use tools? Drive GitHub from the CLI, run terminal commands, call APIs, work competently with other software? A model that can only chat has a hard ceiling on what it can actually do for me. This is non-negotiable now.

  3. Context window and how it handles degradation — This one is underrated. A huge context window means nothing if the model falls apart when it gets close to the limit. I've found that graceful compaction matters more than raw size. I track context window sizes across models on OpenRouter — it's the best place I've found for quick comparisons.
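To make the third point concrete, the budget check behind compaction is simple to sketch. This is an illustrative toy, assuming a rough ~4-characters-per-token estimate rather than a real tokenizer, and real tools usually summarize old messages instead of dropping them:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token. Not a real tokenizer."""
    return max(1, len(text) // 4)

def compact(messages: list[str], budget: int) -> list[str]:
    """Drop the oldest messages until the estimated total fits the budget.

    Production tools summarize rather than discard, but the trigger
    condition (estimated usage exceeding the window) is the same idea.
    """
    kept = list(messages)
    while len(kept) > 1 and sum(estimate_tokens(m) for m in kept) > budget:
        kept.pop(0)  # oldest first
    return kept

history = ["old question " * 50, "old answer " * 50, "current task"]
trimmed = compact(history, budget=60)
```

The point of the criterion is what happens at that `pop`: a model whose quality collapses when the window fills is worse in practice than a smaller-window model that degrades gracefully.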

Where things stand today

After all of this, I've stopped looking for one model to do everything. That was the wrong frame from the start. My current setup is intentionally split — and how I access each model matters too:

| Model | What I use it for | How I access it |
| --- | --- | --- |
| Claude Opus 4.6 | Hard problems, deep reasoning, complex architecture | Claude Code |
| Claude Sonnet 4.6 | Planning, design, high-level thinking | Claude Code |
| GLM-5 | Implementation — frontend and backend | OpenCode |
| Minimax 2.7 | Simple, clearly-defined tasks | OpenCode |

I use Opus and Sonnet through Claude Code, which gives me a proper agentic environment with tool use, file access, and terminal integration. GLM-5 and Minimax I run through OpenCode — same idea, different models. The split isn't just about cost — it's about each model being in the environment where it works best.
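If you squint, the whole split is just a lookup table. A sketch of how I think about the routing; the task labels and model identifiers are my own names, not anything the tools define:

```python
# Illustrative routing table: task kind -> (model, environment).
# All identifiers below are my own labels, not official API names.
ROUTES = {
    "deep-reasoning": ("claude-opus-4.6", "Claude Code"),
    "planning":       ("claude-sonnet-4.6", "Claude Code"),
    "implementation": ("glm-5", "OpenCode"),
    "simple-task":    ("minimax-2.7", "OpenCode"),
}

def route(task_kind: str) -> tuple[str, str]:
    """Pick (model, environment) for a task, defaulting to the workhorse."""
    return ROUTES.get(task_kind, ROUTES["simple-task"])
```

The default matters: when I'm not sure a task needs deep reasoning, it goes to the cheap reliable option first, and only escalates if that fails.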

It's not elegant. It requires context-switching. But it works, because each model is doing the thing it's actually good at.

What I think happens next

My honest read is that Chinese models — GLM, Minimax, and whatever comes next from that ecosystem — are going to become a much bigger part of developer stacks in the next year or two. The cost-to-capability ratio is already competitive, and the gap is closing fast.

But two things need to happen before I'd fully commit to them as my primary tools:

Stability has to improve. The GLM-5 timeout issue isn't a minor inconvenience — it's a workflow killer. When you're running multi-step tasks that take 20 minutes to set up and the model drops at minute 18, that's not something you work around. That gets fixed at the infrastructure level or it doesn't get fixed.

Context windows need to grow. 80K tokens felt tight for the kind of work I do — large codebases, long sessions, lots of accumulated context. The Western frontier models have pulled ahead here. Until that gap closes, there's a ceiling on how much weight I can put on the Chinese models for serious engineering work.

If those two things happen — and I think they will — the economics make the shift almost inevitable. $8/month for the kind of throughput I was getting is hard to argue against, even if it's not quite there yet.

For now, I'm running the hybrid stack. It's working. And I'll probably write another one of these in six months when everything has changed again.

What I'm actually doing next

I've decided to push things further. The next step is going all-in on GLM-5 and Minimax — dropping the Claude fallback for a period and seeing what breaks. Not as a cost-cutting exercise, but as a genuine stress test. You don't really know a model's limits until you're leaning on it completely with no safety net.

I also want to test Kimi K2.5 properly. I've seen it come up repeatedly in discussions around complex reasoning tasks, and I haven't given it a real shot yet. That's the kind of thing I want to test systematically rather than just vibing with it for an afternoon and forming an opinion.

The bigger thing though is that most of my AI usage so far has been pretty narrow. Webapps. Simple scripts. The occasional architecture discussion. That's not a representative sample of what these models can do — or where they fall apart.

So I'm deliberately expanding the test surface:

  • Cybersecurity tasks — vulnerability analysis, writing proof-of-concepts, reverse engineering assistance, CTF problems. This is a domain where reasoning quality and precision really matter, and I'm curious how the Chinese models hold up under that pressure.
  • AI/ML tasks — not just using AI to write code, but using it to reason about models, architectures, training dynamics. Meta, but useful.
  • Longer agentic runs — multi-step tasks with real tool use, not just one-shot prompts. This is where stability issues show up most brutally.
  • Arabic-language work — I switch between Arabic and English constantly, and I've never properly tested how well GLM and Minimax handle technical discussions in Arabic.

I'll write up what I find. The goal isn't to crown a winner — it's to build an honest map of where each model is actually useful, and where it quietly fails you.

The longer-term ambition is to take the best of what I learn and build my own suite. Not a single model, but a curated stack of Chinese models — the latest versions of GLM, Minimax, Kimi, and whatever else proves itself — each assigned to the tasks it genuinely excels at. A setup that covers all my use cases without leaning on Western models at all. That's the experiment. I want to know if it's possible to build something that good, that affordable, and that reliable entirely from this ecosystem. I genuinely don't know the answer yet. That's why it's worth doing.