Tell HN: I cut Claude API costs from $70/month to pennies

LTL_FTC · 2026-01-26T07:03:37 1769411017

It sounds like you don’t need immediate llm responses and can batch process your data nightly? Have you considered running a local llm? May not need to pay for api calls. Today’s local models are quite good. I started off with cpu and even that was fine for my pipelines.

kreetx · 2026-01-26T10:20:51 1769422851

Though haven't done any extensive testing then I personally could easily get by with current local models. The only reason I don't is that the hosted ones all have free tiers.

queenkjuul · 2026-01-26T08:44:17 1769417057

Agreed, I'm pretty amazed at what I'm able to do locally just with an AMD 6700XT and 32GB of RAM. It's slow, but if you've got all night...

ok_orco · 2026-01-26T18:16:02 1769451362

I haven't thought about that, but really want to dig in more now. Any places you recommend starting?

LTL_FTC · 2026-01-28T17:13:26 1769620406

I started off using gpt-oss-120b on cpu. It uses about 60-65gb of memory or so but my workstation has 128gb of ram. If I had less ram, I would start off with the gpt-oss-20b model and go from there. Look for MoE models as they are more efficient to run.

My old threadripper pro was seeing about 15tps, which was quite acceptable for the background tasks I was running.

ydu1a2fovb · 2026-01-26T13:15:24 1769433324

Can you suggest any good llms for cpu?

LTL_FTC · 2026-01-28T17:11:15 1769620275

I started off using gpt-oss-120b on cpu. It uses about 60-65gb of memory or so but my workstation has 128gb of ram. If I had less ram, I would start off with the gpt-oss-20b model and go from there. Look for MoE models as they are more efficient to run.

R_D_Olivaw · 2026-01-26T15:41:08 1769442068

Following.

LTL_FTC · 2026-01-28T17:14:00 1769620440

I started off using gpt-oss-120b on cpu. It uses about 60-65gb of memory or so but my workstation has 128gb of ram. If I had less ram, I would start off with the gpt-oss-20b model and go from there. Look for MoE models as they are more efficient to run.

44za12 · 2026-01-26T07:03:53 1769411033

This is the way. I actually mapped out the decision tree for this exact process and more here:

https://github.com/NehmeAILabs/llm-sanity-checks

homeonthemtn · 2026-01-26T12:45:24 1769431524

That's interesting. Is there any kind of mapping to these respective models somewhere?

44za12 · 2026-01-26T13:24:03 1769433843

Yes, I included a 'Model Selection Cheat Sheet' in the README (scroll down a bit).

I map them by task type:

Tiny (<3B): Gemma 3 1B (could try 4B as well), Phi-4-mini (Good for classification). Small (8B-17B): Qwen 3 8B, Llama 4 Scout (Good for RAG/Extraction). Frontier: GPT-5, Llama 4 Maverick, GLM, Kimi

Is that what you meant?

gandalfar · 2026-01-26T06:50:10 1769410210

Consider using z.ai as model provider to further lower your costs.

DANmode · 2026-01-26T07:12:37 1769411557

Do they or any other providers offer any improvements on the often-chronicled variability of quality/effort from the major two services e.g. during peak hours?

tehlike · 2026-01-26T07:08:09 1769411289

This is what i was going to suggest too.

viraptor · 2026-01-26T07:23:30 1769412210

Or minimax - m2.1 release didn't make a big splash in the news, but it's really capable.

ok_orco · 2026-01-26T18:16:14 1769451374

Will take a look!

deepsummer · 2026-01-26T08:38:31 1769416711

As much as I like the Claude models, they are expensive. I wouldn't use them to process large volumes of data. Gemini 2.5 Flash-Lite is $0.10 per million tokens. Grok 4.1 Fast is really good and only $0.20. They will work just as well for most simple tasks.

DeathArrow · 2026-01-26T07:47:13 1769413633

You also can try to use cheaper models like GLM, Deepseek, Qwen,at least partially.

joshribakoff · 2026-01-26T07:19:12 1769411952

Have you looked into https://maartengr.github.io/BERTopic/index.html ?

toxic72 · 2026-01-27T06:19:01 1769494741

consider this for addtl cost savings if local doesnt interest you - https://docs.cloud.google.com/vertex-ai/generative-ai/docs/m...

dezgeg · 2026-01-26T06:39:15 1769409555

Are you also adding the proper prompt cache control attributes? I think Anthropic API still doesn't do it automatically

ok_orco · 2026-01-28T06:05:40 1769580340

No I need to look into this!

arthurcolle · 2026-01-26T00:37:11 1769387831

Can you discuss a bit more of the architecture?

ok_orco · 2026-01-26T00:55:02 1769388902

Pretty straightforward. Sources dump into a queue throughout the day, regex filters the obvious junk ("lol", "thanks", bot messages never hit the LLM), then everything gets batched overnight through Anthropic's Batch API for classification. Feedback gets clustered against existing pain points or creates new ones.

Most of the cost savings came from not sending stuff to the LLM that didn't need to go there, plus the batch API is half the price of real-time calls.