The Claude API Endpoint That Tells You What a Request Will Cost Before You Send It

Most Claude API cost surprises are not surprises: count_tokens lets you measure a request before you send it, and the full usage object tells you exactly what it cost after.

Rick Hightower

Almost nobody calls count_tokens. The teams that do stop being surprised by their Anthropic bill. Here is the small, mechanical habit that turns cost from something that happens to you into something you decide.

In this article: You will learn how to use the Claude API's count_tokens endpoint to measure a request's input size before any tokens are billed, how to read every field of the usage object that comes back on a real call, why caching means your input total lives in three fields and not one, and how to wire a pre-flight check into a real service so an oversized prompt never sneaks past you again.

Most cost surprises on the Claude API are not really surprises. They are things you could have seen coming and chose, by default, not to look at. A support thread balloons to forty messages and you send the whole history on every turn. A user pastes a giant document into the prompt. A conversation quietly grows past the point where it is economical to keep resending. In every one of those cases, the information you needed, the exact size of the request, was available before you ever hit send. You just did not ask.

This article is about how to start asking. The Claude API has a small, under-used endpoint called count_tokens that returns the input token count for any request, without generating anything and without incurring any output-token charge. Paired with a careful reading of the usage object after the fact, it gives you a complete picture: what a request will cost before you commit, and what it actually cost once it returns. That is the whole basis of running an LLM workload on a budget instead of on hope.

It is a small chapter in a long series, but the habit it builds separates a service you can forecast from one that bills you by ambush.

What count_tokens actually does

The endpoint is POST /v1/messages/count_tokens, and the official Python SDK exposes it as client.messages.count_tokens(...). You hand it the same messages, system, and tools you would send to a real messages.create call. It returns the number of input tokens that request would consume, including the tokens contributed by tools, images, and documents, and it does so without creating a message. No generation happens, so there is no output, and no output-token charge.

import anthropic

client = anthropic.Anthropic()

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Where is my order 48810?"}],
)

print(count.input_tokens)   # e.g. 14

The response is a small object with a single field, input_tokens: the total number of tokens across the provided messages, system prompt, and tools. That number is your input size for that exact request. Because you pass the full request shape, the count reflects everything. A long system prompt, a fat set of tool definitions, and an attached document all fold into the total, so what you measure closely tracks what you would be billed for on the input side. Anthropic's documentation describes the count as an estimate, so the billed figure can differ by a small amount.
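If your service does not use the SDK, the same call is an ordinary HTTP POST. The sketch below builds that request with the standard library only. The helper names build_count_request and count_input_tokens are made up for illustration, and the header set (x-api-key, anthropic-version, content-type) follows the pattern Anthropic documents for the Messages API, so check the version string against the current docs before relying on it.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.anthropic.com/v1/messages/count_tokens"


def build_count_request(
    api_key: str,
    model: str,
    messages: list,
    system: Optional[str] = None,
    tools: Optional[list] = None,
) -> urllib.request.Request:
    """Build (but do not send) a count_tokens request. Hypothetical helper."""
    body = {"model": model, "messages": messages}
    if system is not None:
        body["system"] = system
    if tools:
        body["tools"] = tools
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",  # assumed current; verify in the docs
            "content-type": "application/json",
        },
        method="POST",
    )


def count_input_tokens(request: urllib.request.Request) -> int:
    """Send the request and return the single input_tokens field."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["input_tokens"]
```

Splitting the build step from the send step keeps the request shape easy to inspect and to unit-test without touching the network.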

The practical move this unlocks is a pre-flight check. Before sending a request that you suspect might be large, count it first, and decide what to do based on the number rather than firing blindly and reading the damage afterward.

A pre-flight loop: shape the request, call count_tokens with no charge, decide whether to trim or send, then read the usage object on the real response to confirm cost.

The usage object: what a request actually cost

Counting is the before. The usage object is the after. Every real, non-counting response from the Claude API includes a usage block, and it has more in it than the two fields most tutorials introduce. The fields you need to know:

  • input_tokens: the fresh, uncached input tokens for this request.
  • output_tokens: the tokens the model generated. This is the more expensive side per token, and it is non-zero even for an empty reply because of how the API parses the model's output.
  • cache_creation_input_tokens: tokens written to the cache on this call.
  • cache_read_input_tokens: tokens read from the cache on this call.
  • server_tool_use: counts of server-side tool requests, such as web searches, which carry their own charges.

The single most important thing to remember when budgeting is this: once caching is in play, your total input is not input_tokens alone. It is the sum of three fields.

total_input_tokens = input_tokens + cache_creation_input_tokens + cache_read_input_tokens

Read only input_tokens and you will badly undercount a cached request's true size. Price the whole sum at the fresh-input rate and you will overestimate its cost, because cache reads are billed at a fraction of that rate. To estimate spend per request, multiply each count by that token type's price (fresh input, cache write, cache read, and output each have their own rate), then sum the results.
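As a rough sketch of that arithmetic, the helper below sums the three input fields and prices each token type separately. The Rates class, the function names, and the numbers in EXAMPLE_RATES are placeholders invented for this example, not Anthropic's price list; substitute the published rates for your model. The `or 0` guards assume the cache fields may be missing or None when caching is not in use.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class Rates:
    """USD per million tokens, one rate per token type (hypothetical container)."""
    input: float
    output: float
    cache_write: float
    cache_read: float


# Placeholder numbers for illustration only; look up real pricing for your model.
EXAMPLE_RATES = Rates(input=3.00, output=15.00, cache_write=3.75, cache_read=0.30)


def _field(usage, name: str) -> int:
    # Treat a missing or None field as zero.
    return getattr(usage, name, None) or 0


def total_input_tokens(usage) -> int:
    """Total input is the sum of the three input fields, not input_tokens alone."""
    return (
        _field(usage, "input_tokens")
        + _field(usage, "cache_creation_input_tokens")
        + _field(usage, "cache_read_input_tokens")
    )


def estimate_cost(usage, rates: Rates) -> float:
    """Multiply each count by its own rate, then sum. Returns USD."""
    per_million = (
        _field(usage, "input_tokens") * rates.input
        + _field(usage, "cache_creation_input_tokens") * rates.cache_write
        + _field(usage, "cache_read_input_tokens") * rates.cache_read
        + _field(usage, "output_tokens") * rates.output
    )
    return per_million / 1_000_000


# A stand-in for response.usage on a call that mostly hit the cache.
usage = SimpleNamespace(
    input_tokens=50,
    output_tokens=200,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=3000,
)
print(total_input_tokens(usage))  # 3050, even though input_tokens is only 50
print(f"${estimate_cost(usage, EXAMPLE_RATES):.5f}")
```

In a real service you would pass response.usage straight in; the SimpleNamespace here only stands in for it so the example runs offline.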

The point is not the arithmetic. It is the discipline: every field in usage corresponds to a real line on your bill, so budgeting means reading all of them, not just the first.

A mindmap of every field on the usage object: fresh input tokens, output tokens, cache creation, cache read, server tool use counts, and the rule that total input is the sum of three input fields.

Counting a realistic request, not a toy

A bare user message is the boring case. The endpoint earns its keep when you count a realistic request, the kind your service actually sends, with a substantial system prompt and a full tool set attached.

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system="...a long triage system prompt with rules and policy...",
    tools=[
        {
            "name": "get_order_status",
            "description": "Look up shipping status and ETA for an order by ID.",
            "input_schema": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        }
    ],
    messages=[{"role": "user", "content": "Where is my order 48810?"}],
)

print(count.input_tokens)   # reflects system + tools + message together

The returned count now includes the system prompt and the tool definitions, not just the question. Which is exactly what you want, because that is what the real request would carry.

This is also a quiet way to see how much your system prompt and tools weigh, which is useful when deciding whether they are worth caching. A fat, stable prefix that counts in the thousands of tokens is a prime candidate for prompt caching: you pay the cache-write cost once, then every subsequent call reads the prefix at a fraction of the normal input rate.
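To make the "worth caching" judgment concrete, here is a small break-even sketch. It works in units of base input tokens so no dollar prices are needed. The multipliers (a cache write at 1.25x the base input rate, a cache read at 0.10x) are assumptions taken from Anthropic's published pricing for the default cache lifetime at the time of writing, and the function name is hypothetical. It also assumes every call lands while the cache entry is still alive.

```python
def prefix_cost_units(
    prefix_tokens: int,
    calls: int,
    write_multiplier: float = 1.25,  # assumed cache-write premium over base input
    read_multiplier: float = 0.10,   # assumed cache-read discount
) -> tuple:
    """Return (uncached, cached) cost of a stable prefix over `calls` requests.

    Costs are expressed in base-input-token units: 1.0 unit is the price
    of one fresh input token.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    # Without caching, the full prefix is billed as fresh input every time.
    uncached = float(prefix_tokens * calls)
    # With caching, the first call writes the prefix, the rest read it.
    cached = prefix_tokens * (write_multiplier + read_multiplier * (calls - 1))
    return uncached, cached


for n in (1, 2, 10):
    uncached, cached = prefix_cost_units(prefix_tokens=3000, calls=n)
    print(f"{n:>2} calls: uncached={uncached:.0f} cached={cached:.0f}")
```

Under these assumed multipliers a single call is cheaper without caching, and the cached path pulls ahead from the second call onward, which is why a prefix that is both large and frequently reused is the one to cache.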

What count_tokens actually weighs: the system prompt, the tools array, every user and assistant message, and attached documents and images, all collapsed into a single input_tokens integer that drives a caching decision.

A triage service that checks before it spends

Here is where counting becomes a guardrail rather than a curiosity. Support emails are unpredictable in length. Most are a paragraph. Occasionally someone pastes an entire forwarded thread, a wall of logs, or a chain of forty replies. Sending that straight to the model is how a routine classification suddenly costs many times what it should.

The fix is a pre-flight token check that trims an oversized inbound thread down to something sensible before it ever reaches the model:

import anthropic

client = anthropic.Anthropic()

MODEL = "claude-sonnet-4-6"
MAX_INPUT_TOKENS = 4000   # budget ceiling for a single triage request

def triage(email_text: str):
    messages = [{"role": "user", "content": f"Classify this support email:\n\n{email_text}"}]

    # Pre-flight: how big is this request before we send it?
    count = client.messages.count_tokens(model=MODEL, messages=messages)

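    if count.input_tokens > MAX_INPUT_TOKENS:
        # Too big. Trim the email, then re-measure rather than blindly sending.
        original_tokens = count.input_tokens
        # 8,000 characters is a crude cut; the re-count below shows where it landed.
        email_text = email_text[:8000] + "\n\n[thread truncated for length]"
        messages = [{"role": "user", "content": f"Classify this support email:\n\n{email_text}"}]
        count = client.messages.count_tokens(model=MODEL, messages=messages)
        print(f"Trimmed oversized email from {original_tokens} to {count.input_tokens} tokens before sending.")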

    response = client.messages.create(model=MODEL, max_tokens=256, messages=messages)
    print("Input billed:", response.usage.input_tokens,
          "| Output billed:", response.usage.output_tokens)
    return response

triage("Subject: HELP\n" + ("(forwarded reply) " * 2000))

Trace the safeguard. The forwarded monster email arrives. The pre-flight count flags it as over the four-thousand-token ceiling. The service trims it and re-measures before sending, then makes the real call and prints the actual usage so the before-estimate and after-fact line up. A request that might have cost many times the norm is brought back inside budget automatically, with no human watching the queue.
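A fixed character cut does not guarantee the trimmed request lands under the ceiling. One way to close that gap, sketched below, is to let the token count drive the trim: binary-search for the longest prefix whose measured size fits. The function trim_to_budget and its count_tokens parameter are hypothetical; the counter is passed in so the logic can be exercised offline. The search assumes the count never shrinks as the prefix grows.

```python
from typing import Callable

MARKER = "\n\n[thread truncated for length]"


def trim_to_budget(
    text: str,
    count_tokens: Callable[[str], int],
    budget: int,
    marker: str = MARKER,
) -> str:
    """Return text unchanged if it fits, else the longest prefix plus marker
    whose measured token count is within budget."""
    if count_tokens(text) <= budget:
        return text

    lo, hi = 0, len(text)  # lo: longest prefix length known (or assumed) to fit
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_tokens(text[:mid] + marker) <= budget:
            lo = mid       # this prefix fits, try a longer one
        else:
            hi = mid - 1   # too big, try a shorter one
    trimmed = text[:lo] + marker

    # Re-measure the final result so the caller never gets an oversized value.
    if count_tokens(trimmed) > budget:
        raise ValueError("budget is too small to fit even the truncation marker")
    return trimmed


# Offline stand-in for the real counter: one "token" per whitespace-separated word.
def fake_counter(s: str) -> int:
    return len(s.split())


result = trim_to_budget("word " * 1000, fake_counter, budget=100)
print(fake_counter(result) <= 100)  # True
```

In a live service the counter would wrap client.messages.count_tokens around the full request and return its input_tokens. Each probe is then one API call, so a 36,000-character email costs roughly sixteen counts; since the endpoint has its own rate limits, a cheaper first guess followed by a single re-count may be the better trade in a hot path.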

A sequence showing the triage service receiving a giant forwarded email, calling count_tokens, seeing 7,300 tokens against a 4,000 budget, trimming, re-measuring to 1,840, then sending and confirming the billed usage matches the estimate.

The whole technique is just two halves of one loop: count to decide, send, then read usage to confirm. Do that on every request that touches user-controlled input and "the bill spiked overnight" stops being a thing that happens to your service.

The lifecycle, end to end

It helps to see the whole arc of a single request once. Every call to the Claude API moves through the same states whether you measure it or not: you shape it, optionally estimate it, decide whether to send it, send it, and reconcile what came back. The discipline count_tokens introduces is making the estimating and reconciling steps explicit instead of skipped.

A state diagram of a single Claude API request lifecycle: shaping, estimating with count_tokens, deciding whether to trim or send, sending, then reconciling by reading the usage object and summing the three input fields.

The estimating state is free. The reconciling state is the only place where cost data is real. Skip either, and you are guessing.
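One possible shape for the reconciling step is a single record that puts the pre-flight estimate next to everything the usage object reported. The reconcile function and its key names are illustrative choices, not part of the SDK; the usage argument is assumed to expose the fields described earlier, with missing or None values treated as zero.

```python
from types import SimpleNamespace


def reconcile(estimated_input: int, usage) -> dict:
    """Build a log-ready record comparing the estimate with billed usage."""

    def field(name: str) -> int:
        return getattr(usage, name, None) or 0

    fresh = field("input_tokens")
    written = field("cache_creation_input_tokens")
    read = field("cache_read_input_tokens")
    actual = fresh + written + read  # total input is the sum of three fields
    drift = actual - estimated_input

    return {
        "estimated_input_tokens": estimated_input,
        "input_tokens": fresh,
        "cache_creation_input_tokens": written,
        "cache_read_input_tokens": read,
        "output_tokens": field("output_tokens"),
        "actual_input_tokens": actual,
        "drift_tokens": drift,
        "drift_pct": round(100 * drift / estimated_input, 2) if estimated_input else None,
    }


# Stand-in for response.usage on a mostly cached call.
usage = SimpleNamespace(
    input_tokens=40,
    output_tokens=12,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=1800,
)
record = reconcile(estimated_input=1840, usage=usage)
print(record["actual_input_tokens"], record["drift_tokens"])  # 1840 0
```

Comparing the estimate against input_tokens alone would report a drift of 1,800 tokens on this call, which is exactly the undercounting trap the three-field sum avoids.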

Do this today

Three small changes that take less than an hour and immediately tighten your cost story:

  • Wrap one expensive endpoint with count_tokens. Pick the call in your service most likely to receive user-pasted text. Add a pre-flight call to client.messages.count_tokens(...) with the same messages, system, and tools. Define a ceiling. Trim or reject anything above it. Log both the estimate and the real usage so you can compare.
  • Log every field of usage, not just input_tokens and output_tokens. Add cache_creation_input_tokens and cache_read_input_tokens to your structured logs today, even if you have not turned caching on. The day you do turn it on, your dashboards will already be correct.
  • Count your system prompt and tools in isolation. Call count_tokens with your system and tools but a one-word user message. The number you get back is the fixed overhead every request in your service carries. If it is over roughly a thousand tokens and it does not change often, it is a prime candidate for prompt caching.
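The third change can be sketched as a tiny helper. Here fixed_overhead takes the counting function as a parameter, so in production you would pass client.messages.count_tokens, while the example below uses a fake to stay offline. The helper names and the 1,000-token threshold are illustrative; the threshold simply mirrors the rule of thumb above, and the minimum cacheable length for your model may differ.

```python
from types import SimpleNamespace
from typing import Callable, Optional

CACHE_CANDIDATE_THRESHOLD = 1000  # rough rule of thumb, not an API constant


def fixed_overhead(
    count_tokens: Callable[..., object],
    model: str,
    system: Optional[str] = None,
    tools: Optional[list] = None,
) -> int:
    """Measure the tokens every request carries before the user says anything."""
    kwargs = {
        "model": model,
        # A one-word message keeps the variable part of the count negligible.
        "messages": [{"role": "user", "content": "Hi"}],
    }
    if system is not None:
        kwargs["system"] = system
    if tools:
        kwargs["tools"] = tools
    return count_tokens(**kwargs).input_tokens


def is_cache_candidate(overhead_tokens: int) -> bool:
    return overhead_tokens > CACHE_CANDIDATE_THRESHOLD


# Fake counter standing in for client.messages.count_tokens.
def fake_count_tokens(**kwargs):
    return SimpleNamespace(input_tokens=1500)


overhead = fixed_overhead(fake_count_tokens, "claude-sonnet-4-6", system="...policy...")
print(overhead, is_cache_candidate(overhead))  # 1500 True
```

Running this once per deploy and logging the result gives you a baseline to watch: if the overhead jumps after a prompt edit, you see it before the bill does.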

Cost stops being something that happens to you

You can now measure a request's input size before sending it with count_tokens, read the full usage object after to see what it truly cost, remember that a cached request splits its input across three fields and not one, and wire a pre-flight check into a real service so oversized inputs get caught instead of billed.

The shift this represents is small in code and large in posture. You stop reacting to the bill at the end of the month. You start deciding, on a per-request basis, what you are willing to spend. The same endpoint you ignored on day one becomes the thing that lets you sleep through a traffic spike.

So the next time you build a Claude API integration and your gut says "this prompt might be big," do not guess. Count it. The endpoint is right there. It is free to call. It is accurate to within a small margin. And it is the difference between a service you forecast and a service that ambushes you.


This is Part 7 of "Building with the Claude API," an eleven-part guide that takes a developer from a first messages.create call to a hardened, observable, production-deployed integration.