Cost Tracking
See which agents, models, prompts, and runs are driving AI spend before you optimize or change providers.
PromptRails estimates the cost of every model call and rolls those estimates up from spans to executions and workspace summaries.
For PromptRails-hosted models, usage is also tied to the workspace balance in Billing. Cost Tracking shows what happened inside runs; LLM Gateway explains how hosted-model balance, free allowances, and transactions are managed.
How Costs Are Calculated
Cost is based on:
- Token usage -- The number of prompt (input) and completion (output) tokens consumed
- Model pricing -- The per-token price for the specific LLM model used
- Automatic aggregation -- Span-level costs are rolled up to execution and workspace levels
The formula is:
cost = (prompt_tokens * input_price_per_token) + (completion_tokens * output_price_per_token)
PromptRails keeps model pricing metadata for supported providers and uses it to estimate costs automatically.
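As a worked instance of the formula above, the sketch below prices a single call. The function name and the prices ($2.50 and $10.00 per 1M tokens) are made up for illustration; they are not PromptRails or provider pricing.

```python
from decimal import Decimal

def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_token: Decimal,
                  output_price_per_token: Decimal) -> Decimal:
    """Apply the cost formula to one model call."""
    return (prompt_tokens * input_price_per_token
            + completion_tokens * output_price_per_token)

# Hypothetical prices, quoted per 1M tokens and converted to per-token.
INPUT_PRICE = Decimal("2.50") / Decimal(1_000_000)
OUTPUT_PRICE = Decimal("10.00") / Decimal(1_000_000)

# 1,200 prompt tokens and 300 completion tokens.
cost = estimate_cost(1_200, 300, INPUT_PRICE, OUTPUT_PRICE)
print(f"Estimated cost: ${cost:.6f}")  # Estimated cost: $0.006000
```

Decimal is used here only to keep the arithmetic exact for very small per-token prices; whether PromptRails does the same internally is not stated.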
Per-Execution Cost Breakdown
Every execution includes total cost and token usage:
execution = client.executions.get(execution_id="your-execution-id")
print(f"Total Cost: ${execution['data']['cost']:.6f}")
print(f"Token Usage: {execution['data']['token_usage']}")
print(f"Duration: {execution['data']['duration_ms']}ms")
For multi-step executions such as chains, workflows, and multi-agent runs, the execution cost is the sum of the LLM spans inside that execution.
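The roll-up described here can be sketched without calling the API. The span records below are hypothetical, and the "kind", "name", and "cost" keys are assumptions modelled on the examples in this section rather than a confirmed response schema.

```python
# Hypothetical span records for one execution (assumed shape, not a real response).
spans = [
    {"kind": "llm", "name": "plan", "cost": 0.0021},
    {"kind": "tool", "name": "search"},  # non-LLM span: carries no model cost
    {"kind": "llm", "name": "answer", "cost": 0.0054},
]

def execution_cost(spans: list[dict]) -> float:
    """Sum the cost of the LLM spans; other span kinds are ignored."""
    return sum(s.get("cost", 0.0) for s in spans if s.get("kind") == "llm")

total = execution_cost(spans)
print(f"Execution cost: ${total:.6f}")
```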
Per-Span Cost Analysis
Drill down into individual spans to see cost at the LLM call level:
traces = client.traces.list(execution_id="your-execution-id", kind="llm")
for span in traces["data"]:
    print(f"Model: {span['model_name']}")
    print(f" Prompt tokens: {span.get('prompt_tokens', 0)}")
    print(f" Completion tokens: {span.get('completion_tokens', 0)}")
    print(f" Cost: ${span.get('cost', 0):.6f}")
    print(f" Duration: {span['duration_ms']}ms")
    print("---")
This is valuable for identifying which step in a complex pipeline is the most expensive.
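One way to find the most expensive step is to rank spans by cost and show each span's share of the total. This is a minimal sketch over hypothetical data: the "name" key and the model names are placeholders, and only "model_name" and "cost" come from the fields shown in this section.

```python
# Hypothetical LLM spans for one execution (placeholder names and costs).
spans = [
    {"name": "classify", "model_name": "model-small", "cost": 0.0004},
    {"name": "retrieve-summary", "model_name": "model-small", "cost": 0.0011},
    {"name": "final-answer", "model_name": "model-large", "cost": 0.0085},
]

def rank_by_cost(spans: list[dict]) -> list[tuple[str, float, float]]:
    """Return (name, cost, share_of_total) tuples, most expensive first."""
    total = sum(s.get("cost", 0.0) for s in spans)
    ranked = sorted(spans, key=lambda s: s.get("cost", 0.0), reverse=True)
    return [
        (s["name"], s.get("cost", 0.0),
         s.get("cost", 0.0) / total if total else 0.0)
        for s in ranked
    ]

ranking = rank_by_cost(spans)
for name, cost, share in ranking:
    print(f"{name}: ${cost:.6f} ({share:.0%} of execution cost)")
```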
Token Usage Tracking
Token usage is tracked with three metrics:
| Metric | Description |
|---|---|
| prompt_tokens | Number of tokens in the input sent to the model |
| completion_tokens | Number of tokens in the model's response |
| total_tokens | Sum of prompt and completion tokens |
Token counts are provided by the LLM provider and recorded with each LLM span.
Workspace-Wide Cost Summary
Get aggregated cost data across your entire workspace:
summary = client.costs.get_summary()
print(f"Total Cost: ${summary.total_cost:.2f}")
print(f"Total Executions: {summary.total_executions}")
print(f"Total Tokens: {summary.total_tokens}")
Per-Agent Cost Analysis
Analyze costs broken down by agent to identify your most expensive workflows:
agent_costs = client.costs.get_agent_summary("your-agent-id")
print(f"Agent Total Cost: ${agent_costs.total_cost:.2f}")
print(f"Agent Executions: {agent_costs.total_executions}")
print(f"Agent Tokens: {agent_costs.total_tokens}")
LLM Model Pricing
PromptRails maintains pricing information for all supported LLM models. Prices are stored as cost per token (or per 1K/1M tokens depending on the model) and are used for automatic cost calculation.
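Because prices may be quoted per token, per 1K tokens, or per 1M tokens, they need to be normalized to one unit before the cost formula is applied. The sketch below shows one way to do that; the unit labels and the function are illustrative assumptions, not PromptRails' storage format.

```python
from decimal import Decimal

# Number of tokens each quoted unit covers (labels are hypothetical).
UNIT_SIZES = {"token": 1, "1k": 1_000, "1m": 1_000_000}

def per_token_price(price: Decimal, unit: str) -> Decimal:
    """Convert a quoted price to a per-token price."""
    if unit not in UNIT_SIZES:
        raise ValueError(f"unknown pricing unit: {unit!r}")
    return price / UNIT_SIZES[unit]

# The same hypothetical price expressed in two different units.
from_million = per_token_price(Decimal("3.00"), "1m")
from_thousand = per_token_price(Decimal("0.003"), "1k")
print(from_million == from_thousand)  # True
```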
Supported providers and their models include:
| Provider | Example Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo |
| Anthropic | claude-3.5-sonnet, claude-3-opus, claude-3-haiku |
| Google | gemini-pro, gemini-ultra |
| DeepSeek | deepseek-chat, deepseek-coder |
| Fireworks | Various hosted models |
| xAI | grok-2, grok-3, grok-4 |
| OpenRouter | Aggregated pricing from multiple providers |
| Together AI | Hosted open models (Llama, Qwen, …) |
| Mistral | mistral-large, mistral-small, codestral |
| Cohere | command-a, command-r |
| Groq | Llama / Qwen on LPU inference |
| Perplexity | sonar, sonar-pro |
| AWS Bedrock | Claude, Llama, Nova, and more on AWS |
| Cerebras | llama3.1-8b, llama3.3-70b |
| SambaNova | DeepSeek-V3, Meta-Llama-3.1-70B |
| Hyperbolic | meta-llama/Meta-Llama-3.1-70B, DeepSeek-V3 |
| DeepInfra | DeepSeek-V4-Pro, Llama-3.3-70B, Gemma-4-31B |
| Novita AI | DeepSeek-V4-Pro, Qwen3.7-Max, Kimi K2.6 |
| Friendli AI | Llama-3.3-70B, GLM-5.1, MiniMax-M2.5 |
| Chutes AI | DeepSeek-V3, Qwen3-32B, Gemma-4-31B-TEE |
| Z.AI | glm-5.1, glm-4.7, glm-4.6v |
| Moonshot | kimi-k2.6, kimi-k2-thinking |
| DashScope | qwen-max, qwen-plus, qwen3.7-max |
| Hugging Face | DeepSeek-V4-Pro, Qwen3-Coder-Next (router access) |
Cached input tokens (prompt-cache read hits) are priced at the model's discounted cached-input rate when one is configured, so cost estimates reflect the lower price providers charge for cached prompt prefixes.
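To illustrate the effect of a cached-input rate, the sketch below extends the cost formula with a third term. It assumes cached tokens are reported as a subset of prompt_tokens and uses made-up prices; the source does not specify how PromptRails records cache hits, so treat both as assumptions.

```python
from decimal import Decimal

PER_MILLION = Decimal(1_000_000)

def cost_with_cache(prompt_tokens: int, cached_tokens: int,
                    completion_tokens: int, input_price: Decimal,
                    cached_input_price: Decimal,
                    output_price: Decimal) -> Decimal:
    """Price cached prompt tokens at the discounted rate, the rest at full rate."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    return (uncached_tokens * input_price
            + cached_tokens * cached_input_price
            + completion_tokens * output_price)

# Hypothetical per-1M prices: $3.00 input, $0.30 cached input, $15.00 output.
prices = (Decimal("3.00") / PER_MILLION,
          Decimal("0.30") / PER_MILLION,
          Decimal("15.00") / PER_MILLION)

with_cache = cost_with_cache(10_000, 8_000, 500, *prices)
without_cache = cost_with_cache(10_000, 0, 500, *prices)
print(f"With cache hits: ${with_cache:.4f}")   # With cache hits: $0.0159
print(f"Without cache:   ${without_cache:.4f}")  # Without cache:   $0.0375
```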
Cost Optimization Tips
- Choose the right model -- Use smaller, cheaper models (GPT-4o-mini, Claude 3 Haiku) for simple tasks and reserve expensive models for complex reasoning
- Use prompt caching -- Enable cache timeouts on prompts to avoid duplicate LLM calls
- Monitor per-agent costs -- Identify agents with unexpectedly high costs
- Set token limits -- Use max_tokens to prevent runaway completions
- Review chain/workflow costs -- Multi-step agents multiply LLM costs; optimize the number of steps
- Use data source caching -- Cache database query results to reduce data source query overhead
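For the token-limit tip, a rough upper bound on a call's cost follows from treating max_tokens as the largest possible completion. The sketch below uses hypothetical prices and assumes the provider never returns more than max_tokens completion tokens.

```python
from decimal import Decimal

def max_call_cost(prompt_tokens: int, max_tokens: int,
                  input_price: Decimal, output_price: Decimal) -> Decimal:
    """Worst-case cost of one call if the completion uses all of max_tokens."""
    return prompt_tokens * input_price + max_tokens * output_price

# Hypothetical prices: $1.00 input and $4.00 output per 1M tokens.
INPUT_PRICE = Decimal("1.00") / Decimal(1_000_000)
OUTPUT_PRICE = Decimal("4.00") / Decimal(1_000_000)

tight = max_call_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
loose = max_call_cost(2_000, 4_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"max_tokens=500:  at most ${tight:.4f} per call")   # at most $0.0040
print(f"max_tokens=4000: at most ${loose:.4f} per call")  # at most $0.0180
```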
Related Topics
- Executions -- Execution-level cost data
- Tracing -- Span-level cost breakdown
- Billing and Plans -- Execution limits and plan pricing
- LLM Gateway -- Hosted-model balance, free allowances, and gateway usage