These days, AI is on everyone’s lips; I also use AI extensively in my own automation platform to streamline operational tasks. However, after a while, when the bills started piling up, I faced the unpleasant reality known as the “Tokenpocalypse.” While large language models (LLMs) are incredibly capable, every API call and every token comes with a cost. Especially in the MSP business, in multi-client environments, or with constantly running background automations, these costs can quickly snowball.
Having experienced this situation with my own systems, I began researching and implementing practical steps to reduce AI costs. In this post, I will share seven different ways to cut your AI bills in the “Tokenpocalypse” era, based on my own experiences. My goal is not just to give you theoretical knowledge, but also to provide concrete strategies that work in the field.
1. Starting with Local LLMs: The Balance of Sensitive Data and Cost
While cloud-based LLM services are convenient, it’s essential to consider local solutions for both security and cost, especially when dealing with sensitive data. In my own automation platform, I always keep this balance in mind when processing sensitive content like financial analyses or customer data. While I use services like the Claude API for more general analyses or public data, I’ve turned to local LLMs for processing critical data.
Running a local LLM requires some initial setup effort, but in the long run it offers significant advantages in both data security and cost. For example, with tools like Ollama, I can easily run various open-source models (like Mistral or Llama 3) on my own server or even a powerful workstation. This way, sensitive data never goes to a third-party cloud service, and I don’t have to worry about per-token costs. My costs are limited to the hardware running the model and the electricity it consumes.
Local models are a good fit for repetitive, high-volume, or latency-sensitive tasks. For instance, when generating monthly status reports or searching for specific patterns in log analyses, instead of going to the cloud API every time, I use my local model, which brings the marginal cost per request close to zero. Here’s a representative configuration example showing how I run Ollama behind an Nginx reverse proxy with Docker Compose in my own system:
```yaml
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # If you want to use a GPU:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
    restart: always

  nginx:
    image: nginx:latest
    container_name: nginx-proxy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - ollama
    restart: always

volumes:
  ollama_data:
```
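With this configuration, I run Ollama in a Docker container and make it accessible via Nginx. Nginx not only acts as a reverse proxy but also provides security layers like SSL termination, so my own automations and other services can reach the local LLM over an encrypted connection. Note that the Compose file above also publishes port 11434 directly on the host; if Nginx should be the only entry point, remove that port mapping or bind it to 127.0.0.1.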
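Once the stack is running, other services can talk to the model over HTTP. The following is a minimal sketch, assuming Ollama's `/api/generate` endpoint is reached directly on the port published in the Compose file and that a model named `mistral` has already been pulled. The helper names and the base URL are illustrative, not part of my actual platform; behind the proxy, the URL would be whatever your `nginx.conf` exposes.

```python
import json
import urllib.request

# Assumption: Ollama is reachable directly on the port published in the Compose file.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(prompt, model="mistral", base_url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_llm(prompt, model="mistral", timeout=120):
    """Send the prompt to the local model and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False, Ollama returns one JSON object whose
    # "response" field holds the generated text.
    return body["response"]
```

Calling `ask_local_llm("Summarize this log line: ...")` then replaces a paid API call for tasks where a smaller model is good enough.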
2. The Art of Prompt Engineering: Fewer Tokens, More Work
When working with AI models, the prompts you write directly affect both the quality of the output and the number of tokens consumed. The same task can often be achieved with far fewer tokens, while a poorly written prompt can waste hundreds or even thousands of them. I’ve observed this repeatedly in my own automations; especially in initial attempts, when I wrote prompts too generally, I received responses with unnecessary details or repetitions.
Prompt engineering is truly an art and a skill that develops over time. Our goal is to tell the model what we want in the shortest, clearest, and most structured way possible. This not only saves tokens but also helps the model produce more accurate and usable responses. For example, when I want to summarize a text, instead of saying “Summarize this text,” saying “Summarize this text in no more than 100 words, highlighting critical points and including only the main idea” yields much more efficient results.
There are a few basic prompt engineering principles for token saving:
- Be Clear and Concise: Avoid unnecessary words and flowery language. Get straight to the point.
- Provide Examples (Few-shot prompting): Giving a few short examples that match the expected output format helps the model understand the expectation. The examples themselves cost input tokens, so keep them brief; the saving comes from fewer retries and less off-format or overly long output.
- Specify Constraints: Clearly state the length, format (JSON, list, etc.), or information that should not be included in the output.
- Chaining Prompts: For complex tasks, it’s often more efficient to break them down into smaller, manageable steps with separate prompts, rather than using one massive prompt. The output of each step becomes the input for the next.
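The chaining principle from the list above can be sketched in a few lines. This is an illustrative example, not my production code: `run_chain`, the two prompt templates, and the `fake_llm` stand-in are all hypothetical names, and in practice `llm` would be a function that calls a real model.

```python
def run_chain(steps, llm, initial_input):
    """Run prompt templates in order, feeding each output into the next step."""
    current = initial_input
    for template in steps:
        prompt = template.format(input=current)
        current = llm(prompt)
    return current


# Hypothetical two-step chain: extract first, then summarize the extraction.
STEPS = [
    "List only the ERROR lines from these logs, one per line:\n{input}",
    "Summarize these errors in at most 30 words:\n{input}",
]


def fake_llm(prompt):
    """Stand-in for a real model call; returns the last prompt line in upper case."""
    return prompt.splitlines()[-1].upper()
```

Because each step sees only the previous step's output, the second prompt never pays for the full raw logs again.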
For example, to extract an error code and the relevant service name from a log entry, I might use a prompt like this:
Bad Prompt:
Look at this log entry and find the error code and service: "2023-10-27 14:35:01 ERROR [PaymentService] Transaction failed with code 500. User: Alice. Details: Connection timeout."
This prompt leaves the output format open, so the model has to guess how to present the answer. It often wraps the two values in explanatory sentences, which costs extra output tokens and makes the response harder to parse in an automation.
Good Prompt:
From the following log entry, extract the `ERROR` level error code and the service name that generated the error message in JSON format.
Log: "2023-10-27 14:35:01 ERROR [PaymentService] Transaction failed with code 500. User: Alice. Details: Connection timeout."
Format: {"error_code": "...", "service_name": "..."}
This prompt specifies the output format and points the model straight at the information we want. It reduces token usage by discouraging the model from adding explanations around the result.
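In an automation, the structured prompt pays off twice, because the reply can also be validated mechanically. Here is a small sketch of that idea; the template wording, `build_extraction_prompt`, and `parse_extraction` are assumptions made for illustration.

```python
import json

EXPECTED_KEYS = {"error_code", "service_name"}

# Doubled braces are literal braces in str.format templates.
PROMPT_TEMPLATE = (
    "From the following log entry, extract the error code and the service name "
    "as JSON. Return only the JSON object.\n"
    'Log: "{log}"\n'
    'Format: {{"error_code": "...", "service_name": "..."}}'
)


def build_extraction_prompt(log_line):
    """Fill the log line into the fixed, format-constrained prompt."""
    return PROMPT_TEMPLATE.format(log=log_line)


def parse_extraction(raw_reply):
    """Parse the model's reply and verify that exactly the expected keys are present."""
    data = json.loads(raw_reply)  # raises ValueError on non-JSON replies
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError("reply does not match the requested format")
    return data
```

A reply that fails validation can be retried or routed to a fallback instead of silently flowing into downstream steps.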
3. Narrowing the Context Window with RAG (Retrieval-Augmented Generation)
One of the biggest cost drivers for large language models is the amount of text we add to the context window. The more information we give a model, the more tokens we use, and the higher our bill. This becomes a serious problem, especially when we need to query long documents, comprehensive log entries, or large knowledge bases. This is where Retrieval-Augmented Generation (RAG) comes in.
RAG is an approach in which relevant information is first “retrieved” from an external knowledge base (documents, databases, web pages, etc.) and then added to the prompt, so the model generates its response from that enriched context. In my own automation platform, I extensively use RAG to quickly access information like specific configurations of my MSP clients, frequently asked questions, or past problem solutions, and to generate automated responses or triage suggestions using this information.
The basic idea behind RAG is to give up the expectation that the model “knows everything” and instead present it only with the information it needs at that moment, drawn from a pre-indexed knowledge base. Sending a few hundred or a few thousand tokens of relevant text instead of a document of tens of thousands of tokens cuts the input cost of each call accordingly.
A simple RAG flow consists of these steps:
- Document Loading and Chunking: Divide the documents in your knowledge base (PDFs, text files, HTML, etc.) into small, meaningful chunks.
- Embedding Generation: Create an embedding (vector representation) for each document chunk. These vectors mathematically represent the semantic content of the text.
- Storing in a Vector Database: Store these embeddings (and the original text chunks) in a vector database (e.g., Qdrant, Chroma, Pinecone) that allows for fast searching.
- Query Embedding Generation: When a user’s question arrives, generate an embedding for this question as well.
- Similarity Search: Use the query embedding to find the most similar document chunks in the vector database.
- Prompt Creation: Take the found relevant document chunks and place them into the prompt to be sent to the LLM, along with the user’s original question.
- Response Generation: The LLM uses this narrowed and relevant context presented to it to generate the response.
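The steps above can be sketched end to end in plain Python. To keep the example self-contained, it swaps in a toy bag-of-words "embedding" and an in-memory list where a real setup would use an embedding model and a vector database such as Qdrant or Chroma; all names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a term-count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk(text, max_words=40):
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


class TinyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (embedding, chunk_text)

    def add(self, text):
        for piece in chunk(text):
            self.items.append((embed(piece), piece))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_rag_prompt(question, store, k=2):
    """Place only the top-k retrieved chunks into the prompt."""
    context = "\n".join(f"- {c}" for c in store.search(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The prompt that reaches the LLM contains only the top `k` chunks, no matter how large the knowledge base grows.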
This approach offers significant advantages in terms of both cost and accuracy, especially for SMBs with complex and extensive knowledge bases. For example, when I want to automate an IT support request, I take the user’s problem and first search my database of past solutions, product manuals, and FAQs using RAG. Then, I send the most relevant retrieved chunks to the Claude API or a local LLM to get a specific and accurate answer to the problem. This way, the model doesn’t have to “read” all the documentation every time.
4. Smart Caching: The Golden Rule for Repetitive Tasks
One of the biggest factors inflating AI service bills is the repeated querying of the same or similar requests. Especially in MSP operations like mine, tasks such as certain types of alert triages, standard report summaries, or answers to frequently asked questions can come up repeatedly with the same inputs. Instead of going to the LLM API every time, we can achieve significant cost savings by implementing caching mechanisms.
Caching means storing the response to a query so that it can be reused for the next identical query. If the same query comes again, we retrieve the stored response from the cache instead of going to the LLM. A cache hit consumes no tokens at all and returns much faster than an API call. Of course, caching is not suitable for every scenario. In situations where the response needs to be dynamic or contains constantly changing information, caching can be risky. However, for static or slowly changing information, it’s a golden strategy.
When designing a caching strategy, there are a few points to consider:
- Key Generation: We need to generate a unique cache key for each query. This key is usually a cryptographic hash (e.g., SHA-256) of the prompt together with anything else that changes the answer, such as the model name.
- TTL (Time-To-Live): We need to determine how long a response in the cache will be valid. A very short TTL reduces the benefit of caching, while a very long TTL can lead to serving stale data. This duration needs to be set appropriately based on the use case.
- Invalidation Mechanisms: It’s important to have mechanisms to manually invalidate relevant cache entries if we know that information has changed.
- Storage Space: The potential size of the cache and how much storage space it will consume should also be considered.
A simple Python example of a Redis-based caching mechanism could be designed as follows:
```python
import hashlib
import json
import time
from datetime import timedelta

import redis

# Redis connection (assumes a Redis server on localhost:6379)
r = redis.Redis(host='localhost', port=6379, db=0)


def generate_cache_key(prompt_text, model_name="gpt-3.5-turbo"):
    """Generates a unique cache key based on the prompt and model name."""
    unique_string = f"{model_name}:{prompt_text}"
    return hashlib.sha256(unique_string.encode('utf-8')).hexdigest()


def get_llm_response_cached(prompt_text, llm_api_call_function,
                            cache_ttl_seconds=3600, model_name="gpt-3.5-turbo"):
    """Returns the cached response if present; otherwise calls the LLM and caches the result."""
    cache_key = generate_cache_key(prompt_text, model_name)

    # Check if in cache
    cached_response = r.get(cache_key)
    if cached_response is not None:
        print("Retrieving response from cache...")
        return json.loads(cached_response)

    # If not, call the LLM API
    print("Calling LLM API...")
    response = llm_api_call_function(prompt_text)  # Actual API call

    # Save response to cache; setex stores the value together with its TTL
    r.setex(cache_key, timedelta(seconds=cache_ttl_seconds), json.dumps(response))
    return response


# A representative LLM API call function
def mock_llm_api_call(prompt):
    time.sleep(2)  # Simulate API delay
    return {"text": f"This is a response from the LLM for '{prompt}'."}


# Usage example
prompt1 = "What's the weather like in Bursa?"
prompt2 = "What's the weather like in Bursa?"  # Same prompt
prompt3 = "What's the weather like in Istanbul?"

print(get_llm_response_cached(prompt1, mock_llm_api_call))
print(get_llm_response_cached(prompt2, mock_llm_api_call))  # Will come from cache
print(get_llm_response_cached(prompt3, mock_llm_api_call))
```
In this example, the `get_llm_response_cached` function first looks up the cache key in Redis. If a response exists, it returns it; otherwise, it calls the LLM API and caches the response. Even this simple mechanism has significantly helped me reduce my AI bill for certain query types. It’s perfect for queries we frequently encounter in MSP operations, such as “What does this error code mean?” or “What was the standard configuration for this device?”
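The TTL and invalidation points from the checklist can also be tried out without a Redis server. The sketch below is an in-memory illustration under stated assumptions: the `TTLCache` class is hypothetical, and the key normalization (lower-casing and collapsing whitespace) is a design choice that only makes sense when case does not change the meaning of your prompts.

```python
import hashlib
import time


class TTLCache:
    """Minimal in-memory cache with per-entry expiry and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested without sleeping
        self.entries = {}    # key -> (expires_at, value)

    @staticmethod
    def key_for(model_name, prompt_text):
        # Assumption: whitespace and case differences should share one cache entry.
        normalized = " ".join(prompt_text.lower().split())
        return hashlib.sha256(f"{model_name}:{normalized}".encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self.entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self.entries[key] = (self.clock() + self.ttl, value)

    def invalidate(self, key):
        """Remove an entry manually, e.g. after the underlying information changed."""
        self.entries.pop(key, None)
```

Normalizing the prompt before hashing raises the hit rate for near-identical requests, at the price of treating prompts that differ only in case as the same question.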
5. Matching the Right Model to the Right Task: Price/Performance Balance
The world of AI models is vast; there are many options, from small, fast, and affordable models to massive, capable, but expensive ones. In my own automation platform and MSP work, I always make sure to choose the most suitable model for the task, rather than always using the largest and most capable one. This is a critical decision that directly impacts costs.
For example, using a high-end model like GPT-4 Turbo for a simple text classification or keyword extraction task is often an unnecessary expense. A more affordable model like GPT-3.5 Turbo, or even an open-source model like Mistral running locally, is usually sufficient for such tasks. My preference is to use the Claude API (especially its more affordable versions like Claude 3 Haiku) for general analyses or situations requiring more creative text generation, while deploying my local Ollama models for more structured and repetitive tasks.
When selecting a model, I consider the following factors:
- Task Complexity: How complex is the task? Does it require creativity, understanding subtle nuances, or multi-step reasoning? Or is it simple summarization or data extraction?
- Accuracy and Precision Needs: How accurate and precise does the response need to be? What is the potential cost of an incorrect response?
- Latency Requirements: How quickly does the response need to arrive? Is it a user-interactive application or a background batch job?
- Cost Constraints: What is your budget? Per-token costs can vary significantly between different models.
The table below shows representative model cost differences (actual figures vary by API provider and exchange rate):
| Model | Input Cost (USD per 1K tokens) | Output Cost (USD per 1K tokens) | Typical Use Case |
|---|---|---|---|
| GPT-4 Turbo | ~$0.01 | ~$0.03 | Complex analysis, creative writing, code generation |
| GPT-3.5 Turbo | ~$0.0005 | ~$0.0015 | Summarization, classification, chatbots |
| Claude 3 Haiku | ~$0.00025 | ~$0.00125 | Fast summarization, light chat, data extraction |
| Mistral/Llama 3 | Local (Hardware cost) | Local (Hardware cost) | Sensitive data processing, repetitive simple tasks |
Looking at this table, for example, if I’m just summarizing a text, I would opt for Claude 3 Haiku or GPT-3.5 Turbo instead of GPT-4 Turbo. When working with sensitive data, I would prefer the Mistral model on my own server. These conscious choices lead to significant reductions in my AI bill.
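To see what these differences mean on a monthly bill, the table can be turned into a quick estimate. The prices below are the representative figures from the table, while the workload (1,000 calls per day, 2,000 input and 300 output tokens per call) is a made-up assumption for illustration.

```python
# Representative prices in USD per 1K tokens, copied from the table above.
PRICES_PER_1K = {
    "GPT-4 Turbo": (0.01, 0.03),
    "GPT-3.5 Turbo": (0.0005, 0.0015),
    "Claude 3 Haiku": (0.00025, 0.00125),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Cost of a single call in USD."""
    in_price, out_price = PRICES_PER_1K[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price


def monthly_cost(model, calls_per_day, input_tokens, output_tokens, days=30):
    """Cost of a steady workload over one month in USD."""
    return estimate_cost(model, input_tokens, output_tokens) * calls_per_day * days


# Hypothetical workload: 1,000 calls per day, 2,000 input and 300 output tokens each.
for name in PRICES_PER_1K:
    print(f"{name}: ${monthly_cost(name, 1000, 2000, 300):.2f} per month")
```

Under these assumptions the same workload comes to roughly $870 per month on GPT-4 Turbo, $43.50 on GPT-3.5 Turbo, and $26.25 on Claude 3 Haiku.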
6. Output Control and Capturing Necessary Detail
The length of responses from AI models directly affects the number of tokens consumed. Sometimes, the model generates responses that contain far more information or detail than we want. This not only increases costs but also lengthens the time it takes to process the returned data. Therefore, controlling the output from AI and ensuring it only provides the information we need is an important step in cost optimization.
At this point, the “specifying constraints” principle of prompt engineering comes into play. By clearly defining the format, length, and key information we expect from the model’s output, we can prevent unnecessary token consumption. In my own automation platform, I pay close attention to this, especially when generating report summaries or technical analyses. For example, when I ask for specific error codes from a log analysis, I expect the model to provide only the error code and the relevant service name in JSON format, not lengthy explanations for each error code.
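Instructions in the prompt are requests rather than guarantees, so it helps to back them with a hard limit at the API level. As a sketch for a local model, Ollama's `/api/generate` endpoint accepts a `format` field and a `num_predict` option; the helper name and the default of 150 tokens are assumptions chosen for illustration.

```python
def capped_generate_payload(prompt, model="mistral", max_output_tokens=150):
    """Build an Ollama /api/generate payload with a hard cap on output length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
        "options": {"num_predict": max_output_tokens},  # upper limit on generated tokens
    }
```

Keep the cap generous enough for the expected answer: a limit that is too tight can cut the reply off in the middle of a JSON object, which then fails to parse.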
Some strategies I use to control the model’s output:
- Specify Maximum Word/Token Limit: In the prompt, ask the model to “respond with a maximum of X words/tokens.”
- Specify a Format: Request a specific output format such as JSON or a bulleted list. This discourages the model from generating text outside the format.
- Request Only Necessary Information: Give clear instructions to the model like “provide only information X, Y, and Z.” Ask it to avoid additional explanations.
- Post-Process Output: If the model still generates unnecessary details,