This time we’ll look at several practical additions to the Claude API: prompt caching, citations and search-result content blocks, and finer-grained control over tool calling. None of them is flashy, but each can noticeably affect cost, latency, and controllability. Below, we break them down from the angle of how you can actually use them.
Prompt caching: store repeated system prompts in advance
If your Claude API workload includes a large amount of repeated prompt content (for example, unified customer-service scripting rules, fixed extraction formats, or long business context), prompt caching is a great fit. According to the official documentation, reusing cached prompt prefixes can cut latency by up to about 85% and costs by up to about 90%, which is especially friendly for batch tasks.
In practice, split the long-lived, unchanging parts into a cacheable prefix and put the user input that changes on each request in the subsequent messages. That way the Claude API reuses the cached prefix instead of billing you the full input-token price for the same long prompt on every call; cache reads are charged at a fraction of the normal input rate.
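A minimal sketch of that split, following the Messages API's `cache_control` convention. The rules text and model id are placeholders (in real use the cached prefix must exceed the model's minimum cacheable length, 1024 tokens for most models), and the request parameters are built as a plain dict so the actual SDK call stays separate:

```python
# Placeholder for the long, stable system prompt you want cached.
# In production this would be your real scripting rules / business context.
LONG_RULES = "You are a customer-service assistant. " + "Follow the style guide. " * 200


def build_request(user_input: str) -> dict:
    """Build Messages API parameters with the stable prefix marked cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",  # assumed model id; swap in your own
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_RULES,
                # Everything up to and including this block is cached;
                # the changing user message below is not.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_input}],
    }


params = build_request("How do I reset my password?")
# With the official SDK you would then call:
#   client = anthropic.Anthropic()
#   response = client.messages.create(**params)
# and inspect response.usage.cache_creation_input_tokens /
# response.usage.cache_read_input_tokens to confirm the cache is being hit.
```

Keeping the cacheable prefix byte-for-byte identical across requests is what makes the cache hit; even a small edit to the rules text forces a fresh cache write.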
Citations and search-result content blocks: making RAG easier to do right
The Claude API provides a citations capability that attributes the key statements in an answer to their sources. For knowledge-base Q&A or retrieval-augmented generation, citations reduce the awkwardness of answers that sound right but have no evidence, and they make it easier to display sources in the frontend for users to verify.
In addition, search-result content blocks are now an official capability, giving you a citable structure for handing external retrieval results to the model. You can have the Claude API attach citation markers when it summarizes, and then decide on the application side whether to enforce a rule like “no citation, no conclusion.”
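A sketch of packaging your own retrieval results as `search_result` content blocks with citations enabled. The field names follow the documented block shape; the URL, title, and snippet text below are invented for illustration:

```python
def make_search_result(source: str, title: str, snippet: str) -> dict:
    """Wrap one retrieved document in a citable search_result block."""
    return {
        "type": "search_result",
        "source": source,  # where the result came from, e.g. a URL or doc id
        "title": title,
        "content": [{"type": "text", "text": snippet}],
        "citations": {"enabled": True},  # let the model cite this block
    }


# One user turn mixing retrieval results with the actual question.
user_turn = {
    "role": "user",
    "content": [
        make_search_result(
            "https://example.com/kb/refunds",  # hypothetical knowledge-base page
            "Refund policy",
            "Refunds are available within 30 days of purchase.",
        ),
        {"type": "text", "text": "What is the refund window?"},
    ],
}
# Passed via client.messages.create(messages=[user_turn], ...), the reply's
# text blocks can carry a `citations` list pointing back at these results,
# which your application can require before surfacing any conclusion.
```

Because the blocks come from your own retriever rather than a built-in search tool, this pattern works with any RAG pipeline: the model only ever cites material you explicitly handed it.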


