<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Aravind's Blog]]></title><description><![CDATA[This blog is a collection of notes from the trenches — building resilient distributed systems and exploring the practical realities of AI in production.

I write about failure modes, trade-offs, and the responsibility engineers carry when designing systems that make decisions at scale.

Expect grounded reflections, architectural patterns, and experiments with AI — without hype, and without shortcuts.

In short: engineering judgment in an age of automation.]]></description><link>https://aravindv.pro</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 08:53:51 GMT</lastBuildDate><atom:link href="https://aravindv.pro/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How I Lost a Credit Card Trying to Get One]]></title><description><![CDATA[There’s a certain kind of confidence that comes with upgrading a credit card.
It feels like a small step up — better limits, better perks, better… access to airport lounges you may or may not use.
So ]]></description><link>https://aravindv.pro/how-i-lost-a-credit-card-trying-to-get-one</link><guid isPermaLink="true">https://aravindv.pro/how-i-lost-a-credit-card-trying-to-get-one</guid><category><![CDATA[banking]]></category><category><![CDATA[credit card]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Thu, 09 Apr 2026 18:00:55 GMT</pubDate><content:encoded><![CDATA[<p>There’s a certain kind of confidence that comes with upgrading a credit card.</p>
<p>It feels like a small step up — better limits, better perks, better… access to airport lounges you may or may not use.</p>
<p>So I did what seemed reasonable.</p>
<p>I tried to get a better one.</p>
<hr />
<p>It started well.</p>
<p>There were conversations.<br />There was movement.<br />And, more importantly, there was <strong>assurance</strong>.</p>
<p>“It’s in progress, sir.”<br />“Approval is done.”<br />“We’ll initiate.”</p>
<p>At that point, I wasn’t just hopeful.</p>
<p>I was planning.</p>
<hr />
<p>Then came the process.</p>
<p>Not a process.<br />More like a collection of philosophical ideas loosely connected by follow-ups.</p>
<p>Every few days:</p>
<p>“Any update?”</p>
<p>And every time, there was one.</p>
<ul>
<li><p>“We’ll re-initiate.”</p>
</li>
<li><p>“It’s a long process.”</p>
</li>
<li><p>“Without invite it’s difficult.”</p>
</li>
<li><p>“But we’ve got approval.”</p>
</li>
</ul>
<p>This was my first introduction to the concept of:</p>
<blockquote>
<p>Schrödinger’s Credit Card<br />Approved and not initiated at the same time.</p>
</blockquote>
<hr />
<p>Documentation added depth to the experience.</p>
<p>My PAN card, for instance, turned out to be quite selective.</p>
<ul>
<li><p>Valid for some things</p>
</li>
<li><p>Not valid for others</p>
</li>
<li><p>Valid if signed</p>
</li>
<li><p>Not valid if photocopied</p>
</li>
<li><p>Possibly valid depending on mood</p>
</li>
</ul>
<p>I began to suspect the issue wasn’t the document.</p>
<p>It was me not understanding its personality.</p>
<hr />
<p>But the real turning point was subtle.</p>
<p>At some point, I stopped being excited about the new card<br />and started getting tired of the process.</p>
<p>And when that happens, you start simplifying.</p>
<hr />
<p>“Let’s not proceed with the card.”</p>
<p>A simple message. No drama.</p>
<hr />
<p>And then, in a moment of quiet clarity (or frustration, depending on how you look at it), I did something unexpected.</p>
<p>I closed my existing credit card.</p>
<hr />
<p>So just to recap:</p>
<p>I started with one credit card.<br />Tried to get a better one.<br />And ended up with… fewer.</p>
<hr />
<p>If you think about it, it’s quite efficient.</p>
<p>No upgrade.<br />No confusion.<br />Just subtraction.</p>
<hr />
<p>The interesting part is not the outcome.</p>
<p>It’s what the process tells you.</p>
<p>Because getting something should feel like movement.</p>
<p>Not like maintenance.</p>
<hr />
<p>Somewhere between “approval” and “re-initiation,”<br />I realised something simple:</p>
<blockquote>
<p>If something this small can’t be taken to closure,<br />it’s worth thinking about how larger things would be handled.</p>
</blockquote>
<hr />
<p>The new card may still happen someday.</p>
<p>Maybe it’s still in progress.</p>
<p>But for now, I have one less card,<br />and one more story.</p>
]]></content:encoded></item><item><title><![CDATA[What Zero Doesn’t Tell You]]></title><description><![CDATA[The Comfort of a Number
Zero is comforting.
It suggests completion. Certainty. A clean state. In systems that are otherwise complex and difficult to reason about, a number like zero offers something r]]></description><link>https://aravindv.pro/what-zero-doesn-t-tell-you</link><guid isPermaLink="true">https://aravindv.pro/what-zero-doesn-t-tell-you</guid><category><![CDATA[Security]]></category><category><![CDATA[securityawareness]]></category><category><![CDATA[measurement]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Tue, 31 Mar 2026 17:20:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/696bf18d-6998-4fc5-91b4-e23d7fb1012a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Comfort of a Number</h2>
<p>Zero is comforting.</p>
<p>It suggests completion. Certainty. A clean state. In systems that are otherwise complex and difficult to reason about, a number like zero offers something rare — closure.</p>
<p>And that is precisely why it can be misleading.</p>
<hr />
<h2>A Mismatch</h2>
<p>Metrics usually begin as signals. They help us observe a system, compare states, and make decisions with some degree of objectivity. Over time, however, they tend to become targets. Once that happens, the relationship between the metric and the underlying reality starts to weaken.</p>
<p>The number remains. The meaning begins to drift.</p>
<hr />
<p>I ran into this recently in a system where “zero vulnerabilities” had become an important goal.</p>
<p>Two base images were being evaluated.</p>
<p>One passed cleanly — no reported findings. The other did not.</p>
<p>At first glance, the decision seemed straightforward. If the objective is zero, and one option achieves it, the system appears to have made the choice for you.</p>
<p>But something about that conclusion did not sit right.</p>
<p>Not because the number was incorrect, but because it conflicted with an expectation I had about the systems themselves.</p>
<p>One of the images was <em>distroless</em> — intentionally minimal, designed to include only what is required to run the application and little else. The other included a broader runtime environment, with more built-in capabilities available inside the container.</p>
<p>Instinctively, the distroless image felt more constrained. There was simply less present that could be used if something went wrong.</p>
<p>And yet, the scan results suggested the opposite.</p>
<p>The distroless image reported more findings. The other appeared clean.</p>
<p>At that point, the question was no longer which image to choose.</p>
<p>It was: what exactly were we measuring?</p>
<hr />
<h2>What We Were Measuring</h2>
<p>Looking closer did not reveal an issue in how the system behaved. It revealed a limitation in how the system was being observed.</p>
<p>What we had was not a direct measure of risk, but a measure of what the scanning process could identify, based on its data sources and reporting model.</p>
<p>The number was accurate within that context.</p>
<p>But the context itself was narrower than the system it was being used to represent.</p>
<hr />
<p>The situation became clearer when we looked at the same images through different scanners.</p>
<p>The results were not identical.</p>
<p>The counts shifted. Some findings appeared, others disappeared.</p>
<p>Nothing about the underlying system had changed.</p>
<p>Only the lens had.</p>
<p>Which meant the number we were optimizing for was not just a function of the system — but also of how it was being observed.</p>
<hr />
<h2>When Metrics Become Targets</h2>
<p>There is a broader pattern here.</p>
<p>It is often described through <strong>Goodhart’s Law:</strong></p>
<blockquote>
<p>When a metric becomes the goal, the system begins to optimize the metric rather than the outcome it was intended to capture.</p>
</blockquote>
<p>This is not unique to security. It is a general property of optimization systems.</p>
<p>In reinforcement learning, an agent maximizes a reward signal. If the reward function is well-designed, this leads to the desired behavior. If it is not, the agent finds ways to maximize the signal while drifting away from the original intent.</p>
<blockquote>
<p>A <a href="https://openai.com/index/faulty-reward-functions/">well-known example</a> is a simulated boat racing environment from OpenAI, where the agent learned to loop in place collecting reward points instead of completing the race. The system did not fail. It optimized exactly what it was given.</p>
</blockquote>
<p>The behavior was correct with respect to the metric — and incorrect with respect to the outcome.</p>
<p>The system does not fail.</p>
<p>It does exactly what it was asked to do.</p>
<p>Just not what was meant.</p>
<hr />
<p>The same dynamic appears in organizational systems.</p>
<p>If vulnerability count becomes the primary measure of security, teams will optimize for reducing that number.</p>
<p>Over time, the system adapts around the metric.</p>
<p>The number improves.</p>
<p>The underlying risk may not.</p>
<hr />
<p>There is a more subtle implication here.</p>
<p>If the metric had been applied strictly — zero findings as a hard requirement — the distroless image would have been rejected outright.</p>
<p>Not because it exposed more capability.<br />Not because it was more exploitable.</p>
<p>But because <em><strong>it surfaced more findings</strong></em> in a particular scanning context.</p>
<p>The system would have optimized for the metric.</p>
<p>And in doing so, potentially selected an option that was less constrained at runtime.</p>
<p>The outcome would still satisfy the policy.</p>
<p>But not necessarily the intent.</p>
<hr />
<h2>A Different Way to Look at the Same System: A Three-Pillar Framework</h2>
<p>CVE count, in this sense, behaves like a proxy.</p>
<p>Useful, but incomplete.</p>
<p>When treated as the objective, it begins to exhibit the same characteristics as a poorly designed reward function — easy to optimize, but not always aligned with what we actually care about.</p>
<hr />
<p>The shift in thinking, when it happened, was not dramatic.</p>
<p>We stopped asking which option had fewer reported vulnerabilities, and started asking what risk we were actually trying to reduce.</p>
<hr />
<p>That led to a different way of evaluating the same system — not as a number, but along three dimensions.</p>
<p>The <strong>first pillar</strong> was about <em><strong>fixability</strong></em>.</p>
<p><strong>Were there vulnerabilities with available fixes that had not yet been applied?</strong></p>
<p>The <strong>second pillar</strong> was about <em><strong>exploitability</strong></em>.</p>
<p><strong>Among the remaining issues, how likely are they to be used in practice?</strong></p>
<p>The <strong>third pillar</strong> was about <em><strong>exposure at runtime</strong></em>.</p>
<p><strong>If something were to get through, what capabilities would be available inside the system?</strong></p>
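<p>The three questions above can be turned into a small evaluation rubric. The following is an illustrative sketch only; the class, fields, and ordering are hypothetical, not tooling from this system:</p>
<pre><code class="language-java">// Hypothetical sketch: comparing two images along the three pillars
// instead of by raw finding count. All names are illustrative.
public class ImageRiskProfile {

    final String name;
    final int unappliedFixes;          // Pillar 1: findings with an available fix
    final double exploitLikelihood;    // Pillar 2: e.g. an EPSS-style score in [0, 1]
    final int runtimeCapabilities;     // Pillar 3: shells, package managers, etc.

    ImageRiskProfile(String name, int unappliedFixes,
                     double exploitLikelihood, int runtimeCapabilities) {
        this.name = name;
        this.unappliedFixes = unappliedFixes;
        this.exploitLikelihood = exploitLikelihood;
        this.runtimeCapabilities = runtimeCapabilities;
    }

    // Prefer fewer unapplied fixes, then lower exploit likelihood,
    // then less capability exposed at runtime.
    static ImageRiskProfile prefer(ImageRiskProfile a, ImageRiskProfile b) {
        if (a.unappliedFixes != b.unappliedFixes)
            return a.unappliedFixes &lt; b.unappliedFixes ? a : b;
        if (Double.compare(a.exploitLikelihood, b.exploitLikelihood) != 0)
            return a.exploitLikelihood &lt; b.exploitLikelihood ? a : b;
        return a.runtimeCapabilities &lt;= b.runtimeCapabilities ? a : b;
    }
}
</code></pre>
<p>In the scenario described here, both images tie on the first two pillars, so the decision falls to runtime exposure, where the distroless image wins despite reporting more findings.</p>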
<hr />
<p>These pillars, framed as questions, did not contradict the metric.</p>
<p>They extended it.</p>
<p>And they changed the answer.</p>
<hr />
<p>In this case,</p>
<p>Pillar 1: Both images had reached a point where there were no immediate fixes to apply.</p>
<p>Pillar 2: Both showed low likelihood of exploitation based on available data.</p>
<p>Pillar 3: One image exposed more capability at runtime (and it was not the distroless one).</p>
<p>The distinction lay in what each image allowed once running.</p>
<p>The distroless image constrained it.</p>
<hr />
<p>The number had pointed in one direction.</p>
<p>The system, when examined through a different lens, pointed in another.</p>
<hr />
<h2>What Changes When You Look Differently</h2>
<p>This way of thinking is not new.</p>
<p>It is reflected, in different forms, in existing security guidance.</p>
<p><strong>NIST SP 800-190</strong> treats vulnerability management as a continuous, risk-based process rather than a binary state.</p>
<p>The <strong>CIS Docker Benchmark</strong> emphasizes reducing attack surface — removing unnecessary components, limiting capabilities, and constraining what is available at runtime.</p>
<p>The principles are well understood.</p>
<p>What varies is how they are applied in practice.</p>
<p>This alignment is not accidental.</p>
<p>Most mature security frameworks do not define security as the <em>absence of findings,</em> but as the <strong>presence of effective controls and risk management.</strong></p>
<p>Standards like SOC 2 and ISO 27001 focus on <strong>how vulnerabilities are identified, assessed, and managed over time</strong> — not on achieving a static count.</p>
<p>Even stricter environments such as FedRAMP operate on <strong>continuous monitoring and risk acceptance</strong>, rather than assuming that “zero findings” is a meaningful or achievable steady state.</p>
<p>The emphasis, consistently, is on <em><strong>managing</strong></em> risk — not eliminating numbers.</p>
<hr />
<h2>Metrics, and What They Hide</h2>
<p>Metrics are abstractions.</p>
<p>They compress a complex system into something that can be observed, compared, and optimized.</p>
<p>That compression is valuable.</p>
<p>But it also hides detail.</p>
<hr />
<p>The goal, then, is not to eliminate metrics.</p>
<p>They remain essential.</p>
<p>But they cannot be the decision.</p>
<hr />
<p>They are a starting point.</p>
<p>The rest requires judgment.</p>
<hr />
<p>Because in systems of any real complexity, the cleanest number is not always the clearest signal.</p>
<p>And sometimes, zero is not the end of the story.</p>
<p>It is simply a reflection of what we chose to see — and what we didn’t.</p>
]]></content:encoded></item><item><title><![CDATA[API Maturity in the Age of AI]]></title><description><![CDATA[The rise of AI agents, tool ecosystems, and Model Context Protocol (MCP) integrations is changing how software interacts with APIs.
For years, APIs were designed primarily for human developers.Develop]]></description><link>https://aravindv.pro/api-maturity-in-the-age-of-ai</link><guid isPermaLink="true">https://aravindv.pro/api-maturity-in-the-age-of-ai</guid><category><![CDATA[api]]></category><category><![CDATA[mcp]]></category><category><![CDATA[REST API]]></category><category><![CDATA[Restful API Modeling Language]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Thu, 12 Mar 2026 04:38:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/9197fa0c-5ba5-4ad4-a0a9-b4eb54498ed1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The rise of AI agents, tool ecosystems, and Model Context Protocol (MCP) integrations is changing how software interacts with APIs.</p>
<p>For years, APIs were designed primarily for human developers.<br />Developers read documentation, understood workflows, and wrote integration code.</p>
<p>Increasingly, however, APIs are being consumed by machines.</p>
<p>Agents invoke tools.<br />LLMs orchestrate workflows.<br />MCP servers expose capabilities dynamically.</p>
<p>In this world, APIs are no longer just interfaces for developers.<br /><strong>They are interfaces that machines must be able to understand and navigate.</strong></p>
<p>Which raises an important question:</p>
<blockquote>
<p>How easy is it for a machine to understand and navigate your API?</p>
</blockquote>
<p>Interfaces that are <strong>structured, predictable, and semantically meaningful</strong> are easier for both humans and machines to understand.</p>
<p>Almost two decades ago, <strong>Leonard Richardson</strong> proposed a framework that captures this idea remarkably well.</p>
<p>It is called the <strong>Richardson Maturity Model (RMM)</strong>.</p>
<hr />
<h2>What this article is about...</h2>
<p>This article explores:</p>
<ul>
<li><p>Why API structure matters even more in the age of <strong>AI agents and MCP</strong></p>
</li>
<li><p>What the <strong>Richardson Maturity Model (RMM)</strong> is</p>
</li>
<li><p>How APIs evolve across the <strong>four maturity levels</strong></p>
</li>
<li><p>Why most real-world APIs stop at <strong>Level 2</strong></p>
</li>
<li><p>How API maturity affects <strong>machine-driven integrations</strong></p>
</li>
</ul>
<hr />
<h1>What is the Richardson Maturity Model?</h1>
<p>The <strong>Richardson Maturity Model</strong> describes how effectively an API uses the features of the web.</p>
<p>At its core, the model asks a simple question:</p>
<blockquote>
<p><em>Are we using HTTP as a true application protocol, or merely as a transport mechanism?</em></p>
</blockquote>
<p>The more an API leverages the semantics of the web — <strong>resources, HTTP verbs, status codes, and hypermedia</strong> — the more mature it is considered.</p>
<p>The model defines a progression of maturity levels.</p>
<hr />
<h1>RMM Levels at a Glance</h1>
<table>
<thead>
<tr>
<th>Level</th>
<th>Core Idea</th>
<th>API Style</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Level 0</strong></td>
<td>HTTP as transport</td>
<td>Single endpoint, payload determines operation</td>
</tr>
<tr>
<td><strong>Level 1</strong></td>
<td>Resources</td>
<td>Multiple URIs but operations encoded in the path</td>
</tr>
<tr>
<td><strong>Level 2</strong></td>
<td>HTTP semantics</td>
<td>Proper use of verbs and status codes</td>
</tr>
<tr>
<td><strong>Level 3</strong></td>
<td>Hypermedia</td>
<td>APIs describe possible next actions</td>
</tr>
</tbody></table>
<hr />
<blockquote>
<p><strong>Engineering Insight</strong></p>
<p>Interfaces that are structured, predictable, and semantically meaningful are easier for both humans and machines to understand.</p>
</blockquote>
<hr />
<h1>Example: Coffee Ordering API</h1>
<p>To demonstrate the model, imagine we are building an API for a <strong>coffee ordering application</strong>.</p>
<p>The system supports three operations:</p>
<ol>
<li><p>Fetch the menu</p>
</li>
<li><p>Place an order</p>
</li>
<li><p>Check the order status</p>
</li>
</ol>
<hr />
<h2>Level 0 — One Service to Rule Them All</h2>
<p>Level 0 represents the <strong>lowest level of maturity</strong>.</p>
<p>HTTP is used purely as a <strong>transport layer</strong>, and the payload determines the operation.</p>
<h3>Endpoint</h3>
<pre><code class="language-plaintext">POST /coffee
</code></pre>
<h3>Example Request — Fetch Menu</h3>
<pre><code class="language-plaintext">{
  "operation": "getMenu"
}
</code></pre>
<h3>Example Request — Order Coffee</h3>
<pre><code class="language-plaintext">{
  "operation": "orderCoffee",
  "coffeeType": "Latte",
  "size": "Large"
}
</code></pre>
<h3>Example Request — Order Status</h3>
<pre><code class="language-plaintext">{
  "operation": "getOrderStatus",
  "orderId": 123
}
</code></pre>
<hr />
<h2>Level 1 — Resources</h2>
<p>Level 1 introduces <strong>multiple URIs representing resources</strong>, but still does not fully leverage HTTP semantics.</p>
<p>Example endpoints:</p>
<pre><code class="language-plaintext">GET /getMenu
POST /orderCoffee
GET /getOrderStatus?orderId=123
</code></pre>
<h3>Example Response</h3>
<pre><code class="language-plaintext">{
  "orderId": 123,
  "status": "PREPARING"
}
</code></pre>
<hr />
<h2>Level 2 — HTTP Verbs</h2>
<p>Level 2 introduces proper use of <strong>HTTP verbs and status codes</strong>.</p>
<p>This is the level most modern APIs operate at.</p>
<h3>Fetch Menu</h3>
<pre><code class="language-plaintext">GET /menu
</code></pre>
<h3>Response</h3>
<pre><code class="language-plaintext">{
  "items": [
    {
      "id": 1,
      "name": "Latte",
      "sizes": ["Small","Medium","Large"]
    }
  ]
}
</code></pre>
<hr />
<h3>Create Order</h3>
<pre><code class="language-plaintext">POST /orders
</code></pre>
<h3>Request</h3>
<pre><code class="language-plaintext">{
  "itemId": 1,
  "size": "Large"
}
</code></pre>
<h3>Response</h3>
<pre><code class="language-plaintext">{
  "orderId": 123,
  "status": "PREPARING"
}
</code></pre>
<hr />
<h2>Level 3 — Hypermedia (HATEOAS)</h2>
<p>Level 3 introduces <strong>Hypermedia as the Engine of Application State</strong>.</p>
<p>Responses include links describing possible next actions.</p>
<h3>Example Response</h3>
<pre><code class="language-plaintext">{
  "orderId": 123,
  "status": "PREPARING",
  "links": {
    "self": "/orders/123",
    "cancel": "/orders/123/cancel",
    "status": "/orders/123"
  }
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/f4e93c4e-ea01-4808-9a6b-0aa6de3233e5.png" alt="" style="display:block;margin:0 auto" />
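<p>To see what hypermedia buys a machine consumer, consider a client that resolves its next action from the response itself rather than from hard-coded paths. A minimal sketch, assuming the <code>links</code> structure shown above; the class name is hypothetical:</p>
<pre><code class="language-java">import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: navigate via the "links" section of a Level 3
// response instead of constructing URIs like "/orders/123/cancel".
public class HypermediaClient {

    // Resolve an action by name from the parsed "links" map.
    // An absent key means the action is not currently available.
    static Optional&lt;String&gt; resolve(Map&lt;String, String&gt; links, String action) {
        return Optional.ofNullable(links.get(action));
    }
}
</code></pre>
<p>An agent asking <code>resolve(links, "cancel")</code> either receives a URI to follow or learns that cancellation is not possible in the current state. The API, not the client, encodes what can happen next.</p>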

<h1>Why Most APIs Stop at Level 2</h1>
<p>Although Level 3 represents the highest level of maturity, <strong>most production APIs stop at Level 2</strong>.</p>
<p>Several practical reasons explain this:</p>
<p><strong>Client complexity</strong></p>
<p>Hypermedia-driven APIs require clients to dynamically interpret links.</p>
<p><strong>Tooling limitations</strong></p>
<p>Most tooling (OpenAPI, SDK generators, API gateways) assumes <strong>explicit endpoints</strong>, not hypermedia navigation.</p>
<p><strong>Frontend expectations</strong></p>
<p>Frontend systems prefer predictable contracts rather than dynamic discovery.</p>
<p>Because of these factors, <strong>Level 2 has effectively become the practical standard</strong>.</p>
<hr />
<blockquote>
<p><strong>AI &amp; MCP Perspective</strong></p>
<p>AI agents interact with APIs very differently from human developers.</p>
<p>Instead of reading documentation, they rely on clear structure,</p>
<p>predictable semantics, and discoverable workflows.</p>
<p>APIs that follow good REST principles naturally provide these signals.</p>
</blockquote>
<hr />
<h2>RMM in the Age of AI and MCP</h2>
<p>The rise of AI agents, tool ecosystems, and Model Context Protocol (MCP) integrations makes the ideas behind the Richardson Maturity Model relevant again.</p>
<p>AI systems interact with APIs very differently from human developers.</p>
<p>They don’t read documentation or follow predefined workflows.<br />They rely on signals exposed by the API itself.</p>
<p>These signals include:</p>
<ul>
<li><p>Clear resources</p>
</li>
<li><p>Predictable operations</p>
</li>
<li><p>Consistent response structures</p>
</li>
<li><p>Discoverable workflows</p>
</li>
</ul>
<p>APIs that follow strong design principles naturally provide these signals.</p>
<hr />
<p><strong>At Level 0</strong>, APIs expose very little structure.<br />Interactions are opaque and require interpretation, making them difficult for machines to use reliably.</p>
<p><strong>By Level 2</strong>, APIs become significantly easier for automated systems to interact with.<br />Resources, operations, and outcomes are explicit and predictable.</p>
<p><strong>Level 3</strong> — with hypermedia-driven navigation — goes a step further.<br />It begins to resemble <strong>machine-discoverable workflows</strong>, where the API itself guides what can be done next.</p>
<p>This aligns closely with how AI agents explore tools and how MCP-style systems expose capabilities dynamically.</p>
<hr />
<p>What enables this is not just better use of HTTP, but clearer expression of the underlying system.</p>
<p>For example:</p>
<pre><code class="language-plaintext">GET /menu/{menuItem}/options
</code></pre>
<p>This does more than return data.</p>
<p>It expresses that a menu item has configurable options.</p>
<p>When extended further:</p>
<pre><code class="language-plaintext">GET /menu/{menuItem}/sizes  
GET /menu/{menuItem}/milk-options  
</code></pre>
<p>Different aspects of the system become independently addressable.</p>
<p>This makes the API easier to explore, reason about, and compose into workflows — especially for machines.</p>
<hr />
<p>API maturity, therefore, is not just about correctness.</p>
<p>It is about how easily another system can understand and navigate what is exposed.</p>
<p>In a world where APIs are increasingly consumed by AI agents, clarity of structure and intent becomes a fundamental requirement.</p>
<hr />
<h1>Summary</h1>
<p>The Richardson Maturity Model provides a useful lens for understanding API design.</p>
<ul>
<li><p><strong>Level 0 — One Service to Rule Them All</strong><br />A single endpoint controls everything, with behavior hidden in the payload.</p>
</li>
<li><p><strong>Level 1 — One Method to Rule Them All</strong><br />Resources appear, but intent is still implicit and constrained.</p>
</li>
<li><p><strong>Level 2 — Federalism</strong><br />Structure emerges. Responsibilities are distributed, and interactions become more predictable — <em>no longer one power controlling everything.</em></p>
</li>
<li><p><strong>Level 3 — Cooperative Federalism</strong><br />The system becomes navigable. It begins to guide clients through possible next actions — <em>almost as if the path is being shown, not guessed.</em></p>
</li>
</ul>
<hr />
<p>In practice, most APIs stop at Level 2 — and for good reason.</p>
<p>But the real value of the model is not in reaching Level 3.</p>
<p>It is in asking a more important question:</p>
<blockquote>
<p>How clearly does your API express the system it represents?</p>
</blockquote>
<p>At lower levels, interacting with an API feels like dealing with a powerful but opaque system — everything is possible, but nothing is obvious.</p>
<p>As maturity increases, structure emerges.<br />Intent becomes clearer.<br />Navigation becomes easier.</p>
<p>And at the highest level, the system begins to guide you — not unlike a well-charted world where the next step is always visible.</p>
<p>In a world where APIs are increasingly consumed by AI agents, not just developers, this clarity is no longer optional.</p>
<p>Because for machines, the API is not just an interface.</p>
<p>It is the only way the system reveals itself.</p>
]]></content:encoded></item><item><title><![CDATA[When Consumers Fail: Extending Selective Retry to Kafka]]></title><description><![CDATA[Designing a Kafka Error Pipeline
Kafka consumers fail for the same reason every distributed component fails: not all errors mean the same thing.
A network timeout, a malformed payload, and a downstrea]]></description><link>https://aravindv.pro/when-consumers-fail-extending-selective-retry-to-kafka</link><guid isPermaLink="true">https://aravindv.pro/when-consumers-fail-extending-selective-retry-to-kafka</guid><category><![CDATA[Resilience]]></category><category><![CDATA[kafka]]></category><category><![CDATA[error handling]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Wed, 04 Mar 2026 14:58:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/3a6ac75b-880d-41c8-b19f-0cb9ebda778c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Designing a Kafka Error Pipeline</h2>
<p>Kafka consumers fail for the same reason every distributed component fails: <strong>not all errors mean the same thing.</strong></p>
<p>A network timeout, a malformed payload, and a downstream outage are all failures — but treating them the same leads to the same bad outcomes: wasted retries, stalled partitions, and messages cycling through retry loops that were never going to succeed.</p>
<p>Most Kafka consumer setups do exactly that: they retry everything until a limit is reached and only then decide what to do with the message.</p>
<p>That works — until it doesn't.</p>
<p>In the previous post, <a href="https://aravindv.pro/not-all-failures-deserve-a-second-chance"><em>Not All Failures Deserve a Second Chance</em></a>, the focus was classification: a typed exception carrying an <code>ErrorType</code>, and a retry policy that understands the semantics of failure instead of guessing from exception classes.</p>
<p>That article covered the outbound side of a system — REST clients, Kafka producers, and service calls.</p>
<p>This post moves to the other side of the boundary: <strong>the Kafka consumer pipeline.</strong></p>
<p>The classification already exists.<br />The retry policy already exists.</p>
<p>What remains is shaping the consumer pipeline so it respects those decisions.</p>
<h2>Consumer Failure Pipeline</h2>
<p>Before looking at the code, it helps to picture the error flow at a system level.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/4776fcb6-d6c0-44de-857a-ea7b5379819b.svg" alt="" style="display:block;margin:0 auto" />

<p>The diagram shows how a consumer failure is classified and how that classification determines the routing path.</p>
<p>Two decisions drive the entire pipeline:</p>
<ol>
<li><p><strong>Is the failure retryable?</strong><br />This is decided once by the <code>SelectiveRetryPolicy</code>, which classifies the error according to its semantics (transient vs. permanent, etc.).</p>
</li>
<li><p><strong>If retries exhaust, where does the message go next?</strong><br />That routing decision is determined by configuration (for example: retry topic, dead-letter topic, parking lot, or discard).</p>
</li>
</ol>
<p>Separating classification from routing keeps the error-handling logic predictable: classification happens once, and subsequent routing behaviour follows from that classification and your configured policies.</p>
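<p>The two decisions can be sketched independently of Spring Kafka. The following is an illustrative model of that flow, not code from this system; all names are hypothetical:</p>
<pre><code class="language-java">// Hypothetical sketch of the pipeline's two decisions:
// classify once, then route from the classification and retry budget.
public class FailureRouter {

    enum ErrorType { TRANSIENT, PERMANENT }
    enum Route { RETRY, NON_RETRYABLE_RECOVERER, DEAD_LETTER }

    // Decision 1: classification happens exactly once per failure.
    static ErrorType classify(Throwable t) {
        // A malformed payload will never succeed on retry.
        return (t instanceof IllegalArgumentException)
                ? ErrorType.PERMANENT
                : ErrorType.TRANSIENT;
    }

    // Decision 2: routing follows from the classification and config.
    static Route route(ErrorType type, int attempt, int maxAttempts) {
        if (type == ErrorType.PERMANENT) {
            return Route.NON_RETRYABLE_RECOVERER;
        }
        return attempt &lt; maxAttempts ? Route.RETRY : Route.DEAD_LETTER;
    }
}
</code></pre>
<p>A permanent failure bypasses the retry loop entirely, while a transient one retries until the configured budget is spent and only then moves to the dead-letter path.</p>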
<h2>What DefaultErrorHandler Doesn't Do</h2>
<p>Spring Kafka's <code>DefaultErrorHandler</code> retries up to a configured limit and routes to a recoverer on exhaustion.</p>
<p>That behaviour is useful — but it applies the same retry loop regardless of <em>why</em> the record failed.</p>
<p>A <code>VALIDATION_ERROR</code> receives the same retry attempts as a transient <code>SERVICE_SERVER_ERROR</code>. Those retries accomplish nothing. The payload will not change between attempts. The only effect is that the partition is blocked while the consumer repeatedly processes something it already knows will fail.</p>
<p><code>DefaultErrorHandler</code> does not decide <em>whether</em> something should be retried.</p>
<p>It simply retries.</p>
<p>The solution is to place a handler in front of it that decides whether the record should enter that retry loop at all.</p>
<h2>CompositeKafkaErrorHandler</h2>
<p>When a Kafka consumer throws, Spring invokes <code>handleOne()</code> on the configured <code>CommonErrorHandler</code> — in most apps this is <code>DefaultErrorHandler</code>. <code>CompositeKafkaErrorHandler</code> wraps that handler and introduces one additional decision point before delegating.</p>
<p>It composes three things:</p>
<ul>
<li><p>a <code>SelectiveRetryPolicy</code> (from the previous article)</p>
</li>
<li><p>a recoverer for retryable errors whose retries are exhausted</p>
</li>
<li><p>a recoverer for non‑retryable errors</p>
</li>
</ul>
<p>The two recoverers represent different outcomes: a transient error that exhausts retries and a validation/semantic failure are both routed somewhere observable, but they should not reach that destination in the same way or at the same time in the flow.</p>
<pre><code class="language-java">public class CompositeKafkaErrorHandler&lt;T extends Throwable, E extends Enum&lt;E&gt;&gt;
        implements CommonErrorHandler {

    private final DefaultErrorHandler defaultErrorHandler;
    private final SelectiveRetryPolicy&lt;T, E&gt; retryPolicy;
    private final BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; exhaustedRecoverer;
    private final BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; nonRetryableRecoverer;

    public CompositeKafkaErrorHandler(
            SelectiveRetryPolicy&lt;T, E&gt; retryPolicy,
            BackOff backOff,
            BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; exhaustedRecoverer,
            BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; nonRetryableRecoverer) {
        this.retryPolicy = retryPolicy;
        this.exhaustedRecoverer = exhaustedRecoverer;
        this.nonRetryableRecoverer = nonRetryableRecoverer;
        this.defaultErrorHandler = new DefaultErrorHandler(
                (record, ex) -&gt; exhaustedRecoverer.accept(record, ex),
                backOff);
    }

    // ... delegate handleOne() to defaultErrorHandler after consulting retryPolicy ...
}
</code></pre>
<p>The generics mirror the retry policy: <code>&lt;T extends Throwable, E extends Enum&lt;E&gt;&gt;</code>.</p>
<p>Importantly, the handler is policy-driven: it doesn't depend on <code>ApplicationException</code> or an <code>ErrorType</code> enum directly. It only asks the <code>SelectiveRetryPolicy</code> whether a failure is retryable and routes accordingly, keeping the handler reusable across services with different exception hierarchies.</p>
<h2>Routing the Failure</h2>
<p><code>handleOne()</code> performs the routing.</p>
<pre><code class="language-java">@Override
public boolean handleOne(Exception thrownException, ConsumerRecord&lt;?, ?&gt; record,
        Consumer&lt;?, ?&gt; consumer, MessageListenerContainer container) {

    // Deserialization failures occur before business logic runs.
    if (thrownException.getCause() instanceof DeserializationException) {
        nonRetryableRecoverer.accept(record, thrownException);
        commitOffset(record, consumer);
        return true;
    }

    // Ask the retry policy first.
    if (isNonRetryableError(thrownException)) {
        nonRetryableRecoverer.accept(record, thrownException);
        commitOffset(record, consumer);
        return true;
    }

    // Retryable → delegate to DefaultErrorHandler
    return defaultErrorHandler.handleOne(thrownException, record, consumer, container);
}
</code></pre>
<p>The first branch handles deserialization failures.</p>
<p>These occur before business logic runs, meaning the typed exception hierarchy is not involved. <code>DeserializationException.getData()</code> contains the raw bytes of the message. Those bytes must be preserved and sent somewhere observable before the record is skipped.</p>
<p>Silently dropping malformed messages means you have no record that they ever arrived.</p>
<p>The second branch asks the retry policy whether the failure is non‑retryable. The policy evaluates the typed error classification first and falls back to exception‑class rules if necessary.</p>
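<p>Spring typically hands the error handler a wrapper exception (often a <code>ListenerExecutionFailedException</code>), so consulting the policy usually means walking the cause chain first. A minimal, self-contained sketch of that unwrapping (the class and method names here are illustrative, not the real handler's API):</p>
<pre><code class="language-java">// Illustrative sketch: walk the cause chain to find the first cause of the
// type we care about. Names are assumptions, not Spring Kafka API.
class CauseUnwrapper {

    static Throwable unwrap(Throwable thrown, Class targetType) {
        for (Throwable t = thrown; t != null; t = t.getCause()) {
            if (targetType.isInstance(t)) {
                return t; // first cause of the interesting type
            }
        }
        return thrown; // nothing matched: fall back to the wrapper itself
    }
}
</code></pre>
<p>The same idea underlies the deserialization branch above: the <code>DeserializationException</code> is found on the cause, not on the thrown exception itself.</p>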
<p>Non‑retryable failures skip the retry loop entirely.</p>
<p>The record is routed to the non‑retryable recoverer and the offset is committed at <code>offset + 1</code>. This tells Kafka the record has been handled and the consumer should move forward.</p>
<p>Because this path bypasses <code>DefaultErrorHandler</code>, the handler must also take responsibility for committing the offset itself.</p>
<p>Without that commit the consumer would stall on the same record indefinitely.</p>
<p>If neither branch matches, the exception is considered retryable and <code>DefaultErrorHandler</code> takes over using the configured <code>BackOff</code>.</p>
<p>When retries are exhausted, <code>DefaultErrorHandler</code> invokes the <code>exhaustedRecoverer</code> wired into its constructor.</p>
<h2>Recoverers Belong in Configuration</h2>
<p>The handler is infrastructure. It understands <em>how</em> to route failures but not <em>where</em> they should go.</p>
<p>Destination policy belongs in configuration.</p>
<p><code>CompositeKafkaErrorHandler</code> therefore accepts two <code>BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt;</code> functions:</p>
<pre><code class="language-java">BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; exhaustedRecoverer = (record, exception) -&gt; 
{
        if (retryProperties.isEnabled()) {
                errorPublisher.publishToRetry1m(record, exception);
        } else {
                ApplicationException appEx = exception instanceof ApplicationException ae ? ae: ApplicationException.of(ErrorType.SERVICE_SERVER_ERROR, "Retries exhausted: " + exception.getMessage());
                errorPublisher.publishToDlq(record.topic(), record.value(), appEx, record);
        }
};

BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; nonRetryableRecoverer = (record, exception) -&gt; 
{
        if (exception.getCause() instanceof DeserializationException) {
                errorPublisher.publishDeserializationFailureToDlq(record, exception);
        } else {
                errorPublisher.publishBatchRecordToDlq(record, exception);
        }
};

return new CompositeKafkaErrorHandler&lt;&gt;(kafkaConsumerRetryPolicy, backOff, exhaustedRecoverer, nonRetryableRecoverer);
</code></pre>
<p>The handler remains reusable across services.</p>
<p>Only the routing behaviour changes — which DLQ topic to use, whether retry topics are enabled, or how generic exceptions should be wrapped.</p>
<h2>The Retry Chain</h2>
<p>In‑process retries are designed for <strong>milliseconds of instability, not minutes of outage</strong>.</p>
<p>They handle short‑lived issues such as network jitter or brief service hiccups. They do not solve problems like dependency restarts, rate limits that reset on minute boundaries, or a database mid‑failover.</p>
<p>For those situations the message needs time.</p>
<p>When in‑process retries exhaust on a retryable error, the record is moved to a separate Kafka topic instead of going directly to the DLQ.</p>
<p>A dedicated consumer processes <code>source-topic-retry-1m</code>, waiting at least one minute before redelivery. If retries exhaust again, the message progresses to <code>source-topic-retry-5m</code>, and finally to the DLQ if all stages fail.</p>
<p>Non‑retryable failures bypass this chain entirely.</p>
<p>A validation error that occurs at <code>10:00:00</code> should be visible in the DLQ at <code>10:00:00</code> — not at <code>10:06:00</code> after exhausting retry stages it had no reason to enter.</p>
<p>The chain is controlled through configuration:</p>
<pre><code class="language-yaml">kafka:
  retry:
    enabled: ${KAFKA_RETRY_ENABLED:true}
    suffixes:
      first-stage: ${KAFKA_RETRY_1M_SUFFIX:retry-1m}
      second-stage: ${KAFKA_RETRY_5M_SUFFIX:retry-5m}
  error:
    dlq-suffix: ${KAFKA_DLQ_SUFFIX:-dlq}
</code></pre>
<p>Setting <code>kafka.retry.enabled=false</code> routes exhausted retryable errors directly to the DLQ in environments that do not provision retry topics.</p>
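<p>Given those suffixes, the stage and DLQ topic names can be derived mechanically from the source topic. A small sketch (the helper and its naming rules are assumptions that mirror the defaults above, not code from the service):</p>
<pre><code class="language-java">// Illustrative helper: derive stage and DLQ topic names from the configured suffixes.
class RetryTopicNames {

    static String stageTopic(String sourceTopic, String suffix) {
        // "source-topic" + "retry-1m" yields "source-topic-retry-1m"
        return sourceTopic + "-" + suffix;
    }

    static String dlqTopic(String sourceTopic, String dlqSuffix) {
        // the dlq-suffix default above ("-dlq") already carries its own hyphen
        if (dlqSuffix.startsWith("-")) {
            return sourceTopic + dlqSuffix;
        }
        return sourceTopic + "-" + dlqSuffix;
    }
}
</code></pre>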
<p>What is not optional is the DLQ itself.</p>
<p>A consumer that silently drops failed records has no observable error signal.</p>
<h2>One Classification, Two Contexts</h2>
<p>The <code>SelectiveRetryPolicy</code> driving <code>CompositeKafkaErrorHandler</code> is the same policy used for outbound retries.</p>
<p>The failure taxonomy is defined once.</p>
<p>Whether <code>VALIDATION_ERROR</code> should be retried is not a decision repeated in HTTP clients, Kafka consumers, and service calls. The behaviour follows directly from the <code>ErrorType</code> definition.</p>
<p>Add a new <code>ErrorType</code> variant with the correct semantics and it automatically receives the right behaviour everywhere — inbound and outbound.</p>
<p>That is what it means for retry logic to be readable and auditable.</p>
<p><em><strong>Not just in one place.</strong></em></p>
<p><em><strong>Across the system.</strong></em></p>
]]></content:encoded></item><item><title><![CDATA[Not All Failures Deserve a Second Chance ]]></title><description><![CDATA[Retry logic is one of those things that feels simple until it isn’t.
You add @Retryable, set maxAttempts = 3, and move on. Everything works — until the system happily retries a validation error three ]]></description><link>https://aravindv.pro/not-all-failures-deserve-a-second-chance</link><guid isPermaLink="true">https://aravindv.pro/not-all-failures-deserve-a-second-chance</guid><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Wed, 25 Feb 2026 06:06:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/67848cf0-68da-4094-87a5-ae5eb1a3b67b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Retry logic is one of those things that feels simple until it isn’t.</p>
<p>You add <code>@Retryable</code>, set <code>maxAttempts = 3</code>, and move on. Everything works — until the system happily retries a validation error three times, re-publishes a message that was never going to fit in Kafka, or quietly <em>fails to retry</em> something that actually was transient.</p>
<p>At that point, retry logic stops being resilience. It becomes noise with a delay.</p>
<p>This post describes a pattern we introduced to prevent that kind of misconfiguration from ever becoming a production problem: a <strong>selective retry policy</strong> that makes retry intent explicit and centrally defined.</p>
<hr />
<h2>What Spring Retry Already Gives You</h2>
<p>Spring Retry’s <code>SimpleRetryPolicy</code> allows exception-class-based configuration:</p>
<pre><code class="language-java">Map&lt;Class&lt;? extends Throwable&gt;, Boolean&gt; retryable = new HashMap&lt;&gt;();
retryable.put(SocketTimeoutException.class, true);
retryable.put(ValidationException.class, false);

new SimpleRetryPolicy(3, retryable, true);
</code></pre>
<p>This works — but only at the class level.</p>
<p>In practice:</p>
<ul>
<li><p>A single <code>ApplicationException</code> may represent multiple failure modes.</p>
</li>
<li><p>Configuration becomes scattered across services.</p>
</li>
<li><p>There is no protection against marking the same exception as both retryable and non-retryable.</p>
</li>
</ul>
<p>Retries are not about exception structure.</p>
<p>They are about <strong>failure semantics</strong>.</p>
<hr />
<h2>The Core Insight: Failures Have Meaning</h2>
<p>Broadly speaking:</p>
<p><strong>Transient failures</strong></p>
<p>Temporary issues — network blips, timeouts, infrastructure instability.<br /><em>Retries help.</em></p>
<p><strong>Permanent failures</strong><br />Invalid input, malformed payloads, authentication failures.<br /><em>Retries waste time.</em></p>
<p>Retry logic should <strong>understand</strong> that distinction.</p>
<hr />
<h2>Step One: A Typed Exception with an Error Enum</h2>
<p>Instead of proliferating exception classes, introduce a single typed exception:</p>
<pre><code class="language-java">public enum ErrorType {
    VALIDATION_ERROR,
    SERVICE_CLIENT_ERROR,   // 4xx
    SERVICE_SERVER_ERROR,   // 5xx
    MESSAGING_ERROR,
    AUTHENTICATION_ERROR,
    UNKNOWN_ERROR
}

public class ApplicationException extends RuntimeException {

    private final ErrorType type;

    public ApplicationException(ErrorType type, String message) {
        super(message);
        this.type = type;
    }

    public ApplicationException(ErrorType type, String message, Throwable cause) {
        super(message, cause);
        this.type = type;
    }

    public ErrorType getType() {
        return type;
    }
}
</code></pre>
<p>Now the exception carries semantic intent.</p>
<p>The retry policy no longer needs to infer behaviour from class names or messages.</p>
<hr />
<h2>Step Two: The Selective Retry Policy</h2>
<p><code>SelectiveRetryPolicy</code> extends <code>SimpleRetryPolicy</code> and introduces:</p>
<ul>
<li><p>retryable exception classes</p>
</li>
<li><p>non-retryable exception classes</p>
</li>
<li><p>retryable error types</p>
</li>
<li><p>non-retryable error types</p>
</li>
<li><p>strict validation against overlap</p>
</li>
</ul>
<h3>Overlap Protection</h3>
<p>If an error type or exception class is configured as both retryable and non-retryable, the application fails at startup.</p>
<p>For example, this configuration is rejected:</p>
<pre><code class="language-java">SelectiveRetryPolicy.&lt;ApplicationException, ErrorType&gt;builder()
    .retryOnErrorType(ErrorType.SERVICE_SERVER_ERROR)
    .doNotRetryOnErrorType(ErrorType.SERVICE_SERVER_ERROR) // Illegal
    .build();
</code></pre>
<p>Misconfiguration is treated as a configuration error — not a runtime surprise. The actual policy is <a href="https://raw.githubusercontent.com/aravind-vis/blogs/refs/heads/main/SelectiveRetryPolicy.java">here</a>.</p>
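<p>The overlap check itself is straightforward to enforce at build time. A hedged sketch (the linked policy is the authoritative version; this class and its names are illustrative):</p>
<pre><code class="language-java">import java.util.HashSet;
import java.util.Set;

// Illustrative fail-fast check: reject any rule configured as both retryable
// and non-retryable.
class OverlapGuard {

    static void requireDisjoint(Set retryable, Set nonRetryable, String what) {
        Set overlap = new HashSet(retryable);
        overlap.retainAll(nonRetryable);
        if (!overlap.isEmpty()) {
            // surfacing this at startup turns a runtime surprise into a startup failure
            throw new IllegalArgumentException(
                    what + " configured as both retryable and non-retryable: " + overlap);
        }
    }
}
</code></pre>
<p>Calling a check like this once per rule set inside <code>build()</code> is enough to make the illegal configuration above impossible to deploy.</p>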
<hr />
<h2>Precedence Model</h2>
<p>The evaluation order is explicit:</p>
<pre><code class="language-plaintext">1. Typed exception error type rules
   └── Explicitly non-retryable → do not retry
   └── Explicitly retryable     → retry
   └── Otherwise                → fall through

2. Exception class rules
   └── Explicitly non-retryable → do not retry
   └── Explicitly retryable     → retry
   └── Otherwise                → default

3. Default behaviour → retry
</code></pre>
<p>Non-retryable rules always take precedence.</p>
<p>If no rule matches, the policy defaults to retry.</p>
<p>This is intentional and can be flipped if a fail-safe approach is preferred.</p>
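<p>The precedence order can be captured in a compact decision function. The sketch below is a simplified, self-contained model (the enum, exception and rule sets are stand-ins for the real policy's configuration, and class matching here is by exact class rather than inheritance):</p>
<pre><code class="language-java">import java.util.Set;

enum SketchErrorType { VALIDATION_ERROR, SERVICE_SERVER_ERROR, UNKNOWN_ERROR }

class SketchException extends RuntimeException {
    final SketchErrorType type;
    SketchException(SketchErrorType type, String message) {
        super(message);
        this.type = type;
    }
}

class PrecedenceSketch {
    static final Set NON_RETRYABLE_TYPES = Set.of(SketchErrorType.VALIDATION_ERROR);
    static final Set RETRYABLE_TYPES = Set.of(SketchErrorType.SERVICE_SERVER_ERROR);
    static final Set NON_RETRYABLE_CLASSES = Set.of(IllegalStateException.class);
    static final Set RETRYABLE_CLASSES = Set.of(java.net.SocketTimeoutException.class);

    static boolean shouldRetry(Throwable t) {
        // 1. Typed error-type rules: non-retryable wins, then retryable, else fall through.
        if (t instanceof SketchException) {
            SketchErrorType type = ((SketchException) t).type;
            if (NON_RETRYABLE_TYPES.contains(type)) {
                return false;
            }
            if (RETRYABLE_TYPES.contains(type)) {
                return true;
            }
        }
        // 2. Exception-class rules, same non-retryable-first precedence.
        if (NON_RETRYABLE_CLASSES.contains(t.getClass())) {
            return false;
        }
        if (RETRYABLE_CLASSES.contains(t.getClass())) {
            return true;
        }
        // 3. Default behaviour: retry.
        return true;
    }
}
</code></pre>
<p>A typed <code>VALIDATION_ERROR</code> is rejected at step 1 and never reaches the class rules; an unclassified exception falls all the way through to the default.</p>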
<hr />
<h2>Usage Example 1 — REST Client</h2>
<p>Define the retry template once:</p>
<pre><code class="language-java">@Bean("serviceRetryTemplate")
public RetryTemplate serviceRetryTemplate() {

    var policy = SelectiveRetryPolicy.&lt;ApplicationException, ErrorType&gt;builder()
        .maxAttempts(3)
        .typedExceptionClass(ApplicationException.class)
        .errorTypeExtractor(ApplicationException::getType)
        .retryOnException(SocketTimeoutException.class)
        .doNotRetryOnErrorType(ErrorType.VALIDATION_ERROR)
        .doNotRetryOnErrorType(ErrorType.SERVICE_CLIENT_ERROR)
        .build();

    var backOff = new ExponentialBackOffPolicy();
    backOff.setInitialInterval(1000L);
    backOff.setMultiplier(2.0);
    backOff.setMaxInterval(10000L);

    RetryTemplate template = new RetryTemplate();
    template.setRetryPolicy(policy);
    template.setBackOffPolicy(backOff);
    return template;
}
</code></pre>
<p>Option A — Explicit RetryTemplate</p>
<pre><code class="language-java">@Service
public class TrackingServiceClient {

    private final RestTemplate restTemplate;
    private final RetryTemplate retryTemplate;

    public TrackingDTO getTracking(String shipmentId) {
        return retryTemplate.execute(ctx -&gt;
            restTemplate.getForObject("/tracking/{id}", TrackingDTO.class, shipmentId)
        );
    }
}
</code></pre>
<p>This makes retry behaviour visible at the call site.</p>
<hr />
<p>Option B — Declarative via @Retryable</p>
<pre><code class="language-java">@Bean("selectiveRetryInterceptor")
public RetryOperationsInterceptor selectiveRetryInterceptor() {
    RetryOperationsInterceptor interceptor = new RetryOperationsInterceptor();
    interceptor.setRetryOperations(serviceRetryTemplate());
    return interceptor;
}
</code></pre>
<pre><code class="language-java">@Service
public class TrackingServiceClient {

    private final RestTemplate restTemplate;

    @Retryable(interceptor = "selectiveRetryInterceptor")
    public TrackingDTO getTracking(String shipmentId) {
        return restTemplate.getForObject("/tracking/{id}", TrackingDTO.class, shipmentId);
    }
}
</code></pre>
<p>The retry contract remains centralised.</p>
<p>The annotation simply references it.</p>
<hr />
<h2>Usage Example 2 — Kafka Producer</h2>
<p>Kafka introduces its own mix of transient and permanent failures.</p>
<pre><code class="language-java">@Bean("kafkaRetryTemplate")
public RetryTemplate kafkaRetryTemplate() {

    var policy = SelectiveRetryPolicy.&lt;ApplicationException, ErrorType&gt;builder()
        .maxAttempts(3)
        .typedExceptionClass(ApplicationException.class)
        .errorTypeExtractor(ApplicationException::getType)
        .retryOnException(KafkaException.class)
        .retryOnException(NetworkException.class)
        .retryOnException(RetriableException.class)
        .doNotRetryOnException(SerializationException.class)
        .doNotRetryOnException(RecordTooLargeException.class)
        .doNotRetryOnException(InvalidTopicException.class)
        .doNotRetryOnErrorType(ErrorType.VALIDATION_ERROR)
        .build();

    var backOff = new ExponentialBackOffPolicy();
    backOff.setInitialInterval(2000L);
    backOff.setMultiplier(2.0);
    backOff.setMaxInterval(30000L);

    RetryTemplate template = new RetryTemplate();
    template.setRetryPolicy(policy);
    template.setBackOffPolicy(backOff);
    return template;
}
</code></pre>
<p>Apply it transparently using a proxy:</p>
<pre><code class="language-java">@Bean
public BeanNameAutoProxyCreator notificationProducerProxy() {
    BeanNameAutoProxyCreator creator = new BeanNameAutoProxyCreator();
    creator.setBeanNames("notificationsProducer");
    creator.setInterceptorNames("kafkaRetryInterceptor");
    return creator;
}
</code></pre>
<p>Producer remains clean:</p>
<pre><code class="language-java">@Service("notificationsProducer")
public class NotificationsProducer {

    private final KafkaTemplate&lt;String, Object&gt; kafkaTemplate;

    public void publish(String topic, Object message) {
        kafkaTemplate.send(topic, message);
    }
}
</code></pre>
<hr />
<h2>Why This Exists (Preventative Design)</h2>
<p>The motivation here was not a dramatic outage.</p>
<p>It was to prevent two quiet failure modes from creeping in over time:</p>
<ul>
<li><p><strong>Missed retries</strong> — transient failures wrapped in generic exceptions that never get retried.</p>
</li>
<li><p><strong>Over-retries</strong> — permanent failures retried repeatedly because the policy cannot distinguish them.</p>
</li>
</ul>
<p>Both come from retry logic that lacks semantic awareness.</p>
<p>This pattern removes that ambiguity.</p>
<hr />
<h2>The Trade-off</h2>
<p>This approach shifts responsibility to the throw site.</p>
<p>If an error is classified incorrectly, retry behaviour will also be incorrect.</p>
<p>That is deliberate.</p>
<p>Inferring retry intent from HTTP codes or exception messages tends to become brittle and distributed.</p>
<p>Encoding intent explicitly at the throw site keeps retry behaviour centralised and predictable.</p>
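<p>Concretely, encoding intent at the throw site means a REST client classifies the failure once, where the status code is known. A sketch (the enum is re-declared here to stay self-contained, and the thresholds are assumptions, not the article's code):</p>
<pre><code class="language-java">enum ClassifiedType { VALIDATION_ERROR, SERVICE_CLIENT_ERROR, SERVICE_SERVER_ERROR, UNKNOWN_ERROR }

// Illustrative: map an HTTP status to a semantic error type at the throw site,
// so every retry layer downstream agrees on the meaning of the failure.
class ThrowSiteClassifier {

    static ClassifiedType fromStatus(int status) {
        if (status >= 500) {
            return ClassifiedType.SERVICE_SERVER_ERROR; // transient: retry may help
        }
        if (status == 400 || status == 422) {
            return ClassifiedType.VALIDATION_ERROR;     // permanent: never retry
        }
        if (status >= 400) {
            return ClassifiedType.SERVICE_CLIENT_ERROR; // permanent 4xx
        }
        return ClassifiedType.UNKNOWN_ERROR;
    }
}
</code></pre>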
<hr />
<h2>Closing Thoughts</h2>
<p>Retries are not a technical detail.</p>
<p>They are part of system behaviour.</p>
<p>A retry policy that does not understand failure semantics is not resilience.</p>
<p>It is optimism with overhead.</p>
<p>Spring Retry provides the foundation.</p>
<p>Adding a semantic layer on top makes retry behaviour readable, auditable, and difficult to misconfigure.</p>
<p>In a multi-service codebase, that clarity is usually the difference between retry logic you trust and retry logic you tolerate.</p>
]]></content:encoded></item><item><title><![CDATA[Bhishma, Nuremberg, and When Humans Stop Thinking]]></title><description><![CDATA[Bhishma is one of the most tragic figures in the Mahabharata—not because he lacked moral clarity, but because he had too much of it and still failed to act.
In the epic, Bhishma is not a villain or a ]]></description><link>https://aravindv.pro/bhishma-nuremberg-and-when-humans-stop-thinking</link><guid isPermaLink="true">https://aravindv.pro/bhishma-nuremberg-and-when-humans-stop-thinking</guid><category><![CDATA[AI]]></category><category><![CDATA[responsible AI]]></category><category><![CDATA[ai-in-the-trenches]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Wed, 28 Jan 2026 06:02:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769579904862/9ef8fb57-9b58-43d8-8108-f9ae1cfff223.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Bhishma is one of the most tragic figures in the Mahabharata—not because he lacked moral clarity, but because he had too much of it and still failed to act.</p>
<p>In the epic, Bhishma is not a villain or a bystander. He is the most respected elder of the kingdom, a master of law and ethics—the person everyone agrees understands <em>dharma</em> better than anyone else in the room.</p>
<p>He knew what was right.<br />He knew what was wrong.<br />And yet, when it mattered most, he stood still.</p>
<p>Bhishma’s justification was simple and devastatingly familiar: <em>“I am bound by my duty.”</em><br />As a young man, he had taken an extraordinary vow—to renounce personal power and ambition forever, and to serve the throne unquestioningly, no matter who occupied it or how it was used.</p>
<p>That vow made him incorruptible.<br />It also made him immobile.</p>
<p>Duty to the vow he could not break.<br />Duty to the throne he would not abandon.<br />Duty to the system he had helped uphold, even as it decayed.</p>
<p>He was not evil.<br />He was obedient.</p>
<p>That distinction matters, because history has taught us—repeatedly—that some of the worst failures don’t come from malice. They come from people who stop thinking at precisely the moment when thinking is required.</p>
<h2>The Nuremberg Defense Was Not About Monsters</h2>
<p>After World War II, many of the accused at the Nuremberg Trials offered the same explanation: <em>“I was only following orders.”</em></p>
<p>The world rejected that defense—not because it denied coercion or hierarchy, but because it denied <strong>moral agency</strong>.</p>
<p>The ruling was clear:<br />Obedience does not erase responsibility.<br />Systems do not commit crimes. People do.</p>
<p>What’s striking is that the Nuremberg defense wasn’t about cruelty—it was about <strong>abdication</strong>. About handing over judgment to authority and calling it duty.</p>
<p>If this feels uncomfortably close to Bhishma, it should.</p>
<h2>AI Didn’t Create This Problem</h2>
<p>AI didn’t invent moral abdication.<br />It just made it scalable.</p>
<p>We’re entering a phase where AI systems don’t just suggest—they act.<br />They open pull requests.<br />They recommend architectural changes.<br />They flag risks, approve workflows, and increasingly, execute decisions.</p>
<p>And with that comes a new, dangerously convenient sentence:</p>
<blockquote>
<p>“The system decided.”</p>
</blockquote>
<p>This is not a technical statement.<br />It’s a moral one.</p>
<p>AI doesn’t remove responsibility—it creates <strong>plausible deniability at machine speed</strong>.</p>
<h2>What I’ve Seen in Practice</h2>
<p>In real engineering teams, the story isn’t fear or resistance. It’s subtler.</p>
<p>First, people struggle just to use AI effectively.<br />Then quality issues appear—because AI will generate <em>something</em> even when standards are vague.<br />Only later does the hardest realization land: <strong>ownership never left the human</strong>.</p>
<p>AI-generated code doesn’t fail differently.<br />It fails <em>faster</em>.</p>
<p>What changes is not the nature of responsibility, but its <strong>concentration</strong>. A single decision, poorly reviewed, can now propagate across a codebase at a speed we’ve never had to manage before.</p>
<p>Interestingly, AI also shifts where effort lives. Typing becomes cheap. Thinking becomes expensive. Code reviews move away from syntax toward architecture, correctness, and long-term consequences.</p>
<p>AI doesn’t replace judgment.<br />It exposes how much we relied on muscle memory instead of thought.</p>
<h2>The New Bhishma Risk</h2>
<p>The modern Bhishma won’t stand on a battlefield.</p>
<p>He’ll sit behind dashboards.<br />He’ll approve pipelines.<br />He’ll trust systems that are “working as designed.”</p>
<p>And like Bhishma, he won’t be wrong because he lacked knowledge—<br />but because he treated <strong>faithful execution as righteousness</strong>, and stopped questioning when questioning was required.</p>
<p>History shows us that the hardest failures don’t come from bad intent.<br />They come from people faithfully executing systems without enough questioning built in.</p>
<h2>Krishna’s Correction (Without the Theology)</h2>
<p>Krishna’s message to Arjuna was not “do your duty blindly.”</p>
<p>It was: <em>think, discern, and then act—without hiding behind outcomes.</em></p>
<p>Translated into modern systems thinking:</p>
<ul>
<li><p>Delegation is not abdication</p>
</li>
<li><p>Automation is not absolution</p>
</li>
<li><p>Non-attachment does not mean non-responsibility</p>
</li>
</ul>
<p>AI can assist judgment.<br />It cannot replace it.</p>
<p>The moment we let systems think <em>for</em> us instead of <em>with</em> us, we recreate the conditions that history has already judged harshly.</p>
<h2>A Quiet Test</h2>
<p>Here’s a simple question worth asking as AI becomes more autonomous:</p>
<blockquote>
<p>If this decision causes harm, who would say “I own this”?</p>
</blockquote>
<p>If the answer is vague, distributed, or points to “the system,” we already know how that story ends.</p>
<p>We’ve read it before.<br />In epics.<br />In court transcripts.<br />And now, quietly, in our codebases.</p>
<p>The real risk isn’t artificial intelligence.</p>
<p>It’s when humans stop thinking—and call it progress.</p>
]]></content:encoded></item><item><title><![CDATA[Resilience]]></title><description><![CDATA[What is Resilience?


Introduction
Resilience is one of the most important aspects of any system - more so with the cloud wherein we have a true Distributed System! And when we say a distributed syste]]></description><link>https://aravindv.pro/resilience</link><guid isPermaLink="true">https://aravindv.pro/resilience</guid><category><![CDATA[Resilience]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[fallacy]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Sun, 11 Jan 2026 16:22:01 GMT</pubDate><content:encoded><![CDATA[<h2>What is Resilience?</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628523048949/w0kfxu8Hk.png" alt="image.png" />

<h1>Introduction</h1>
<p><strong>Resilience</strong> is one of the most important aspects of any system - more so with the cloud wherein we have a true Distributed System! And when we say a distributed system, the resilience parameters are affected by the evergreen paper - <em><strong>Fallacies of Distributed Computing</strong></em>!</p>
<p>This blog is a two-part series -</p>
<ol>
<li><p>Fallacies of Distributed Computing</p>
</li>
<li><p>How are these addressed in a micro-service architecture?</p>
</li>
</ol>
<p>Before we get into how to go about building resilient systems, we need to understand the word - <strong>Resilience</strong>!</p>
<h1>What is Resilience?</h1>
<p>The following are the definitions that we come across when we look up the word <em>Resilience</em></p>
<h2>The English Language</h2>
<p><em>The capacity to recover quickly from difficulties; toughness</em></p>
<h2>Computer Architecture</h2>
<p><em>It is the ability to provide and maintain an acceptable level of</em> <em><strong>service</strong></em> <em>in the face of</em> <em><strong>faults</strong></em> <em>and</em> <em><strong>challenges</strong></em> <em>to normal operation</em></p>
<h2>Micro-services</h2>
<p><em>The ability of a microservice to function in spite of errors in the dependent services</em></p>
<p>In other words, failure of one service should not result in the failure of the system as a whole.</p>
<h1>Why is Resilience Important?</h1>
<p>Well, this can be summarised by this beautiful line -</p>
<blockquote>
<p>The oak fought the wind and was broken, the willow bent when it must and survived.<br /><strong>Jordan, R. (1994). The Fires of Heaven. 2nd ed. U.S: Tor Books</strong></p>
</blockquote>
<p>What this effectively means is that any system we build should not be hardened against failure; rather, we should be able to cope with the problems that arise and ensure that we stay up and running. If we are hardened against failure, we are like the oak tree! There could always be a wind strong enough to break us!</p>
<p>Most importantly, we can always try to reduce failures but we will never be able to eliminate them - in the simplest case, a failure could be a burnt drive and in the worst case it could be a burnt data center. We don't have control over either of these! Consequently, we should be able to handle these never-eliminated failures! This is where we have backup disks and Disaster Recovery (DR) centers (costing us the big bucks!)</p>
<p>When it comes to micro-services, the architecture by itself brings in a level of resilience! We have a highly decoupled system with cohesive services!</p>
<blockquote>
<p>Enough has been spoken about micro-services that I don't need to elaborate on its usefulness (and at times pain!)</p>
</blockquote>
<p>The most important aspect when it comes to Resilience - especially in Cloud and Distributed Architectures - is to ensure that the assertions in the paper <em>The Fallacies of Distributed Computing</em> by Peter Deutsch and others (as part of the erstwhile Sun Microsystems) are addressed!</p>
<h1>Fallacies of Distributed Computing</h1>
<p>These fallacies mostly arise due to the introduction of the network into the architecture or application. In order to build resilient applications, these fallacies must be addressed.</p>
<p>The fallacies are -</p>
<ol>
<li><p><strong>Network is reliable</strong> - A network is never reliable!</p>
</li>
<li><p><strong>Latency is zero</strong> - Latency can be reduced by optimising the network but can never be zero - a network packet takes time to move from one point to another, and this adds latency!</p>
</li>
<li><p><strong>Infinite bandwidth</strong> - There are physical and logical limitations on the bandwidth provided by the network.</p>
</li>
<li><p><strong>Network is secure</strong> - A chain is only as strong as its weakest link, or so they say! If networks were secure, we wouldn't have so many data leaks!</p>
</li>
<li><p><strong>Network topology is constant</strong> - We always need to assume that the network topology is going to change. A router along the way can go kaput, or a new firewall rule is introduced! Only change is constant!</p>
</li>
<li><p><strong>Transport cost is zero</strong> - Every network call has an associated cost!</p>
</li>
<li><p><strong>Network is homogeneous</strong> - In the real world and the digital world, we have our differences, and we need to ensure that we handle them!</p>
</li>
</ol>
<p>A micro-service architecture brings in complexities in this area as we now have a highly distributed system!</p>
<p>We'll look at how these fallacies are/can be addressed in micro-service architecture in the next blog!</p>
]]></content:encoded></item><item><title><![CDATA[Hello!]]></title><description><![CDATA[Hi 👋 I'm Aravind
I'm a Principal Engineer / Architect with 20+ years of experience designing and evolving large-scale, distributed platforms. Over the years, my work has naturally converged at the intersection of architecture, platform engineering, ...]]></description><link>https://aravindv.pro/hello</link><guid isPermaLink="true">https://aravindv.pro/hello</guid><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Fri, 26 Dec 2025 07:57:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/PyegXTiAWEY/upload/c8d9a6b1aeee419d20cc27e0695ce687.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-hi-im-aravind">Hi 👋 I'm Aravind</h2>
<p>I'm a Principal Engineer / Architect with <strong>20+ years of experience</strong> designing and evolving large-scale, distributed platforms. Over the years, my work has naturally converged at the intersection of <strong>architecture, platform engineering, AI governance, and organizational effectiveness</strong>.</p>
<p>I enjoy working on systems that have to <strong>hold up over time</strong> — systems that scale, stay operable in production, and remain trustworthy as complexity and organizational size grow. I care deeply about <strong>clear domain boundaries</strong>, thoughtful trade-offs, and technical decisions that still make sense years later.</p>
<hr />
<h2 id="heading-what-i-spend-most-of-my-time-on">🧭 What I spend most of my time on</h2>
<h3 id="heading-platform-amp-distributed-systems-architecture">Platform &amp; Distributed Systems Architecture</h3>
<p>I design and evolve high-throughput, fault-tolerant systems with a strong bias toward clarity of ownership, predictable behavior in production, and operational simplicity over clever abstractions.</p>
<h3 id="heading-domain-driven-design-at-scale">Domain-Driven Design at scale</h3>
<p>I’ve spent a lot of time helping teams define bounded contexts, align technical boundaries with organizational reality, and avoid the slow erosion that comes from accidental coupling between domains.</p>
<h3 id="heading-ai-platform-design-amp-governance">AI Platform Design &amp; Governance</h3>
<p>As AI systems started becoming part of core platforms, I focused on building <strong>governance models that don’t kill innovation</strong> but still create accountability and trust.</p>
<p>This includes:</p>
<ul>
<li><p>Designing model lifecycle and accountability frameworks</p>
</li>
<li><p>Building <strong>AI-assisted developer tooling</strong></p>
</li>
<li><p>Ensuring AI-enabled systems are observable and operable in production</p>
</li>
</ul>
<h3 id="heading-api-amp-integration-strategy">API &amp; Integration Strategy</h3>
<p>I approach APIs as long-lived products: designed first, versioned carefully, and governed just enough to stay consistent without becoming a bottleneck.</p>
<h3 id="heading-engineering-effectiveness-amp-technical-leadership">Engineering Effectiveness &amp; Technical Leadership</h3>
<p>A lot of my impact comes from raising the technical bar across teams — through design reviews, mentorship, and standards — and by building <strong>developer tooling and automation</strong> that makes the right path the easy one.</p>
<h3 id="heading-production-first-thinking">Production-first thinking</h3>
<p>I treat observability, reliability, security, and cost as design constraints from day one. Zero-downtime deployments, blue/green and canary strategies, and operational guardrails aren’t “nice to have” — they’re part of the architecture.</p>
<hr />
<h2 id="heading-technical-depth-the-tools-i-reach-for">🛠️ Technical depth (the tools I reach for)</h2>
<ul>
<li><p><strong>Languages</strong>: Java (20+ years), Go</p>
</li>
<li><p><strong>Application Platforms</strong>: Spring / Spring Boot, cloud-native architectures</p>
</li>
<li><p><strong>Architecture</strong>: Microservices, event-driven systems, API platforms</p>
</li>
<li><p><strong>Cloud &amp; Infrastructure</strong>: Kubernetes, Docker, Helm, GitOps (ArgoCD), Terraform</p>
</li>
<li><p><strong>Platform Concerns</strong>: Infrastructure migration &amp; modernization, container security and runtime protection</p>
</li>
<li><p><strong>Data</strong>: Relational-first design, PostgreSQL sharding &amp; partitioning, MongoDB for configuration and operational data, pragmatic polyglot persistence</p>
</li>
<li><p><strong>Streaming &amp; Messaging</strong>: Kafka-based architectures</p>
</li>
<li><p><strong>Observability</strong>: Metrics, tracing, logging, OpenTelemetry-based systems</p>
</li>
<li><p><strong>AI &amp; Governance</strong>:</p>
<ul>
<li><p>AI/ML platform integration</p>
</li>
<li><p>Model lifecycle &amp; accountability frameworks</p>
</li>
<li><p>Operational observability for AI systems</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-what-this-github-space-is-for">📌 What this GitHub space is for</h2>
<p>This GitHub space reflects how I think and work:</p>
<ul>
<li><p>Architecture explorations and design notes</p>
</li>
<li><p>Platform, infrastructure, and AI governance experiments</p>
</li>
<li><p>Opinionated but pragmatic approaches to system and platform design</p>
</li>
<li><p>Code and documentation written with <strong>production reality</strong> in mind</p>
</li>
</ul>
<p>I value <strong>clarity over cleverness</strong>, <strong>stability over novelty</strong>, and <strong>decisions that age well</strong>.</p>
<hr />
<h2 id="heading-ask-me-about">💬 Ask me about</h2>
<ul>
<li><p>Designing platforms that survive scale, re-orgs, and regulatory pressure</p>
</li>
<li><p>Domain boundaries, ownership, and DDD in real organizations</p>
</li>
<li><p>JVM performance, GC behavior, and production tuning</p>
</li>
<li><p>API strategy and governance</p>
</li>
<li><p>Responsible AI adoption and governance models</p>
</li>
<li><p>Technical leadership and architecture influence at senior levels</p>
</li>
</ul>
<hr />
<h2 id="heading-perspective">⚡ Perspective</h2>
<p>Good architecture is less about frameworks and more about <strong>constraints, trade-offs, and knowing what <em>not</em> to build</strong> — especially when AI enters the system.</p>
]]></content:encoded></item></channel></rss>