<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Aravind's Blog]]></title><description><![CDATA[This blog is a collection of notes from the trenches — building resilient distributed systems and exploring the practical realities of AI in production.

I write about failure modes, trade-offs, and the responsibility engineers carry when designing systems that make decisions at scale.

Expect grounded reflections, architectural patterns, and experiments with AI — without hype, and without shortcuts.

In short: engineering judgment in an age of automation.]]></description><link>https://aravindv.pro</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 08:53:51 GMT</lastBuildDate><atom:link href="https://aravindv.pro/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How I Lost a Credit Card Trying to Get One]]></title><description><![CDATA[There’s a certain kind of confidence that comes with upgrading a credit card.
It feels like a small step up — better limits, better perks, better… access to airport lounges you may or may not use.
So ]]></description><link>https://aravindv.pro/how-i-lost-a-credit-card-trying-to-get-one</link><guid isPermaLink="true">https://aravindv.pro/how-i-lost-a-credit-card-trying-to-get-one</guid><category><![CDATA[banking]]></category><category><![CDATA[credit card]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Thu, 09 Apr 2026 18:00:55 GMT</pubDate><content:encoded><![CDATA[<p>There’s a certain kind of confidence that comes with upgrading a credit card.</p>
<p>It feels like a small step up — better limits, better perks, better… access to airport lounges you may or may not use.</p>
<p>So I did what seemed reasonable.</p>
<p>I tried to get a better one.</p>
<hr />
<p>It started well.</p>
<p>There were conversations.<br />There was movement.<br />And, more importantly, there was <strong>assurance</strong>.</p>
<p>“It’s in progress, sir.”<br />“Approval is done.”<br />“We’ll initiate.”</p>
<p>At that point, I wasn’t just hopeful.</p>
<p>I was planning.</p>
<hr />
<p>Then came the process.</p>
<p>Not a process.<br />More like a collection of philosophical ideas loosely connected by follow-ups.</p>
<p>Every few days:</p>
<p>“Any update?”</p>
<p>And every time, there was one.</p>
<ul>
<li><p>“We’ll re-initiate.”</p>
</li>
<li><p>“It’s a long process.”</p>
</li>
<li><p>“Without invite it’s difficult.”</p>
</li>
<li><p>“But we’ve got approval.”</p>
</li>
</ul>
<p>This was my first introduction to the concept of:</p>
<blockquote>
<p>Schrödinger’s Credit Card<br />Approved and not initiated at the same time.</p>
</blockquote>
<hr />
<p>Documentation added depth to the experience.</p>
<p>My PAN card, for instance, turned out to be quite selective.</p>
<ul>
<li><p>Valid for some things</p>
</li>
<li><p>Not valid for others</p>
</li>
<li><p>Valid if signed</p>
</li>
<li><p>Not valid if photocopied</p>
</li>
<li><p>Possibly valid depending on mood</p>
</li>
</ul>
<p>I began to suspect the issue wasn’t the document.</p>
<p>It was me not understanding its personality.</p>
<hr />
<p>But the real turning point was subtle.</p>
<p>At some point, I stopped being excited about the new card<br />and started getting tired of the process.</p>
<p>And when that happens, you start simplifying.</p>
<hr />
<p>“Let’s not proceed with the card.”</p>
<p>A simple message. No drama.</p>
<hr />
<p>And then, in a moment of quiet clarity (or frustration, depending on how you look at it), I did something unexpected.</p>
<p>I closed my existing credit card.</p>
<hr />
<p>So just to recap:</p>
<p>I started with one credit card.<br />Tried to get a better one.<br />And ended up with… fewer.</p>
<hr />
<p>If you think about it, it’s quite efficient.</p>
<p>No upgrade.<br />No confusion.<br />Just subtraction.</p>
<hr />
<p>The interesting part is not the outcome.</p>
<p>It’s what the process tells you.</p>
<p>Because getting something should feel like movement.</p>
<p>Not like maintenance.</p>
<hr />
<p>Somewhere between “approval” and “re-initiation,”<br />I realised something simple:</p>
<blockquote>
<p>If something this small can’t be taken to closure,<br />it’s worth thinking about how larger things would be handled.</p>
</blockquote>
<hr />
<p>The new card may still happen someday.</p>
<p>Maybe it’s still in progress.</p>
<p>But for now, I have one less card,<br />and one more story.</p>
]]></content:encoded></item><item><title><![CDATA[What Zero Doesn’t Tell You]]></title><description><![CDATA[The Comfort of a Number
Zero is comforting.
It suggests completion. Certainty. A clean state. In systems that are otherwise complex and difficult to reason about, a number like zero offers something r]]></description><link>https://aravindv.pro/what-zero-doesn-t-tell-you</link><guid isPermaLink="true">https://aravindv.pro/what-zero-doesn-t-tell-you</guid><category><![CDATA[Security]]></category><category><![CDATA[securityawareness]]></category><category><![CDATA[measurement]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Tue, 31 Mar 2026 17:20:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/696bf18d-6998-4fc5-91b4-e23d7fb1012a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Comfort of a Number</h2>
<p>Zero is comforting.</p>
<p>It suggests completion. Certainty. A clean state. In systems that are otherwise complex and difficult to reason about, a number like zero offers something rare — closure.</p>
<p>And that is precisely why it can be misleading.</p>
<hr />
<h2>A Mismatch</h2>
<p>Metrics usually begin as signals. They help us observe a system, compare states, and make decisions with some degree of objectivity. Over time, however, they tend to become targets. Once that happens, the relationship between the metric and the underlying reality starts to weaken.</p>
<p>The number remains. The meaning begins to drift.</p>
<hr />
<p>I ran into this recently in a system where “zero vulnerabilities” had become an important goal.</p>
<p>Two base images were being evaluated.</p>
<p>One passed cleanly — no reported findings. The other did not.</p>
<p>At first glance, the decision seemed straightforward. If the objective is zero, and one option achieves it, the system appears to have made the choice for you.</p>
<p>But something about that conclusion did not sit right.</p>
<p>Not because the number was incorrect, but because it conflicted with an expectation I had about the systems themselves.</p>
<p>One of the images was <em>distroless</em> — intentionally minimal, designed to include only what is required to run the application and little else. The other included a broader runtime environment, with more built-in capabilities available inside the container.</p>
<p>Instinctively, the distroless image felt more constrained. There was simply less present that could be used if something went wrong.</p>
<p>And yet, the scan results suggested the opposite.</p>
<p>The distroless image reported more findings. The other appeared clean.</p>
<p>At that point, the question was no longer which image to choose.</p>
<p>It was: what exactly were we measuring?</p>
<hr />
<h2>What We Were Measuring</h2>
<p>Looking closer did not reveal an issue in how the system behaved. It revealed a limitation in how the system was being observed.</p>
<p>What we had was not a direct measure of risk, but a measure of what the scanning process could identify, based on its data sources and reporting model.</p>
<p>The number was accurate within that context.</p>
<p>But the context itself was narrower than the system it was being used to represent.</p>
<hr />
<p>The situation became clearer when we looked at the same images through different scanners.</p>
<p>The results were not identical.</p>
<p>The counts shifted. Some findings appeared, others disappeared.</p>
<p>Nothing about the underlying system had changed.</p>
<p>Only the lens had.</p>
<p>Which meant the number we were optimizing for was not just a function of the system — but also of how it was being observed.</p>
<hr />
<h2>When Metrics Become Targets</h2>
<p>There is a broader pattern here.</p>
<p>It is often described through <strong>Goodhart’s Law:</strong></p>
<blockquote>
<p>When a metric becomes the goal, the system begins to optimize the metric rather than the outcome it was intended to capture.</p>
</blockquote>
<p>This is not unique to security. It is a general property of optimization systems.</p>
<p>In reinforcement learning, an agent maximizes a reward signal. If the reward function is well-designed, this leads to the desired behavior. If it is not, the agent finds ways to maximize the signal while drifting away from the original intent.</p>
<blockquote>
<p>A <a href="https://openai.com/index/faulty-reward-functions/">well-known example</a> is a simulated boat racing environment from OpenAI, where the agent learned to loop in place collecting reward points instead of completing the race. The system did not fail. It optimized exactly what it was given.</p>
</blockquote>
<p>The behavior was correct with respect to the metric — and incorrect with respect to the outcome.</p>
<p>The system does not fail.</p>
<p>It does exactly what it was asked to do.</p>
<p>Just not what was meant.</p>
<hr />
<p>The same dynamic appears in organizational systems.</p>
<p>If vulnerability count becomes the primary measure of security, teams will optimize for reducing that number.</p>
<p>Over time, the system adapts around the metric.</p>
<p>The number improves.</p>
<p>The underlying risk may not.</p>
<hr />
<p>There is a more subtle implication here.</p>
<p>If the metric had been applied strictly — zero findings as a hard requirement — the distroless image would have been rejected outright.</p>
<p>Not because it exposed more capability.<br />Not because it was more exploitable.</p>
<p>But because <em><strong>it surfaced more findings</strong></em> in a particular scanning context.</p>
<p>The system would have optimized for the metric.</p>
<p>And in doing so, potentially selected an option that was less constrained at runtime.</p>
<p>The outcome would still satisfy the policy.</p>
<p>But not necessarily the intent.</p>
<hr />
<h2>A Different Way to Look at the Same System: A Three-Pillar Framework</h2>
<p>CVE count, in this sense, behaves like a proxy.</p>
<p>Useful, but incomplete.</p>
<p>When treated as the objective, it begins to exhibit the same characteristics as a poorly designed reward function — easy to optimize, but not always aligned with what we actually care about.</p>
<hr />
<p>The shift in thinking, when it happened, was not dramatic.</p>
<p>We stopped asking which option had fewer reported vulnerabilities, and started asking what risk we were actually trying to reduce.</p>
<hr />
<p>That led to a different way of evaluating the same system — not as a number, but along three dimensions.</p>
<p>The <strong>first pillar</strong> was about <em><strong>fixability</strong></em>.</p>
<p><strong>Were there vulnerabilities with available fixes that had not yet been applied?</strong></p>
<p>The <strong>second pillar</strong> was about <em><strong>exploitability</strong></em>.</p>
<p><strong>Among the remaining issues, how likely are they to be used in practice?</strong></p>
<p>The <strong>third pillar</strong> was about <em><strong>exposure at runtime</strong></em>.</p>
<p><strong>If something were to get through, what capabilities would be available inside the system?</strong></p>
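<p>The three questions above can be turned into a small evaluation rubric. The following is an illustrative sketch only; the class, fields, and ordering are hypothetical, not tooling from this system:</p>
<pre><code class="language-java">// Hypothetical sketch: comparing two images along the three pillars
// instead of by raw finding count. All names are illustrative.
public class ImageRiskProfile {

    final String name;
    final int unappliedFixes;          // Pillar 1: findings with an available fix
    final double exploitLikelihood;    // Pillar 2: e.g. an EPSS-style score in [0, 1]
    final int runtimeCapabilities;     // Pillar 3: shells, package managers, etc.

    ImageRiskProfile(String name, int unappliedFixes,
                     double exploitLikelihood, int runtimeCapabilities) {
        this.name = name;
        this.unappliedFixes = unappliedFixes;
        this.exploitLikelihood = exploitLikelihood;
        this.runtimeCapabilities = runtimeCapabilities;
    }

    // Prefer fewer unapplied fixes, then lower exploit likelihood,
    // then less capability exposed at runtime.
    static ImageRiskProfile prefer(ImageRiskProfile a, ImageRiskProfile b) {
        if (a.unappliedFixes != b.unappliedFixes)
            return a.unappliedFixes &lt; b.unappliedFixes ? a : b;
        if (Double.compare(a.exploitLikelihood, b.exploitLikelihood) != 0)
            return a.exploitLikelihood &lt; b.exploitLikelihood ? a : b;
        return a.runtimeCapabilities &lt;= b.runtimeCapabilities ? a : b;
    }
}
</code></pre>
<p>In the scenario described here, both images tie on the first two pillars, so the decision falls to runtime exposure, where the distroless image wins despite reporting more findings.</p>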
<hr />
<p>These pillars, framed as questions, did not contradict the metric.</p>
<p>They extended it.</p>
<p>And they changed the answer.</p>
<hr />
<p>In this case,</p>
<p>Pillar 1: Both images had reached a point where there were no immediate fixes to apply.</p>
<p>Pillar 2: Both showed low likelihood of exploitation based on available data.</p>
<p>Pillar 3: One image exposed more capability at runtime (and it was not the distroless one).</p>
<p>The distinction lay in what each image allowed once running.</p>
<p>The distroless image constrained it.</p>
<hr />
<p>The number had pointed in one direction.</p>
<p>The system, when examined through a different lens, pointed in another.</p>
<hr />
<h2>What Changes When You Look Differently</h2>
<p>This way of thinking is not new.</p>
<p>It is reflected, in different forms, in existing security guidance.</p>
<p><strong>NIST SP 800-190</strong> treats vulnerability management as a continuous, risk-based process rather than a binary state.</p>
<p>The <strong>CIS Docker Benchmark</strong> emphasizes reducing attack surface — removing unnecessary components, limiting capabilities, and constraining what is available at runtime.</p>
<p>The principles are well understood.</p>
<p>What varies is how they are applied in practice.</p>
<p>This alignment is not accidental.</p>
<p>Most mature security frameworks do not define security as the <em>absence of findings,</em> but as the <strong>presence of effective controls and risk management.</strong></p>
<p>Standards like SOC 2 and ISO 27001 focus on <strong>how vulnerabilities are identified, assessed, and managed over time</strong> — not on achieving a static count.</p>
<p>Even stricter environments such as FedRAMP operate on <strong>continuous monitoring and risk acceptance</strong>, rather than assuming that “zero findings” is a meaningful or achievable steady state.</p>
<p>The emphasis, consistently, is on <em><strong>managing</strong></em> risk — not eliminating numbers.</p>
<hr />
<h2>Metrics, and What They Hide</h2>
<p>Metrics are abstractions.</p>
<p>They compress a complex system into something that can be observed, compared, and optimized.</p>
<p>That compression is valuable.</p>
<p>But it also hides detail.</p>
<hr />
<p>The goal, then, is not to eliminate metrics.</p>
<p>They remain essential.</p>
<p>But they cannot be the decision.</p>
<hr />
<p>They are a starting point.</p>
<p>The rest requires judgment.</p>
<hr />
<p>Because in systems of any real complexity, the cleanest number is not always the clearest signal.</p>
<p>And sometimes, zero is not the end of the story.</p>
<p>It is simply a reflection of what we chose to see — and what we didn’t.</p>
]]></content:encoded></item><item><title><![CDATA[API Maturity in the Age of AI]]></title><description><![CDATA[The rise of AI agents, tool ecosystems, and Model Context Protocol (MCP) integrations is changing how software interacts with APIs.
For years, APIs were designed primarily for human developers.Develop]]></description><link>https://aravindv.pro/api-maturity-in-the-age-of-ai</link><guid isPermaLink="true">https://aravindv.pro/api-maturity-in-the-age-of-ai</guid><category><![CDATA[api]]></category><category><![CDATA[mcp]]></category><category><![CDATA[REST API]]></category><category><![CDATA[Restful API Modeling Language]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Thu, 12 Mar 2026 04:38:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/9197fa0c-5ba5-4ad4-a0a9-b4eb54498ed1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The rise of AI agents, tool ecosystems, and Model Context Protocol (MCP) integrations is changing how software interacts with APIs.</p>
<p>For years, APIs were designed primarily for human developers.<br />Developers read documentation, understood workflows, and wrote integration code.</p>
<p>Increasingly, however, APIs are being consumed by machines.</p>
<p>Agents invoke tools.<br />LLMs orchestrate workflows.<br />MCP servers expose capabilities dynamically.</p>
<p>In this world, APIs are no longer just interfaces for developers.<br /><strong>They are interfaces that machines must be able to understand and navigate.</strong></p>
<p>Which raises an important question:</p>
<blockquote>
<p>How easy is it for a machine to understand and navigate your API?</p>
</blockquote>
<p>Interfaces that are <strong>structured, predictable, and semantically meaningful</strong> are easier for both humans and machines to understand.</p>
<p>Almost two decades ago, <strong>Leonard Richardson</strong> proposed a framework that captures this idea remarkably well.</p>
<p>It is called the <strong>Richardson Maturity Model (RMM)</strong>.</p>
<hr />
<h2>What this article is about...</h2>
<p>This article explores:</p>
<ul>
<li><p>Why API structure matters even more in the age of <strong>AI agents and MCP</strong></p>
</li>
<li><p>What the <strong>Richardson Maturity Model (RMM)</strong> is</p>
</li>
<li><p>How APIs evolve across the <strong>four maturity levels</strong></p>
</li>
<li><p>Why most real-world APIs stop at <strong>Level 2</strong></p>
</li>
<li><p>How API maturity affects <strong>machine-driven integrations</strong></p>
</li>
</ul>
<hr />
<h1>What is the Richardson Maturity Model?</h1>
<p>The <strong>Richardson Maturity Model</strong> describes how effectively an API uses the features of the web.</p>
<p>At its core, the model asks a simple question:</p>
<blockquote>
<p><em>Are we using HTTP as a true application protocol, or merely as a transport mechanism?</em></p>
</blockquote>
<p>The more an API leverages the semantics of the web — <strong>resources, HTTP verbs, status codes, and hypermedia</strong> — the more mature it is considered.</p>
<p>The model defines a progression of maturity levels.</p>
<hr />
<h1>RMM Levels at a Glance</h1>
<table>
<thead>
<tr>
<th>Level</th>
<th>Core Idea</th>
<th>API Style</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Level 0</strong></td>
<td>HTTP as transport</td>
<td>Single endpoint, payload determines operation</td>
</tr>
<tr>
<td><strong>Level 1</strong></td>
<td>Resources</td>
<td>Multiple URIs but operations encoded in the path</td>
</tr>
<tr>
<td><strong>Level 2</strong></td>
<td>HTTP semantics</td>
<td>Proper use of verbs and status codes</td>
</tr>
<tr>
<td><strong>Level 3</strong></td>
<td>Hypermedia</td>
<td>APIs describe possible next actions</td>
</tr>
</tbody></table>
<hr />
<blockquote>
<p><strong>Engineering Insight</strong></p>
<p>Interfaces that are structured, predictable, and semantically meaningful are easier for both humans and machines to understand.</p>
</blockquote>
<hr />
<h1>Example: Coffee Ordering API</h1>
<p>To demonstrate the model, imagine we are building an API for a <strong>coffee ordering application</strong>.</p>
<p>The system supports three operations:</p>
<ol>
<li><p>Fetch the menu</p>
</li>
<li><p>Place an order</p>
</li>
<li><p>Check the order status</p>
</li>
</ol>
<hr />
<h2>Level 0 — One Service to Rule Them All</h2>
<p>Level 0 represents the <strong>lowest level of maturity</strong>.</p>
<p>HTTP is used purely as a <strong>transport layer</strong>, and the payload determines the operation.</p>
<h3>Endpoint</h3>
<pre><code class="language-plaintext">POST /coffee
</code></pre>
<h3>Example Request — Fetch Menu</h3>
<pre><code class="language-plaintext">{
  "operation": "getMenu"
}
</code></pre>
<h3>Example Request — Order Coffee</h3>
<pre><code class="language-plaintext">{
  "operation": "orderCoffee",
  "coffeeType": "Latte",
  "size": "Large"
}
</code></pre>
<h3>Example Request — Order Status</h3>
<pre><code class="language-plaintext">{
  "operation": "getOrderStatus",
  "orderId": 123
}
</code></pre>
<hr />
<h2>Level 1 — Resources</h2>
<p>Level 1 introduces <strong>multiple URIs representing resources</strong>, but still does not fully leverage HTTP semantics.</p>
<p>Example endpoints:</p>
<pre><code class="language-plaintext">GET /getMenu
POST /orderCoffee
GET /getOrderStatus?orderId=123
</code></pre>
<h3>Example Response</h3>
<pre><code class="language-plaintext">{
  "orderId": 123,
  "status": "PREPARING"
}
</code></pre>
<hr />
<h2>Level 2 — HTTP Verbs</h2>
<p>Level 2 introduces proper use of <strong>HTTP verbs and status codes</strong>.</p>
<p>This is the level most modern APIs operate at.</p>
<h3>Fetch Menu</h3>
<pre><code class="language-plaintext">GET /menu
</code></pre>
<h3>Response</h3>
<pre><code class="language-plaintext">{
  "items": [
    {
      "id": 1,
      "name": "Latte",
      "sizes": ["Small","Medium","Large"]
    }
  ]
}
</code></pre>
<hr />
<h3>Create Order</h3>
<pre><code class="language-plaintext">POST /orders
</code></pre>
<h3>Request</h3>
<pre><code class="language-plaintext">{
  "itemId": 1,
  "size": "Large"
}
</code></pre>
<h3>Response</h3>
<pre><code class="language-plaintext">{
  "orderId": 123,
  "status": "PREPARING"
}
</code></pre>
<hr />
<h2>Level 3 — Hypermedia (HATEOAS)</h2>
<p>Level 3 introduces <strong>Hypermedia as the Engine of Application State</strong>.</p>
<p>Responses include links describing possible next actions.</p>
<h3>Example Response</h3>
<pre><code class="language-plaintext">{
  "orderId": 123,
  "status": "PREPARING",
  "links": {
    "self": "/orders/123",
    "cancel": "/orders/123/cancel",
    "status": "/orders/123"
  }
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/f4e93c4e-ea01-4808-9a6b-0aa6de3233e5.png" alt="" style="display:block;margin:0 auto" />
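<p>To see what hypermedia buys a machine consumer, consider a client that resolves its next action from the response itself rather than from hard-coded paths. A minimal sketch, assuming the <code>links</code> structure shown above; the class name is hypothetical:</p>
<pre><code class="language-java">import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: navigate via the "links" section of a Level 3
// response instead of constructing URIs like "/orders/123/cancel".
public class HypermediaClient {

    // Resolve an action by name from the parsed "links" map.
    // An absent key means the action is not currently available.
    static Optional&lt;String&gt; resolve(Map&lt;String, String&gt; links, String action) {
        return Optional.ofNullable(links.get(action));
    }
}
</code></pre>
<p>An agent asking <code>resolve(links, "cancel")</code> either receives a URI to follow or learns that cancellation is not possible in the current state. The API, not the client, encodes what can happen next.</p>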

<h1>Why Most APIs Stop at Level 2</h1>
<p>Although Level 3 represents the highest level of maturity, <strong>most production APIs stop at Level 2</strong>.</p>
<p>Several practical reasons explain this:</p>
<p><strong>Client complexity</strong></p>
<p>Hypermedia-driven APIs require clients to dynamically interpret links.</p>
<p><strong>Tooling limitations</strong></p>
<p>Most tooling (OpenAPI, SDK generators, API gateways) assumes <strong>explicit endpoints</strong>, not hypermedia navigation.</p>
<p><strong>Frontend expectations</strong></p>
<p>Frontend systems prefer predictable contracts rather than dynamic discovery.</p>
<p>Because of these factors, <strong>Level 2 has effectively become the practical standard</strong>.</p>
<hr />
<blockquote>
<p><strong>AI &amp; MCP Perspective</strong></p>
<p>AI agents interact with APIs very differently from human developers.</p>
<p>Instead of reading documentation, they rely on clear structure,</p>
<p>predictable semantics, and discoverable workflows.</p>
<p>APIs that follow good REST principles naturally provide these signals.</p>
</blockquote>
<hr />
<h2>RMM in the Age of AI and MCP</h2>
<p>The rise of AI agents, tool ecosystems, and Model Context Protocol (MCP) integrations makes the ideas behind the Richardson Maturity Model relevant again.</p>
<p>AI systems interact with APIs very differently from human developers.</p>
<p>They don’t read documentation or follow predefined workflows.<br />They rely on signals exposed by the API itself.</p>
<p>These signals include:</p>
<ul>
<li><p>Clear resources</p>
</li>
<li><p>Predictable operations</p>
</li>
<li><p>Consistent response structures</p>
</li>
<li><p>Discoverable workflows</p>
</li>
</ul>
<p>APIs that follow strong design principles naturally provide these signals.</p>
<hr />
<p><strong>At Level 0</strong>, APIs expose very little structure.<br />Interactions are opaque and require interpretation, making them difficult for machines to use reliably.</p>
<p><strong>By Level 2</strong>, APIs become significantly easier for automated systems to interact with.<br />Resources, operations, and outcomes are explicit and predictable.</p>
<p><strong>Level 3</strong> — with hypermedia-driven navigation — goes a step further.<br />It begins to resemble <strong>machine-discoverable workflows</strong>, where the API itself guides what can be done next.</p>
<p>This aligns closely with how AI agents explore tools and how MCP-style systems expose capabilities dynamically.</p>
<hr />
<p>What enables this is not just better use of HTTP, but clearer expression of the underlying system.</p>
<p>For example:</p>
<pre><code class="language-plaintext">GET /menu/{menuItem}/options
</code></pre>
<p>This does more than return data.</p>
<p>It expresses that a menu item has configurable options.</p>
<p>When extended further:</p>
<pre><code class="language-plaintext">GET /menu/{menuItem}/sizes  
GET /menu/{menuItem}/milk-options  
</code></pre>
<p>Different aspects of the system become independently addressable.</p>
<p>This makes the API easier to explore, reason about, and compose into workflows — especially for machines.</p>
<hr />
<p>API maturity, therefore, is not just about correctness.</p>
<p>It is about how easily another system can understand and navigate what is exposed.</p>
<p>In a world where APIs are increasingly consumed by AI agents, clarity of structure and intent becomes a fundamental requirement.</p>
<hr />
<h1>Summary</h1>
<p>The Richardson Maturity Model provides a useful lens for understanding API design.</p>
<ul>
<li><p><strong>Level 0 — One Service to Rule Them All</strong><br />A single endpoint controls everything, with behavior hidden in the payload.</p>
</li>
<li><p><strong>Level 1 — One Method to Rule Them All</strong><br />Resources appear, but intent is still implicit and constrained.</p>
</li>
<li><p><strong>Level 2 — Federalism</strong><br />Structure emerges. Responsibilities are distributed, and interactions become more predictable — <em>no longer one power controlling everything.</em></p>
</li>
<li><p><strong>Level 3 — Cooperative Federalism</strong><br />The system becomes navigable. It begins to guide clients through possible next actions — <em>almost as if the path is being shown, not guessed.</em></p>
</li>
</ul>
<hr />
<p>In practice, most APIs stop at Level 2 — and for good reason.</p>
<p>But the real value of the model is not in reaching Level 3.</p>
<p>It is in asking a more important question:</p>
<blockquote>
<p>How clearly does your API express the system it represents?</p>
</blockquote>
<p>At lower levels, interacting with an API feels like dealing with a powerful but opaque system — everything is possible, but nothing is obvious.</p>
<p>As maturity increases, structure emerges.<br />Intent becomes clearer.<br />Navigation becomes easier.</p>
<p>And at the highest level, the system begins to guide you — not unlike a well-charted world where the next step is always visible.</p>
<p>In a world where APIs are increasingly consumed by AI agents, not just developers, this clarity is no longer optional.</p>
<p>Because for machines, the API is not just an interface.</p>
<p>It is the only way the system reveals itself.</p>
]]></content:encoded></item><item><title><![CDATA[When Consumers Fail: Extending Selective Retry to Kafka]]></title><description><![CDATA[Designing a Kafka Error Pipeline
Kafka consumers fail for the same reason every distributed component fails: not all errors mean the same thing.
A network timeout, a malformed payload, and a downstrea]]></description><link>https://aravindv.pro/when-consumers-fail-extending-selective-retry-to-kafka</link><guid isPermaLink="true">https://aravindv.pro/when-consumers-fail-extending-selective-retry-to-kafka</guid><category><![CDATA[Resilience]]></category><category><![CDATA[kafka]]></category><category><![CDATA[error handling]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Wed, 04 Mar 2026 14:58:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/3a6ac75b-880d-41c8-b19f-0cb9ebda778c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Designing a Kafka Error Pipeline</h2>
<p>Kafka consumers fail for the same reason every distributed component fails: <strong>not all errors mean the same thing.</strong></p>
<p>A network timeout, a malformed payload, and a downstream outage are all failures — but treating them the same leads to the same bad outcomes: wasted retries, stalled partitions, and messages cycling through retry loops that were never going to succeed.</p>
<p>Most Kafka consumer setups do exactly that: they retry everything until a limit is reached and only then decide what to do with the message.</p>
<p>That works — until it doesn't.</p>
<p>In the previous post, <a href="https://aravindv.pro/not-all-failures-deserve-a-second-chance"><em>Not All Failures Deserve a Second Chance</em></a>, the focus was classification: a typed exception carrying an <code>ErrorType</code>, and a retry policy that understands the semantics of failure instead of guessing from exception classes.</p>
<p>That article covered the outbound side of a system — REST clients, Kafka producers, and service calls.</p>
<p>This post moves to the other side of the boundary: <strong>the Kafka consumer pipeline.</strong></p>
<p>The classification already exists.<br />The retry policy already exists.</p>
<p>What remains is shaping the consumer pipeline so it respects those decisions.</p>
<h2>Consumer Failure Pipeline</h2>
<p>Before looking at the code, it helps to picture the error flow at a system level.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/4776fcb6-d6c0-44de-857a-ea7b5379819b.svg" alt="" style="display:block;margin:0 auto" />

<p>The diagram shows how a consumer failure is classified and how that classification determines the routing path.</p>
<p>Two decisions drive the entire pipeline:</p>
<ol>
<li><p><strong>Is the failure retryable?</strong><br />This is decided once by the <code>SelectiveRetryPolicy</code>, which classifies the error according to its semantics (transient vs. permanent, etc.).</p>
</li>
<li><p><strong>If retries exhaust, where does the message go next?</strong><br />That routing decision is determined by configuration (for example: retry topic, dead-letter topic, parking lot, or discard).</p>
</li>
</ol>
<p>Separating classification from routing keeps the error-handling logic predictable: classification happens once, and subsequent routing behaviour follows from that classification and your configured policies.</p>
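<p>The two decisions can be sketched independently of Spring Kafka. The following is an illustrative model of that flow, not code from this system; all names are hypothetical:</p>
<pre><code class="language-java">// Hypothetical sketch of the pipeline's two decisions:
// classify once, then route from the classification and retry budget.
public class FailureRouter {

    enum ErrorType { TRANSIENT, PERMANENT }
    enum Route { RETRY, NON_RETRYABLE_RECOVERER, DEAD_LETTER }

    // Decision 1: classification happens exactly once per failure.
    static ErrorType classify(Throwable t) {
        // A malformed payload will never succeed on retry.
        return (t instanceof IllegalArgumentException)
                ? ErrorType.PERMANENT
                : ErrorType.TRANSIENT;
    }

    // Decision 2: routing follows from the classification and config.
    static Route route(ErrorType type, int attempt, int maxAttempts) {
        if (type == ErrorType.PERMANENT) {
            return Route.NON_RETRYABLE_RECOVERER;
        }
        return attempt &lt; maxAttempts ? Route.RETRY : Route.DEAD_LETTER;
    }
}
</code></pre>
<p>A permanent failure bypasses the retry loop entirely, while a transient one retries until the configured budget is spent and only then moves to the dead-letter path.</p>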
<h2>What DefaultErrorHandler Doesn't Do</h2>
<p>Spring Kafka's <code>DefaultErrorHandler</code> retries up to a configured limit and routes to a recoverer on exhaustion.</p>
<p>That behaviour is useful — but it applies the same retry loop regardless of <em>why</em> the record failed.</p>
<p>A <code>VALIDATION_ERROR</code> receives the same retry attempts as a transient <code>SERVICE_SERVER_ERROR</code>. Those retries accomplish nothing. The payload will not change between attempts. The only effect is that the partition is blocked while the consumer repeatedly processes something it already knows will fail.</p>
<p><code>DefaultErrorHandler</code> does not decide <em>whether</em> something should be retried.</p>
<p>It simply retries.</p>
<p>The solution is to place a handler in front of it that decides whether the record should enter that retry loop at all.</p>
<h2>CompositeKafkaErrorHandler</h2>
<p>When a Kafka consumer throws, Spring invokes <code>handleOne()</code> on the configured <code>CommonErrorHandler</code> — in most apps this is <code>DefaultErrorHandler</code>. <code>CompositeKafkaErrorHandler</code> wraps that handler and introduces one additional decision point before delegating.</p>
<p>It composes three things:</p>
<ul>
<li><p>a <code>SelectiveRetryPolicy</code> (from the previous article)</p>
</li>
<li><p>a recoverer for retryable errors whose retries are exhausted</p>
</li>
<li><p>a recoverer for non‑retryable errors</p>
</li>
</ul>
<p>The two recoverers represent different outcomes: a transient error that exhausts retries and a validation/semantic failure are both routed somewhere observable, but they should not reach that destination in the same way or at the same time in the flow.</p>
<pre><code class="language-java">public class CompositeKafkaErrorHandler&lt;T extends Throwable, E extends Enum&lt;E&gt;&gt;
        implements CommonErrorHandler {

    private final DefaultErrorHandler defaultErrorHandler;
    private final SelectiveRetryPolicy&lt;T, E&gt; retryPolicy;
    private final BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; exhaustedRecoverer;
    private final BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; nonRetryableRecoverer;

    public CompositeKafkaErrorHandler(
            SelectiveRetryPolicy&lt;T, E&gt; retryPolicy,
            BackOff backOff,
            BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; exhaustedRecoverer,
            BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; nonRetryableRecoverer) {
        this.retryPolicy = retryPolicy;
        this.exhaustedRecoverer = exhaustedRecoverer;
        this.nonRetryableRecoverer = nonRetryableRecoverer;
        this.defaultErrorHandler = new DefaultErrorHandler(
                (record, ex) -&gt; exhaustedRecoverer.accept(record, ex),
                backOff);
    }

    // ... delegate handleOne() to defaultErrorHandler after consulting retryPolicy ...
}
</code></pre>
<p>The generics mirror the retry policy: <code>&lt;T extends Throwable, E extends Enum&lt;E&gt;&gt;</code>.</p>
<p>Importantly, the handler is policy-driven: it doesn't depend on <code>ApplicationException</code> or an <code>ErrorType</code> enum directly. It only asks the <code>SelectiveRetryPolicy</code> whether a failure is retryable and routes accordingly, keeping the handler reusable across services with different exception hierarchies.</p>
<h2>Routing the Failure</h2>
<p><code>handleOne()</code> performs the routing.</p>
<pre><code class="language-java">@Override
public boolean handleOne(Exception thrownException, ConsumerRecord&lt;?, ?&gt; record,
        Consumer&lt;?, ?&gt; consumer, MessageListenerContainer container) {

    // Deserialization failures occur before business logic runs.
    if (thrownException.getCause() instanceof DeserializationException) {
        nonRetryableRecoverer.accept(record, thrownException);
        commitOffset(record, consumer);
        return true;
    }

    // Ask the retry policy first.
    if (isNonRetryableError(thrownException)) {
        nonRetryableRecoverer.accept(record, thrownException);
        commitOffset(record, consumer);
        return true;
    }

    // Retryable → delegate to DefaultErrorHandler
    return defaultErrorHandler.handleOne(thrownException, record, consumer, container);
}
</code></pre>
<p>The first branch handles deserialization failures.</p>
<p>These occur before business logic runs, meaning the typed exception hierarchy is not involved. <code>DeserializationException.getData()</code> contains the raw bytes of the message. Those bytes must be preserved and sent somewhere observable before the record is skipped.</p>
<p>Silently dropping malformed messages means you have no record that they ever arrived.</p>
<p>The second branch asks the retry policy whether the failure is non‑retryable. The policy evaluates the typed error classification first and falls back to exception‑class rules if necessary.</p>
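<p>Spring typically hands the error handler a wrapper exception (often a <code>ListenerExecutionFailedException</code>), so consulting the policy usually means walking the cause chain first. A minimal, self-contained sketch of that unwrapping (the class and method names here are illustrative, not the real handler's API):</p>
<pre><code class="language-java">// Illustrative sketch: walk the cause chain to find the first cause of the
// type we care about. Names are assumptions, not Spring Kafka API.
class CauseUnwrapper {

    static Throwable unwrap(Throwable thrown, Class targetType) {
        for (Throwable t = thrown; t != null; t = t.getCause()) {
            if (targetType.isInstance(t)) {
                return t; // first cause of the interesting type
            }
        }
        return thrown; // nothing matched: fall back to the wrapper itself
    }
}
</code></pre>
<p>The same idea underlies the deserialization branch above: the <code>DeserializationException</code> is found on the cause, not on the thrown exception itself.</p>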
<p>Non‑retryable failures skip the retry loop entirely.</p>
<p>The record is routed to the non‑retryable recoverer and the offset is committed at <code>offset + 1</code>. This tells Kafka the record has been handled and the consumer should move forward.</p>
<p>Because this path bypasses <code>DefaultErrorHandler</code>, the handler must also take responsibility for committing the offset itself.</p>
<p>Without that commit the consumer would stall on the same record indefinitely.</p>
<p>If neither branch matches, the exception is considered retryable and <code>DefaultErrorHandler</code> takes over using the configured <code>BackOff</code>.</p>
<p>When retries are exhausted, <code>DefaultErrorHandler</code> invokes the <code>exhaustedRecoverer</code> wired into its constructor.</p>
<h2>Recoverers Belong in Configuration</h2>
<p>The handler is infrastructure. It understands <em>how</em> to route failures but not <em>where</em> they should go.</p>
<p>Destination policy belongs in configuration.</p>
<p><code>CompositeKafkaErrorHandler</code> therefore accepts two <code>BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt;</code> functions:</p>
<pre><code class="language-java">BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; exhaustedRecoverer = (record, exception) -&gt; 
{
        if (retryProperties.isEnabled()) {
                errorPublisher.publishToRetry1m(record, exception);
        } else {
                ApplicationException appEx = exception instanceof ApplicationException ae ? ae: ApplicationException.of(ErrorType.SERVICE_SERVER_ERROR, "Retries exhausted: " + exception.getMessage());
                errorPublisher.publishToDlq(record.topic(), record.value(), appEx, record);
        }
};

BiConsumer&lt;ConsumerRecord&lt;?, ?&gt;, Exception&gt; nonRetryableRecoverer = (record, exception) -&gt; 
{
        if (exception.getCause() instanceof DeserializationException) {
                errorPublisher.publishDeserializationFailureToDlq(record, exception);
        } else {
                errorPublisher.publishBatchRecordToDlq(record, exception);
        }
};

return new CompositeKafkaErrorHandler&lt;&gt;(kafkaConsumerRetryPolicy, backOff, exhaustedRecoverer, nonRetryableRecoverer);
</code></pre>
<p>The handler remains reusable across services.</p>
<p>Only the routing behaviour changes — which DLQ topic to use, whether retry topics are enabled, or how generic exceptions should be wrapped.</p>
<h2>The Retry Chain</h2>
<p>In‑process retries are designed for <strong>milliseconds of instability, not minutes of outage</strong>.</p>
<p>They handle short‑lived issues such as network jitter or brief service hiccups. They do not solve problems like dependency restarts, rate limits that reset on minute boundaries, or a database mid‑failover.</p>
<p>For those situations the message needs time.</p>
<p>When in‑process retries exhaust on a retryable error, the record is moved to a separate Kafka topic instead of going directly to the DLQ.</p>
<p>A dedicated consumer processes <code>source-topic-retry-1m</code>, waiting at least one minute before redelivery. If retries exhaust again, the message progresses to <code>source-topic-retry-5m</code>, and finally to the DLQ if all stages fail.</p>
<p>Non‑retryable failures bypass this chain entirely.</p>
<p>A validation error that occurs at <code>10:00:00</code> should be visible in the DLQ at <code>10:00:00</code> — not at <code>10:06:00</code> after exhausting retry stages it had no reason to enter.</p>
<p>The chain is controlled through configuration:</p>
<pre><code class="language-yaml">kafka:
  retry:
    enabled: ${KAFKA_RETRY_ENABLED:true}
    suffixes:
      first-stage: ${KAFKA_RETRY_1M_SUFFIX:retry-1m}
      second-stage: ${KAFKA_RETRY_5M_SUFFIX:retry-5m}
  error:
    dlq-suffix: ${KAFKA_DLQ_SUFFIX:-dlq}
</code></pre>
<p>Setting <code>kafka.retry.enabled=false</code> routes exhausted retryable errors directly to the DLQ in environments that do not provision retry topics.</p>
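<p>Given those suffixes, the stage and DLQ topic names can be derived mechanically from the source topic. A small sketch (the helper and its naming rules are assumptions that mirror the defaults above, not code from the service):</p>
<pre><code class="language-java">// Illustrative helper: derive stage and DLQ topic names from the configured suffixes.
class RetryTopicNames {

    static String stageTopic(String sourceTopic, String suffix) {
        // "source-topic" + "retry-1m" yields "source-topic-retry-1m"
        return sourceTopic + "-" + suffix;
    }

    static String dlqTopic(String sourceTopic, String dlqSuffix) {
        // the dlq-suffix default above ("-dlq") already carries its own hyphen
        if (dlqSuffix.startsWith("-")) {
            return sourceTopic + dlqSuffix;
        }
        return sourceTopic + "-" + dlqSuffix;
    }
}
</code></pre>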
<p>What is not optional is the DLQ itself.</p>
<p>A consumer that silently drops failed records has no observable error signal.</p>
<h2>One Classification, Two Contexts</h2>
<p>The <code>SelectiveRetryPolicy</code> driving <code>CompositeKafkaErrorHandler</code> is the same policy used for outbound retries.</p>
<p>The failure taxonomy is defined once.</p>
<p>Whether <code>VALIDATION_ERROR</code> should be retried is not a decision repeated in HTTP clients, Kafka consumers, and service calls. The behaviour follows directly from the <code>ErrorType</code> definition.</p>
<p>Add a new <code>ErrorType</code> variant with the correct semantics and it automatically receives the right behaviour everywhere — inbound and outbound.</p>
<p>That is what it means for retry logic to be readable and auditable.</p>
<p><em><strong>Not just in one place.</strong></em></p>
<p><em><strong>Across the system.</strong></em></p>
]]></content:encoded></item><item><title><![CDATA[Not All Failures Deserve a Second Chance ]]></title><description><![CDATA[Retry logic is one of those things that feels simple until it isn’t.
You add @Retryable, set maxAttempts = 3, and move on. Everything works — until the system happily retries a validation error three ]]></description><link>https://aravindv.pro/not-all-failures-deserve-a-second-chance</link><guid isPermaLink="true">https://aravindv.pro/not-all-failures-deserve-a-second-chance</guid><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Wed, 25 Feb 2026 06:06:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/694e3ae41267e9e911e72c10/67848cf0-68da-4094-87a5-ae5eb1a3b67b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Retry logic is one of those things that feels simple until it isn’t.</p>
<p>You add <code>@Retryable</code>, set <code>maxAttempts = 3</code>, and move on. Everything works — until the system happily retries a validation error three times, re-publishes a message that was never going to fit in Kafka, or quietly <em>fails to retry</em> something that actually was transient.</p>
<p>At that point, retry logic stops being resilience. It becomes noise with a delay.</p>
<p>This post describes a pattern we introduced to prevent that kind of misconfiguration from ever becoming a production problem: a <strong>selective retry policy</strong> that makes retry intent explicit and centrally defined.</p>
<hr />
<h2>What Spring Retry Already Gives You</h2>
<p>Spring Retry’s <code>SimpleRetryPolicy</code> allows exception-class-based configuration:</p>
<pre><code class="language-java">Map&lt;Class&lt;? extends Throwable&gt;, Boolean&gt; retryable = new HashMap&lt;&gt;();
retryable.put(SocketTimeoutException.class, true);
retryable.put(ValidationException.class, false);

new SimpleRetryPolicy(3, retryable, true);
</code></pre>
<p>This works — but only at the class level.</p>
<p>In practice:</p>
<ul>
<li><p>A single <code>ApplicationException</code> may represent multiple failure modes.</p>
</li>
<li><p>Configuration becomes scattered across services.</p>
</li>
<li><p>There is no protection against marking the same exception as both retryable and non-retryable.</p>
</li>
</ul>
<p>Retries are not about exception structure.</p>
<p>They are about <strong>failure semantics</strong>.</p>
<hr />
<h2>The Core Insight: Failures Have Meaning</h2>
<p>Broadly speaking:</p>
<p><strong>Transient failures</strong></p>
<p>Temporary issues — network blips, timeouts, infrastructure instability.<br /><em>Retries help.</em></p>
<p><strong>Permanent failures</strong><br />Invalid input, malformed payloads, authentication failures.<br /><em>Retries waste time.</em></p>
<p>Retry logic should <strong>understand</strong> that distinction.</p>
<hr />
<h2>Step One: A Typed Exception with an Error Enum</h2>
<p>Instead of proliferating exception classes, introduce a single typed exception:</p>
<pre><code class="language-java">public enum ErrorType {
    VALIDATION_ERROR,
    SERVICE_CLIENT_ERROR,   // 4xx
    SERVICE_SERVER_ERROR,   // 5xx
    MESSAGING_ERROR,
    AUTHENTICATION_ERROR,
    UNKNOWN_ERROR
}

public class ApplicationException extends RuntimeException {

    private final ErrorType type;

    public ApplicationException(ErrorType type, String message) {
        super(message);
        this.type = type;
    }

    public ApplicationException(ErrorType type, String message, Throwable cause) {
        super(message, cause);
        this.type = type;
    }

    public ErrorType getType() {
        return type;
    }
}
</code></pre>
<p>Now the exception carries semantic intent.</p>
<p>The retry policy no longer needs to infer behaviour from class names or messages.</p>
<hr />
<h2>Step Two: The Selective Retry Policy</h2>
<p><code>SelectiveRetryPolicy</code> extends <code>SimpleRetryPolicy</code> and introduces:</p>
<ul>
<li><p>retryable exception classes</p>
</li>
<li><p>non-retryable exception classes</p>
</li>
<li><p>retryable error types</p>
</li>
<li><p>non-retryable error types</p>
</li>
<li><p>strict validation against overlap</p>
</li>
</ul>
<h3>Overlap Protection</h3>
<p>If an error type or exception class is configured as both retryable and non-retryable, the application fails at startup.</p>
<p>For example, this configuration is rejected:</p>
<pre><code class="language-java">SelectiveRetryPolicy.&lt;ApplicationException, ErrorType&gt;builder()
    .retryOnErrorType(ErrorType.SERVICE_SERVER_ERROR)
    .doNotRetryOnErrorType(ErrorType.SERVICE_SERVER_ERROR) // Illegal
    .build();
</code></pre>
<p>Misconfiguration is treated as a configuration error — not a runtime surprise. The actual policy is <a href="https://raw.githubusercontent.com/aravind-vis/blogs/refs/heads/main/SelectiveRetryPolicy.java">here</a>.</p>
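<p>The overlap check itself is straightforward to enforce at build time. A hedged sketch (the linked policy is the authoritative version; this class and its names are illustrative):</p>
<pre><code class="language-java">import java.util.HashSet;
import java.util.Set;

// Illustrative fail-fast check: reject any rule configured as both retryable
// and non-retryable.
class OverlapGuard {

    static void requireDisjoint(Set retryable, Set nonRetryable, String what) {
        Set overlap = new HashSet(retryable);
        overlap.retainAll(nonRetryable);
        if (!overlap.isEmpty()) {
            // surfacing this at startup turns a runtime surprise into a startup failure
            throw new IllegalArgumentException(
                    what + " configured as both retryable and non-retryable: " + overlap);
        }
    }
}
</code></pre>
<p>Calling a check like this once per rule set inside <code>build()</code> is enough to make the illegal configuration above impossible to deploy.</p>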
<hr />
<h2>Precedence Model</h2>
<p>The evaluation order is explicit:</p>
<pre><code class="language-plaintext">1. Typed exception error type rules
   └── Explicitly non-retryable → do not retry
   └── Explicitly retryable     → retry
   └── Otherwise                → fall through

2. Exception class rules
   └── Explicitly non-retryable → do not retry
   └── Explicitly retryable     → retry
   └── Otherwise                → default

3. Default behaviour → retry
</code></pre>
<p>Non-retryable rules always take precedence.</p>
<p>If no rule matches, the policy defaults to retry.</p>
<p>This is intentional and can be flipped if a fail-safe approach is preferred.</p>
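<p>The precedence order can be captured in a compact decision function. The sketch below is a simplified, self-contained model (the enum, exception and rule sets are stand-ins for the real policy's configuration, and class matching here is by exact class rather than inheritance):</p>
<pre><code class="language-java">import java.util.Set;

enum SketchErrorType { VALIDATION_ERROR, SERVICE_SERVER_ERROR, UNKNOWN_ERROR }

class SketchException extends RuntimeException {
    final SketchErrorType type;
    SketchException(SketchErrorType type, String message) {
        super(message);
        this.type = type;
    }
}

class PrecedenceSketch {
    static final Set NON_RETRYABLE_TYPES = Set.of(SketchErrorType.VALIDATION_ERROR);
    static final Set RETRYABLE_TYPES = Set.of(SketchErrorType.SERVICE_SERVER_ERROR);
    static final Set NON_RETRYABLE_CLASSES = Set.of(IllegalStateException.class);
    static final Set RETRYABLE_CLASSES = Set.of(java.net.SocketTimeoutException.class);

    static boolean shouldRetry(Throwable t) {
        // 1. Typed error-type rules: non-retryable wins, then retryable, else fall through.
        if (t instanceof SketchException) {
            SketchErrorType type = ((SketchException) t).type;
            if (NON_RETRYABLE_TYPES.contains(type)) {
                return false;
            }
            if (RETRYABLE_TYPES.contains(type)) {
                return true;
            }
        }
        // 2. Exception-class rules, same non-retryable-first precedence.
        if (NON_RETRYABLE_CLASSES.contains(t.getClass())) {
            return false;
        }
        if (RETRYABLE_CLASSES.contains(t.getClass())) {
            return true;
        }
        // 3. Default behaviour: retry.
        return true;
    }
}
</code></pre>
<p>A typed <code>VALIDATION_ERROR</code> is rejected at step 1 and never reaches the class rules; an unclassified exception falls all the way through to the default.</p>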
<hr />
<h2>Usage Example 1 — REST Client</h2>
<p>Define the retry template once:</p>
<pre><code class="language-java">@Bean("serviceRetryTemplate")
public RetryTemplate serviceRetryTemplate() {

    var policy = SelectiveRetryPolicy.&lt;ApplicationException, ErrorType&gt;builder()
        .maxAttempts(3)
        .typedExceptionClass(ApplicationException.class)
        .errorTypeExtractor(ApplicationException::getType)
        .retryOnException(SocketTimeoutException.class)
        .doNotRetryOnErrorType(ErrorType.VALIDATION_ERROR)
        .doNotRetryOnErrorType(ErrorType.SERVICE_CLIENT_ERROR)
        .build();

    var backOff = new ExponentialBackOffPolicy();
    backOff.setInitialInterval(1000L);
    backOff.setMultiplier(2.0);
    backOff.setMaxInterval(10000L);

    RetryTemplate template = new RetryTemplate();
    template.setRetryPolicy(policy);
    template.setBackOffPolicy(backOff);
    return template;
}
</code></pre>
<p>Option A — Explicit RetryTemplate</p>
<pre><code class="language-java">@Service
public class TrackingServiceClient {

    private final RestTemplate restTemplate;
    private final RetryTemplate retryTemplate;

    public TrackingDTO getTracking(String shipmentId) {
        return retryTemplate.execute(ctx -&gt;
            restTemplate.getForObject("/tracking/{id}", TrackingDTO.class, shipmentId)
        );
    }
}
</code></pre>
<p>This makes retry behaviour visible at the call site.</p>
<hr />
<p>Option B — Declarative via @Retryable</p>
<pre><code class="language-java">@Bean("selectiveRetryInterceptor")
public RetryOperationsInterceptor selectiveRetryInterceptor() {
    RetryOperationsInterceptor interceptor = new RetryOperationsInterceptor();
    interceptor.setRetryOperations(serviceRetryTemplate());
    return interceptor;
}
</code></pre>
<pre><code class="language-java">@Service
public class TrackingServiceClient {

    private final RestTemplate restTemplate;

    @Retryable(interceptor = "selectiveRetryInterceptor")
    public TrackingDTO getTracking(String shipmentId) {
        return restTemplate.getForObject("/tracking/{id}", TrackingDTO.class, shipmentId);
    }
}
</code></pre>
<p>The retry contract remains centralised.</p>
<p>The annotation simply references it.</p>
<hr />
<h2>Usage Example 2 — Kafka Producer</h2>
<p>Kafka introduces its own mix of transient and permanent failures.</p>
<pre><code class="language-java">@Bean("kafkaRetryTemplate")
public RetryTemplate kafkaRetryTemplate() {

    var policy = SelectiveRetryPolicy.&lt;ApplicationException, ErrorType&gt;builder()
        .maxAttempts(3)
        .typedExceptionClass(ApplicationException.class)
        .errorTypeExtractor(ApplicationException::getType)
        .retryOnException(KafkaException.class)
        .retryOnException(NetworkException.class)
        .retryOnException(RetriableException.class)
        .doNotRetryOnException(SerializationException.class)
        .doNotRetryOnException(RecordTooLargeException.class)
        .doNotRetryOnException(InvalidTopicException.class)
        .doNotRetryOnErrorType(ErrorType.VALIDATION_ERROR)
        .build();

    var backOff = new ExponentialBackOffPolicy();
    backOff.setInitialInterval(2000L);
    backOff.setMultiplier(2.0);
    backOff.setMaxInterval(30000L);

    RetryTemplate template = new RetryTemplate();
    template.setRetryPolicy(policy);
    template.setBackOffPolicy(backOff);
    return template;
}
</code></pre>
<p>Apply it transparently using a proxy:</p>
<pre><code class="language-java">@Bean
public BeanNameAutoProxyCreator notificationProducerProxy() {
    BeanNameAutoProxyCreator creator = new BeanNameAutoProxyCreator();
    creator.setBeanNames("notificationsProducer");
    creator.setInterceptorNames("kafkaRetryInterceptor");
    return creator;
}
</code></pre>
<p>Producer remains clean:</p>
<pre><code class="language-java">@Service("notificationsProducer")
public class NotificationsProducer {

    private final KafkaTemplate&lt;String, Object&gt; kafkaTemplate;

    public void publish(String topic, Object message) {
        kafkaTemplate.send(topic, message);
    }
}
</code></pre>
<hr />
<h2>Why This Exists (Preventative Design)</h2>
<p>The motivation here was not a dramatic outage.</p>
<p>It was to prevent two quiet failure modes from creeping in over time:</p>
<ul>
<li><p><strong>Missed retries</strong> — transient failures wrapped in generic exceptions that never get retried.</p>
</li>
<li><p><strong>Over-retries</strong> — permanent failures retried repeatedly because the policy cannot distinguish them.</p>
</li>
</ul>
<p>Both come from retry logic that lacks semantic awareness.</p>
<p>This pattern removes that ambiguity.</p>
<hr />
<h2>The Trade-off</h2>
<p>This approach shifts responsibility to the throw site.</p>
<p>If an error is classified incorrectly, retry behaviour will also be incorrect.</p>
<p>That is deliberate.</p>
<p>Inferring retry intent from HTTP codes or exception messages tends to become brittle and distributed.</p>
<p>Encoding intent explicitly at the throw site keeps retry behaviour centralised and predictable.</p>
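<p>Concretely, encoding intent at the throw site means a REST client classifies the failure once, where the status code is known. A sketch (the enum is re-declared here to stay self-contained, and the thresholds are assumptions, not the article's code):</p>
<pre><code class="language-java">enum ClassifiedType { VALIDATION_ERROR, SERVICE_CLIENT_ERROR, SERVICE_SERVER_ERROR, UNKNOWN_ERROR }

// Illustrative: map an HTTP status to a semantic error type at the throw site,
// so every retry layer downstream agrees on the meaning of the failure.
class ThrowSiteClassifier {

    static ClassifiedType fromStatus(int status) {
        if (status >= 500) {
            return ClassifiedType.SERVICE_SERVER_ERROR; // transient: retry may help
        }
        if (status == 400 || status == 422) {
            return ClassifiedType.VALIDATION_ERROR;     // permanent: never retry
        }
        if (status >= 400) {
            return ClassifiedType.SERVICE_CLIENT_ERROR; // permanent 4xx
        }
        return ClassifiedType.UNKNOWN_ERROR;
    }
}
</code></pre>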
<hr />
<h2>Closing Thoughts</h2>
<p>Retries are not a technical detail.</p>
<p>They are part of system behaviour.</p>
<p>A retry policy that does not understand failure semantics is not resilience.</p>
<p>It is optimism with overhead.</p>
<p>Spring Retry provides the foundation.</p>
<p>Adding a semantic layer on top makes retry behaviour readable, auditable, and difficult to misconfigure.</p>
<p>In a multi-service codebase, that clarity is usually the difference between retry logic you trust and retry logic you tolerate.</p>
]]></content:encoded></item><item><title><![CDATA[Bhishma, Nuremberg, and When Humans Stop Thinking]]></title><description><![CDATA[Bhishma is one of the most tragic figures in the Mahabharata—not because he lacked moral clarity, but because he had too much of it and still failed to act.
In the epic, Bhishma is not a villain or a ]]></description><link>https://aravindv.pro/bhishma-nuremberg-and-when-humans-stop-thinking</link><guid isPermaLink="true">https://aravindv.pro/bhishma-nuremberg-and-when-humans-stop-thinking</guid><category><![CDATA[AI]]></category><category><![CDATA[responsible AI]]></category><category><![CDATA[ai-in-the-trenches]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Wed, 28 Jan 2026 06:02:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769579904862/9ef8fb57-9b58-43d8-8108-f9ae1cfff223.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Bhishma is one of the most tragic figures in the Mahabharata—not because he lacked moral clarity, but because he had too much of it and still failed to act.</p>
<p>In the epic, Bhishma is not a villain or a bystander. He is the most respected elder of the kingdom, a master of law and ethics—the person everyone agrees understands <em>dharma</em> better than anyone else in the room.</p>
<p>He knew what was right.<br />He knew what was wrong.<br />And yet, when it mattered most, he stood still.</p>
<p>Bhishma’s justification was simple and devastatingly familiar: <em>“I am bound by my duty.”</em><br />As a young man, he had taken an extraordinary vow—to renounce personal power and ambition forever, and to serve the throne unquestioningly, no matter who occupied it or how it was used.</p>
<p>That vow made him incorruptible.<br />It also made him immobile.</p>
<p>Duty to the vow he could not break.<br />Duty to the throne he would not abandon.<br />Duty to the system he had helped uphold, even as it decayed.</p>
<p>He was not evil.<br />He was obedient.</p>
<p>That distinction matters, because history has taught us—repeatedly—that some of the worst failures don’t come from malice. They come from people who stop thinking at precisely the moment when thinking is required.</p>
<h2>The Nuremberg Defense Was Not About Monsters</h2>
<p>After World War II, many of the accused at the Nuremberg Trials offered the same explanation: <em>“I was only following orders.”</em></p>
<p>The world rejected that defense—not because it denied coercion or hierarchy, but because it denied <strong>moral agency</strong>.</p>
<p>The ruling was clear:<br />Obedience does not erase responsibility.<br />Systems do not commit crimes. People do.</p>
<p>What’s striking is that the Nuremberg defense wasn’t about cruelty—it was about <strong>abdication</strong>. About handing over judgment to authority and calling it duty.</p>
<p>If this feels uncomfortably close to Bhishma, it should.</p>
<h2>AI Didn’t Create This Problem</h2>
<p>AI didn’t invent moral abdication.<br />It just made it scalable.</p>
<p>We’re entering a phase where AI systems don’t just suggest—they act.<br />They open pull requests.<br />They recommend architectural changes.<br />They flag risks, approve workflows, and increasingly, execute decisions.</p>
<p>And with that comes a new, dangerously convenient sentence:</p>
<blockquote>
<p>“The system decided.”</p>
</blockquote>
<p>This is not a technical statement.<br />It’s a moral one.</p>
<p>AI doesn’t remove responsibility—it creates <strong>plausible deniability at machine speed</strong>.</p>
<h2>What I’ve Seen in Practice</h2>
<p>In real engineering teams, the story isn’t fear or resistance. It’s subtler.</p>
<p>First, people struggle just to use AI effectively.<br />Then quality issues appear—because AI will generate <em>something</em> even when standards are vague.<br />Only later does the hardest realization land: <strong>ownership never left the human</strong>.</p>
<p>AI-generated code doesn’t fail differently.<br />It fails <em>faster</em>.</p>
<p>What changes is not the nature of responsibility, but its <strong>concentration</strong>. A single decision, poorly reviewed, can now propagate across a codebase at a speed we’ve never had to manage before.</p>
<p>Interestingly, AI also shifts where effort lives. Typing becomes cheap. Thinking becomes expensive. Code reviews move away from syntax toward architecture, correctness, and long-term consequences.</p>
<p>AI doesn’t replace judgment.<br />It exposes how much we relied on muscle memory instead of thought.</p>
<h2>The New Bhishma Risk</h2>
<p>The modern Bhishma won’t stand on a battlefield.</p>
<p>He’ll sit behind dashboards.<br />He’ll approve pipelines.<br />He’ll trust systems that are “working as designed.”</p>
<p>And like Bhishma, he won’t be wrong because he lacked knowledge—<br />but because he treated <strong>faithful execution as righteousness</strong>, and stopped questioning when questioning was required.</p>
<p>History shows us that the hardest failures don’t come from bad intent.<br />They come from people faithfully executing systems without enough questioning built in.</p>
<h2>Krishna’s Correction (Without the Theology)</h2>
<p>Krishna’s message to Arjuna was not “do your duty blindly.”</p>
<p>It was: <em>think, discern, and then act—without hiding behind outcomes.</em></p>
<p>Translated into modern systems thinking:</p>
<ul>
<li><p>Delegation is not abdication</p>
</li>
<li><p>Automation is not absolution</p>
</li>
<li><p>Non-attachment does not mean non-responsibility</p>
</li>
</ul>
<p>AI can assist judgment.<br />It cannot replace it.</p>
<p>The moment we let systems think <em>for</em> us instead of <em>with</em> us, we recreate the conditions that history has already judged harshly.</p>
<h2>A Quiet Test</h2>
<p>Here’s a simple question worth asking as AI becomes more autonomous:</p>
<blockquote>
<p>If this decision causes harm, who would say “I own this”?</p>
</blockquote>
<p>If the answer is vague, distributed, or points to “the system,” we already know how that story ends.</p>
<p>We’ve read it before.<br />In epics.<br />In court transcripts.<br />And now, quietly, in our codebases.</p>
<p>The real risk isn’t artificial intelligence.</p>
<p>It’s when humans stop thinking—and call it progress.</p>
]]></content:encoded></item><item><title><![CDATA[Resilience]]></title><description><![CDATA[What is Resilience?


Introduction
Resilience is one of the most important aspects of any system - more so with the cloud wherein we have a true Distributed System! And when we say a distributed syste]]></description><link>https://aravindv.pro/resilience</link><guid isPermaLink="true">https://aravindv.pro/resilience</guid><category><![CDATA[Resilience]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[fallacy]]></category><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Sun, 11 Jan 2026 16:22:01 GMT</pubDate><content:encoded><![CDATA[<h2>What is Resilience?</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1628523048949/w0kfxu8Hk.png" alt="image.png" />

<h1>Introduction</h1>
<p><strong>Resilience</strong> is one of the most important aspects of any system - more so with the cloud wherein we have a true Distributed System! And when we say a distributed system, the resilience parameters are affected by the evergreen paper - <em><strong>Fallacies of Distributed Computing</strong></em>!</p>
<p>This blog is a two-part series -</p>
<ol>
<li><p>Fallacies of Distributed Computing</p>
</li>
<li><p>How are these addressed in a micro-service architecture?</p>
</li>
</ol>
<p>Before we get into how to go about building resilient systems, we need to understand the word - <strong>Resilience</strong>!</p>
<h1>What is Resilience?</h1>
<p>The following are the definitions that we come across when we look up the word <em>Resilience</em></p>
<h2>The English Language</h2>
<p><em>The capacity to recover quickly from difficulties; toughness</em></p>
<h2>Computer Architecture</h2>
<p><em>It is the ability to provide and maintain an acceptable level of</em> <em><strong>service</strong></em> <em>in the face of</em> <em><strong>faults</strong></em> <em>and</em> <em><strong>challenges</strong></em> <em>to normal operation</em></p>
<h2>Micro-services</h2>
<p><em>The ability of a microservice to function in spite of errors in the dependent services</em></p>
<p>In other words, failure of one service should not result in the failure of the system as a whole.</p>
<h1>Why is Resilience Important?</h1>
<p>Well, this can be summarised by this beautiful line -</p>
<blockquote>
<p>The oak fought the wind and was broken, the willow bent when it must and survived.<br /><strong>Jordan, R. (1994). The Fires of Heaven. 2nd ed. U.S: Tor Books</strong></p>
</blockquote>
<p>What this effectively means is that any system we build should not be hardened against failure; rather, we should be able to cope with the problems that arise and ensure that we stay up and running. If we are hardened against failure, we are like the oak tree! There could always be a wind strong enough to break us!</p>
<p>Most importantly, we can always try to reduce failures but we will never be able to eliminate them - in the simplest case, a failure could be a burnt drive and in the worst case it could be a burnt data center. We don't have control over either of these! Consequently, we should be able to handle these never-eliminated failures! This is where we have backup disks and Disaster Recovery (DR) centers (costing us the big bucks!)</p>
<p>When it comes to micro-services, the architecture by itself brings in a level of resilience! We have a highly decoupled system with cohesive services!</p>
<blockquote>
<p>Enough has been spoken about micro-services that I don't need to elaborate on its usefulness (and at times pain!)</p>
</blockquote>
<p>The most important aspect when it comes to Resilience - especially in Cloud and Distributed Architectures - is to ensure that the assertions in the paper <em>The Fallacies of Distributed Computing</em> by Peter Deutsch and others (as part of the erstwhile Sun Microsystems) are addressed!</p>
<h1>Fallacies of Distributed Computing</h1>
<p>These fallacies mostly arise due to the introduction of the network into the architecture or application. In order to build resilient applications, these fallacies must be addressed.</p>
<p>The fallacies are -</p>
<ol>
<li><p><strong>Network is reliable</strong> - A network is never reliable!</p>
</li>
<li><p><strong>Latency is zero</strong> - Latency can be reduced by optimising the network but can never be zero - a network packet takes time to move from one point to another, and this adds latency!</p>
</li>
<li><p><strong>Infinite bandwidth</strong> - There are physical and logical limitations on the bandwidth provided by the network.</p>
</li>
<li><p><strong>Network is secure</strong> - A chain is only as strong as its weakest link, or so they say! If networks were secure, we wouldn't have so many data leaks!</p>
</li>
<li><p><strong>Network topology is constant</strong> - We always need to assume that the network topology is going to change. A router along the way can go kaput, or a new firewall rule is introduced! Only change is constant!</p>
</li>
<li><p><strong>Transport cost is zero</strong> - Every network call has an associated cost!</p>
</li>
<li><p><strong>Network is homogeneous</strong> - In the real world and the digital world, we have our differences, and we need to ensure that we handle them!</p>
</li>
</ol>
<p>A micro-service architecture brings in complexities in this area as we now have a highly distributed system!</p>
<p>We'll look at how these fallacies are/can be addressed in micro-service architecture in the next blog!</p>
]]></content:encoded></item><item><title><![CDATA[Hello!]]></title><description><![CDATA[Hi 👋 I'm Aravind
I'm a Principal Engineer / Architect with 20+ years of experience designing and evolving large-scale, distributed platforms. Over the years, my work has naturally converged at the intersection of architecture, platform engineering, ...]]></description><link>https://aravindv.pro/hello</link><guid isPermaLink="true">https://aravindv.pro/hello</guid><dc:creator><![CDATA[Aravind Viswanathan]]></dc:creator><pubDate>Fri, 26 Dec 2025 07:57:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/PyegXTiAWEY/upload/c8d9a6b1aeee419d20cc27e0695ce687.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-hi-im-aravind">Hi 👋 I'm Aravind</h2>
<p>I'm a Principal Engineer / Architect with <strong>20+ years of experience</strong> designing and evolving large-scale, distributed platforms. Over the years, my work has naturally converged at the intersection of <strong>architecture, platform engineering, AI governance, and organizational effectiveness</strong>.</p>
<p>I enjoy working on systems that have to <strong>hold up over time</strong> — systems that scale, stay operable in production, and remain trustworthy as complexity and organizational size grow. I care deeply about <strong>clear domain boundaries</strong>, thoughtful trade-offs, and technical decisions that still make sense years later.</p>
<hr />
<h2 id="heading-what-i-spend-most-of-my-time-on">🧭 What I spend most of my time on</h2>
<h3 id="heading-platform-amp-distributed-systems-architecture">Platform &amp; Distributed Systems Architecture</h3>
<p>I design and evolve high-throughput, fault-tolerant systems with a strong bias toward clarity of ownership, predictable behavior in production, and operational simplicity over clever abstractions.</p>
<h3 id="heading-domain-driven-design-at-scale">Domain-Driven Design at scale</h3>
<p>I’ve spent a lot of time helping teams define bounded contexts, align technical boundaries with organizational reality, and avoid the slow erosion that comes from accidental coupling between domains.</p>
<h3 id="heading-ai-platform-design-amp-governance">AI Platform Design &amp; Governance</h3>
<p>As AI systems started becoming part of core platforms, I focused on building <strong>governance models that don’t kill innovation</strong> but still create accountability and trust.</p>
<p>This includes:</p>
<ul>
<li><p>Designing model lifecycle and accountability frameworks</p>
</li>
<li><p>Building <strong>AI-assisted developer tooling</strong></p>
</li>
<li><p>Ensuring AI-enabled systems are observable and operable in production</p>
</li>
</ul>
<h3 id="heading-api-amp-integration-strategy">API &amp; Integration Strategy</h3>
<p>I approach APIs as long-lived products: designed first, versioned carefully, and governed just enough to stay consistent without becoming a bottleneck.</p>
<h3 id="heading-engineering-effectiveness-amp-technical-leadership">Engineering Effectiveness &amp; Technical Leadership</h3>
<p>A lot of my impact comes from raising the technical bar across teams — through design reviews, mentorship, and standards — and by building <strong>developer tooling and automation</strong> that makes the right path the easy one.</p>
<h3 id="heading-production-first-thinking">Production-first thinking</h3>
<p>I treat observability, reliability, security, and cost as design constraints from day one. Zero-downtime deployments, blue/green and canary strategies, and operational guardrails aren’t “nice to have” — they’re part of the architecture.</p>
<hr />
<h2 id="heading-technical-depth-the-tools-i-reach-for">🛠️ Technical depth (the tools I reach for)</h2>
<ul>
<li><p><strong>Languages</strong>: Java (20+ years), Go</p>
</li>
<li><p><strong>Application Platforms</strong>: Spring / Spring Boot, cloud-native architectures</p>
</li>
<li><p><strong>Architecture</strong>: Microservices, event-driven systems, API platforms</p>
</li>
<li><p><strong>Cloud &amp; Infrastructure</strong>: Kubernetes, Docker, Helm, GitOps (ArgoCD), Terraform</p>
</li>
<li><p><strong>Platform Concerns</strong>: Infrastructure migration &amp; modernization, container security and runtime protection</p>
</li>
<li><p><strong>Data</strong>: Relational-first design, PostgreSQL sharding &amp; partitioning, MongoDB for configuration and operational data, pragmatic polyglot persistence</p>
</li>
<li><p><strong>Streaming &amp; Messaging</strong>: Kafka-based architectures</p>
</li>
<li><p><strong>Observability</strong>: Metrics, tracing, logging, OpenTelemetry-based systems</p>
</li>
<li><p><strong>AI &amp; Governance</strong>:</p>
<ul>
<li><p>AI/ML platform integration</p>
</li>
<li><p>Model lifecycle &amp; accountability frameworks</p>
</li>
<li><p>Operational observability for AI systems</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-what-this-github-space-is-for">📌 What this GitHub space is for</h2>
<p>This GitHub space reflects how I think and work:</p>
<ul>
<li><p>Architecture explorations and design notes</p>
</li>
<li><p>Platform, infrastructure, and AI governance experiments</p>
</li>
<li><p>Opinionated but pragmatic approaches to system and platform design</p>
</li>
<li><p>Code and documentation written with <strong>production reality</strong> in mind</p>
</li>
</ul>
<p>I value <strong>clarity over cleverness</strong>, <strong>stability over novelty</strong>, and <strong>decisions that age well</strong>.</p>
<hr />
<h2 id="heading-ask-me-about">💬 Ask me about</h2>
<ul>
<li><p>Designing platforms that survive scale, re-orgs, and regulatory pressure</p>
</li>
<li><p>Domain boundaries, ownership, and DDD in real organizations</p>
</li>
<li><p>JVM performance, GC behavior, and production tuning</p>
</li>
<li><p>API strategy and governance</p>
</li>
<li><p>Responsible AI adoption and governance models</p>
</li>
<li><p>Technical leadership and architecture influence at senior levels</p>
</li>
</ul>
<hr />
<h2 id="heading-perspective">⚡ Perspective</h2>
<p>Good architecture is less about frameworks and more about <strong>constraints, trade-offs, and knowing what <em>not</em> to build</strong> — especially when AI enters the system.</p>
]]></content:encoded></item></channel></rss>