Not All Failures Deserve a Second Chance

Retry logic is one of those things that feels simple until it isn’t.

You add @Retryable, set maxAttempts = 3, and move on. Everything works — until the system happily retries a validation error three times, re-publishes a message that was never going to fit in Kafka, or quietly fails to retry something that actually was transient.

At that point, retry logic stops being resilience. It becomes noise with a delay.

This post describes a pattern we introduced to prevent that kind of misconfiguration from ever becoming a production problem: a selective retry policy that makes retry intent explicit and centrally defined.

What Spring Retry Already Gives You

Spring Retry’s SimpleRetryPolicy allows exception-class-based configuration:

Map<Class<? extends Throwable>, Boolean> retryable = new HashMap<>();
retryable.put(SocketTimeoutException.class, true);
retryable.put(ValidationException.class, false);

new SimpleRetryPolicy(3, retryable, true);

This works — but only at the class level.

In practice:

A single ApplicationException may represent multiple failure modes.
Configuration becomes scattered across services.
There is no protection against marking the same exception as both retryable and non-retryable.

Retries are not about exception structure.

They are about failure semantics.
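To make the class-level limitation concrete, here is a minimal JDK-only sketch. It is not Spring Retry itself: `GenericFailure` and `ClassKeyedRules` are hypothetical names, and the lookup matches the exact class only, which is simpler than what Spring's classifier does. The point is just that a class-keyed rule returns one verdict for every failure that shares a class.

```java
import java.net.SocketTimeoutException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical catch-all exception that hides several failure modes.
class GenericFailure extends RuntimeException {
    GenericFailure(String message) {
        super(message);
    }
}

// Hypothetical class-keyed rule table (exact-class match only).
final class ClassKeyedRules {

    private final Map<Class<? extends Throwable>, Boolean> rules = new HashMap<>();

    ClassKeyedRules() {
        rules.put(SocketTimeoutException.class, true);
        rules.put(GenericFailure.class, false);
    }

    // The verdict depends only on the runtime class, never on what went wrong.
    boolean shouldRetry(Throwable failure) {
        return rules.getOrDefault(failure.getClass(), false);
    }
}
```

A `GenericFailure` caused by an unavailable upstream and one caused by malformed input get the same answer, whichever way the rule is set.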

The Core Insight: Failures Have Meaning

Broadly speaking:

Transient failures
Temporary issues: network blips, timeouts, infrastructure instability.
Retries help.

Permanent failures
Invalid input, malformed payloads, authentication failures.
Retries waste time.

Retry logic should understand that distinction.

Step One: A Typed Exception with an Error Enum

Instead of proliferating exception classes, introduce a single typed exception:

public enum ErrorType {
    VALIDATION_ERROR,
    SERVICE_CLIENT_ERROR,   // 4xx
    SERVICE_SERVER_ERROR,   // 5xx
    MESSAGING_ERROR,
    AUTHENTICATION_ERROR,
    UNKNOWN_ERROR
}

public class ApplicationException extends RuntimeException {

    private final ErrorType type;

    public ApplicationException(ErrorType type, String message) {
        super(message);
        this.type = type;
    }

    public ApplicationException(ErrorType type, String message, Throwable cause) {
        super(message, cause);
        this.type = type;
    }

    public ErrorType getType() {
        return type;
    }
}

Now the exception carries semantic intent.

The retry policy no longer needs to infer behaviour from class names or messages.
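As an illustration of what classification at the throw site might look like, the sketch below maps an HTTP status to an error type once, at the client boundary. `HttpFailures.fromStatus` is a hypothetical helper, and treating 401 and 403 as `AUTHENTICATION_ERROR` is an assumption rather than something the policy requires. The enum and exception are repeated in trimmed form only so the block compiles on its own.

```java
enum ErrorType {
    VALIDATION_ERROR,
    SERVICE_CLIENT_ERROR,   // 4xx
    SERVICE_SERVER_ERROR,   // 5xx
    MESSAGING_ERROR,
    AUTHENTICATION_ERROR,
    UNKNOWN_ERROR
}

class ApplicationException extends RuntimeException {

    private final ErrorType type;

    ApplicationException(ErrorType type, String message) {
        super(message);
        this.type = type;
    }

    ErrorType getType() {
        return type;
    }
}

// Hypothetical helper: decide the error type in one place, where the response is read.
final class HttpFailures {

    private HttpFailures() {
    }

    static ApplicationException fromStatus(int status, String detail) {
        ErrorType type;
        if (status == 401 || status == 403) {
            type = ErrorType.AUTHENTICATION_ERROR;   // assumption: auth failures get their own type
        } else if (status >= 400 && status < 500) {
            type = ErrorType.SERVICE_CLIENT_ERROR;
        } else if (status >= 500 && status < 600) {
            type = ErrorType.SERVICE_SERVER_ERROR;
        } else {
            type = ErrorType.UNKNOWN_ERROR;
        }
        return new ApplicationException(type, "HTTP " + status + ": " + detail);
    }
}
```

A client would then `throw HttpFailures.fromStatus(status, body)` and leave the retry decision to the policy.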

Step Two: The Selective Retry Policy

SelectiveRetryPolicy extends SimpleRetryPolicy and introduces:

retryable exception classes
non-retryable exception classes
retryable error types
non-retryable error types
strict validation against overlap

Overlap Protection

If an error type or exception class is configured as both retryable and non-retryable, the application fails at startup.

For example, this configuration is rejected:

SelectiveRetryPolicy.<ApplicationException, ErrorType>builder()
    .retryOnErrorType(ErrorType.SERVICE_SERVER_ERROR)
    .doNotRetryOnErrorType(ErrorType.SERVICE_SERVER_ERROR) // Illegal
    .build();

Misconfiguration is treated as a configuration error, not a runtime surprise. The actual policy is here.
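The overlap check itself can be sketched in a few lines. This is a simplified stand-in covering error types only, not the real `SelectiveRetryPolicy`. `ErrorTypeRules`, its method names, and the choice of `IllegalStateException` are assumptions made for illustration.

```java
import java.util.EnumSet;

enum ErrorType {
    VALIDATION_ERROR, SERVICE_CLIENT_ERROR, SERVICE_SERVER_ERROR,
    MESSAGING_ERROR, AUTHENTICATION_ERROR, UNKNOWN_ERROR
}

// Hypothetical rule holder that refuses contradictory configuration.
final class ErrorTypeRules {

    private final EnumSet<ErrorType> retryable;
    private final EnumSet<ErrorType> nonRetryable;

    private ErrorTypeRules(EnumSet<ErrorType> retryable, EnumSet<ErrorType> nonRetryable) {
        this.retryable = retryable;
        this.nonRetryable = nonRetryable;
    }

    static Builder builder() {
        return new Builder();
    }

    boolean isExplicitlyRetryable(ErrorType type) {
        return retryable.contains(type);
    }

    boolean isExplicitlyNonRetryable(ErrorType type) {
        return nonRetryable.contains(type);
    }

    static final class Builder {

        private final EnumSet<ErrorType> retryable = EnumSet.noneOf(ErrorType.class);
        private final EnumSet<ErrorType> nonRetryable = EnumSet.noneOf(ErrorType.class);

        Builder retryOnErrorType(ErrorType type) {
            retryable.add(type);
            return this;
        }

        Builder doNotRetryOnErrorType(ErrorType type) {
            nonRetryable.add(type);
            return this;
        }

        ErrorTypeRules build() {
            // Anything present in both sets is a contradiction, so fail before any call is made.
            EnumSet<ErrorType> overlap = EnumSet.copyOf(retryable);
            overlap.retainAll(nonRetryable);
            if (!overlap.isEmpty()) {
                throw new IllegalStateException(
                    "Error types configured as both retryable and non-retryable: " + overlap);
            }
            return new ErrorTypeRules(EnumSet.copyOf(retryable), EnumSet.copyOf(nonRetryable));
        }
    }
}
```

Because `build()` would normally run inside a `@Bean` method, an exception thrown there aborts context startup, which is what turns a contradictory rule into a startup failure.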

Precedence Model

The evaluation order is explicit:

1. Typed exception error type rules
   └── Explicitly non-retryable → do not retry
   └── Explicitly retryable     → retry
   └── Otherwise                → fall through

2. Exception class rules
   └── Explicitly non-retryable → do not retry
   └── Explicitly retryable     → retry
   └── Otherwise                → default

3. Default behaviour → retry

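Within each level, non-retryable rules take precedence over retryable ones. Error type rules are evaluated before exception class rules, so an explicit error type rule decides the outcome before any class rule is consulted.

If no rule matches, the policy defaults to retry.

This is intentional and can be flipped to default to no retry if a more conservative approach is preferred.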
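The evaluation order can be written down as a single decision function. The sketch below is a stand-in for the real policy: `RetryDecision` is a hypothetical name, and matching subclasses of a configured exception class is an assumption about how class rules behave.

```java
import java.util.Set;

enum ErrorType {
    VALIDATION_ERROR, SERVICE_CLIENT_ERROR, SERVICE_SERVER_ERROR,
    MESSAGING_ERROR, AUTHENTICATION_ERROR, UNKNOWN_ERROR
}

class ApplicationException extends RuntimeException {

    private final ErrorType type;

    ApplicationException(ErrorType type, String message) {
        super(message);
        this.type = type;
    }

    ErrorType getType() {
        return type;
    }
}

// Hypothetical decision function mirroring the three-step order.
final class RetryDecision {

    private final Set<ErrorType> retryTypes;
    private final Set<ErrorType> noRetryTypes;
    private final Set<Class<? extends Throwable>> retryClasses;
    private final Set<Class<? extends Throwable>> noRetryClasses;

    RetryDecision(Set<ErrorType> retryTypes,
                  Set<ErrorType> noRetryTypes,
                  Set<Class<? extends Throwable>> retryClasses,
                  Set<Class<? extends Throwable>> noRetryClasses) {
        this.retryTypes = retryTypes;
        this.noRetryTypes = noRetryTypes;
        this.retryClasses = retryClasses;
        this.noRetryClasses = noRetryClasses;
    }

    boolean shouldRetry(Throwable failure) {
        // 1. Error type rules apply only to the typed exception.
        if (failure instanceof ApplicationException) {
            ErrorType type = ((ApplicationException) failure).getType();
            if (noRetryTypes.contains(type)) {
                return false;
            }
            if (retryTypes.contains(type)) {
                return true;
            }
        }
        // 2. Exception class rules, non-retryable checked first.
        if (matches(noRetryClasses, failure)) {
            return false;
        }
        if (matches(retryClasses, failure)) {
            return true;
        }
        // 3. No rule matched: default to retry.
        return true;
    }

    // Assumption: a rule for a class also covers its subclasses.
    private static boolean matches(Set<Class<? extends Throwable>> classes, Throwable failure) {
        for (Class<? extends Throwable> candidate : classes) {
            if (candidate.isInstance(failure)) {
                return true;
            }
        }
        return false;
    }
}
```

Flipping the default to a conservative no-retry would mean changing only the final `return true`.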

Usage Example 1 — REST Client

Define the retry template once:

@Bean("serviceRetryTemplate")
public RetryTemplate serviceRetryTemplate() {

    var policy = SelectiveRetryPolicy.<ApplicationException, ErrorType>builder()
        .maxAttempts(3)
        .typedExceptionClass(ApplicationException.class)
        .errorTypeExtractor(ApplicationException::getType)
        .retryOnException(SocketTimeoutException.class)
        .doNotRetryOnErrorType(ErrorType.VALIDATION_ERROR)
        .doNotRetryOnErrorType(ErrorType.SERVICE_CLIENT_ERROR)
        .build();

    var backOff = new ExponentialBackOffPolicy();
    backOff.setInitialInterval(1000L);
    backOff.setMultiplier(2.0);
    backOff.setMaxInterval(10000L);

    RetryTemplate template = new RetryTemplate();
    template.setRetryPolicy(policy);
    template.setBackOffPolicy(backOff);
    return template;
}
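It helps to see what these back-off settings cost in practice. The sketch below computes the waits between attempts, assuming the policy sleeps for the current interval and then multiplies it, capped at the maximum. That is our understanding of `ExponentialBackOffPolicy`, and `BackOffSchedule` is a hypothetical helper rather than part of Spring Retry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the waits between attempts.
final class BackOffSchedule {

    private BackOffSchedule() {
    }

    // With maxAttempts attempts there are at most maxAttempts - 1 waits.
    static List<Long> waits(int maxAttempts, long initialMs, double multiplier, long maxMs) {
        List<Long> waits = new ArrayList<>();
        long interval = Math.min(initialMs, maxMs);
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add(interval);
            // Grow the interval for the next wait, never beyond the cap.
            interval = Math.min((long) (interval * multiplier), maxMs);
        }
        return waits;
    }
}
```

Under that assumption, `BackOffSchedule.waits(3, 1000L, 2.0, 10000L)` gives `[1000, 2000]`: a call that fails all three attempts spends about three extra seconds waiting. The 10-second cap only matters if `maxAttempts` is raised to six or more.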

Option A — Explicit RetryTemplate

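@Service
public class TrackingServiceClient {

    private final RestTemplate restTemplate;
    private final RetryTemplate retryTemplate;

    // More than one RetryTemplate bean is defined, so select this one by name.
    public TrackingServiceClient(RestTemplate restTemplate,
            @Qualifier("serviceRetryTemplate") RetryTemplate retryTemplate) {
        this.restTemplate = restTemplate;
        this.retryTemplate = retryTemplate;
    }

    public TrackingDTO getTracking(String shipmentId) {
        // Each attempt re-runs the lambda; the policy decides whether a failure is retried.
        return retryTemplate.execute(ctx ->
            restTemplate.getForObject("/tracking/{id}", TrackingDTO.class, shipmentId)
        );
    }
}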

This makes retry behaviour visible at the call site.

Option B — Declarative via @Retryable

@Bean("selectiveRetryInterceptor")
public RetryOperationsInterceptor selectiveRetryInterceptor() {
    RetryOperationsInterceptor interceptor = new RetryOperationsInterceptor();
    interceptor.setRetryOperations(serviceRetryTemplate());
    return interceptor;
}

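@Service
public class TrackingServiceClient {

    private final RestTemplate restTemplate;

    public TrackingServiceClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // @Retryable only takes effect when @EnableRetry is present on a @Configuration class.
    @Retryable(interceptor = "selectiveRetryInterceptor")
    public TrackingDTO getTracking(String shipmentId) {
        return restTemplate.getForObject("/tracking/{id}", TrackingDTO.class, shipmentId);
    }
}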

The retry contract remains centralised.

The annotation simply references it.

Usage Example 2 — Kafka Producer

Kafka introduces its own mix of transient and permanent failures.

@Bean("kafkaRetryTemplate")
public RetryTemplate kafkaRetryTemplate() {

    var policy = SelectiveRetryPolicy.<ApplicationException, ErrorType>builder()
        .maxAttempts(3)
        .typedExceptionClass(ApplicationException.class)
        .errorTypeExtractor(ApplicationException::getType)
        .retryOnException(KafkaException.class)
        .retryOnException(NetworkException.class)
        .retryOnException(RetriableException.class)
        .doNotRetryOnException(SerializationException.class)
        .doNotRetryOnException(RecordTooLargeException.class)
        .doNotRetryOnException(InvalidTopicException.class)
        .doNotRetryOnErrorType(ErrorType.VALIDATION_ERROR)
        .build();

    var backOff = new ExponentialBackOffPolicy();
    backOff.setInitialInterval(2000L);
    backOff.setMultiplier(2.0);
    backOff.setMaxInterval(30000L);

    RetryTemplate template = new RetryTemplate();
    template.setRetryPolicy(policy);
    template.setBackOffPolicy(backOff);
    return template;
}

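Apply it transparently using a proxy, which refers to a retry interceptor by bean name:

@Bean("kafkaRetryInterceptor")
public RetryOperationsInterceptor kafkaRetryInterceptor() {
    RetryOperationsInterceptor interceptor = new RetryOperationsInterceptor();
    interceptor.setRetryOperations(kafkaRetryTemplate());
    return interceptor;
}

@Bean
public BeanNameAutoProxyCreator notificationProducerProxy() {
    BeanNameAutoProxyCreator creator = new BeanNameAutoProxyCreator();
    creator.setBeanNames("notificationsProducer");
    creator.setInterceptorNames("kafkaRetryInterceptor");
    return creator;
}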

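The producer remains clean:

@Service("notificationsProducer")
public class NotificationsProducer {

    private final KafkaTemplate<String, Object> kafkaTemplate;

    public NotificationsProducer(KafkaTemplate<String, Object> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(String topic, Object message) {
        // send() is asynchronous: failures reported after this call returns
        // surface on the returned future, not to the retry interceptor.
        kafkaTemplate.send(topic, message);
    }
}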
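One caveat: `KafkaTemplate.send` is asynchronous, so the interceptor only sees failures that are thrown on the calling thread. If every send failure should reach the retry policy, one option is to wait for the result and rethrow the underlying cause. The sketch below assumes the future is a `CompletableFuture`, which is what recent Spring Kafka versions return as far as we know (older versions return a different future type). `FutureSupport.awaitUnwrapped` is a hypothetical helper.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical helper for turning an asynchronous failure into a thrown exception.
final class FutureSupport {

    private FutureSupport() {
    }

    // Blocks until the future completes and rethrows the original failure
    // instead of the CompletionException wrapper that join() adds.
    static <T> T awaitUnwrapped(CompletableFuture<T> future) {
        try {
            return future.join();
        } catch (CompletionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            if (cause instanceof Error) {
                throw (Error) cause;
            }
            throw e; // checked or missing cause: keep the wrapper
        }
    }
}
```

In `publish`, that would read `FutureSupport.awaitUnwrapped(kafkaTemplate.send(topic, message));`. The trade-off is that the calling thread now blocks for each send. The rethrown exception may itself be a wrapper around the broker error, so it is worth checking which class the policy actually sees.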

Why This Exists (Preventative Design)

The motivation here was not a dramatic outage.

It was to prevent two quiet failure modes from creeping in over time:

Missed retries — transient failures wrapped in generic exceptions that never get retried.
Over-retries — permanent failures retried repeatedly because the policy cannot distinguish them.

Both come from retry logic that lacks semantic awareness.

This pattern removes that ambiguity.

The Trade-off

This approach shifts responsibility to the throw site.

If an error is classified incorrectly, retry behaviour will also be incorrect.

That is deliberate.

Inferring retry intent from HTTP codes or exception messages tends to become brittle and distributed.

Encoding intent explicitly at the throw site keeps retry behaviour centralised and predictable.

Closing Thoughts

Retries are not a technical detail.

They are part of system behaviour.

A retry policy that does not understand failure semantics is not resilience.

It is optimism with overhead.

Spring Retry provides the foundation.

Adding a semantic layer on top makes retry behaviour readable, auditable, and difficult to misconfigure.

In a multi-service codebase, that clarity is usually the difference between retry logic you trust and retry logic you tolerate.
