[Bug]: attempt is incorrect in DurableLogData after lambda sandbox error #497

@rs-amp

Description

Expected Behavior

DurableLogData.attempt should provide an accurate attempt number when a step retries for any reason.

Actual Behavior

When AWS retries after any Sandbox.* invocation error, DurableLogData.attempt doesn't increment. The easiest example to trigger is Sandbox.Timedout, but I’ve also seen it happen with a segmentation fault.

A quick way to trigger this is a custom logger that keeps a reference to the durable logging context and exposes the attempt number through a getter that can be called within a step:

import type {
  DurableLogger,
  DurableLoggingContext,
} from "@aws/durable-execution-sdk-js";

// eslint-disable-next-line @typescript-eslint/no-explicit-any
type LoggerParams = any;
type DurableLogLevel = Parameters<NonNullable<DurableLogger["log"]>>[0];

export class AttemptNumberLogger implements DurableLogger {
  private context?: DurableLoggingContext;

  get attempt() {
    return this.context?.getDurableLogData().attempt ?? 1;
  }

  constructor(private baseLogger: DurableLogger) {}

  log?(level: `${DurableLogLevel}`, ...params: LoggerParams): void {
    this.baseLogger.log?.(level, ...params);
  }
  error(...params: LoggerParams): void {
    this.baseLogger.error(...params);
  }
  warn(...params: LoggerParams): void {
    this.baseLogger.warn(...params);
  }
  info(...params: LoggerParams): void {
    this.baseLogger.info(...params);
  }
  debug(...params: LoggerParams): void {
    this.baseLogger.debug(...params);
  }

  configureDurableLoggingContext?(
    durableLoggingContext: DurableLoggingContext,
  ): void {
    this.context = durableLoggingContext;

    this.baseLogger.configureDurableLoggingContext?.(durableLoggingContext);
  }
}

context.configureLogger({
  customLogger: new AttemptNumberLogger(context.logger),
});
//...
await context.step(
  'example',
  async (context) => {
    const attempt = (context.logger as AttemptNumberLogger).attempt - 1;

    // DurableLogData.attempt is always 1 here after a sandbox retry,
    // so this zero-based value stays 0

    while (true) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  },
  {
    retryStrategy:  (error, attemptCount) => { /* not called */ }
  }
);

My use case is reading the attempt number from the logger and using it in my own progress event reporting within the step (similar to logging, but more user-facing). Each event has a unique sequence number, which is automatically bumped when the attempt number is greater than 1 to leave room for events fired during the previous failed attempt.

This works fine when the error is handled within the Lambda function (the attempt number increments on each retry), but after any sandbox error the attempt number stays at 1, no matter how many times the step has been retried. As a result, the sequence numbers in my reported events are duplicated.
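
To make the duplication concrete, here is a minimal sketch of the kind of sequence numbering described above. It is not my actual reporting code: `createSequencer` and `EVENTS_PER_ATTEMPT` are hypothetical names, and the block size of 1000 is an arbitrary assumption.

```typescript
// Hypothetical sketch: these names and the block size are illustrative only
// and are not part of @aws/durable-execution-sdk-js.
const EVENTS_PER_ATTEMPT = 1000; // assumed upper bound on events per attempt

/** Returns a function that hands out sequence numbers for one attempt (1-based). */
function createSequencer(attempt: number): () => number {
  // Each attempt gets its own block of numbers, leaving room for events
  // that already fired during earlier failed attempts.
  let next = (attempt - 1) * EVENTS_PER_ATTEMPT;
  return () => next++;
}

// Retry after a handled JavaScript error: attempt goes 1 -> 2, blocks are disjoint.
const firstAttempt = createSequencer(1);
const secondAttempt = createSequencer(2);
console.log(firstAttempt(), firstAttempt());   // 0 1
console.log(secondAttempt(), secondAttempt()); // 1000 1001

// Retry after a sandbox error: attempt is still reported as 1, so numbers repeat.
const afterSandboxRetry = createSequencer(1);
console.log(afterSandboxRetry()); // 0, same as the first event of the first attempt
```

With an accurate attempt number the blocks never overlap; with the attempt stuck at 1, every sandbox retry restarts at the same sequence number.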

Somewhat related: I also noticed that function timeouts are retried many times, using a default backoff timing strategy. It would be nice if the failure surfaced in the next replay and followed the step's retry strategy, though I understand that it might not be clear whether the sandbox error happened within a particular step or outside of it.

Steps to Reproduce

  1. Register the custom logger to expose the attempt number (code above).
  2. Within a step, check the attempt number.
  3. Deliberately cause a Sandbox.Timedout error, ideally without blocking the event loop, as that can cause other problems.
  4. Observe that the attempt number from the custom logger is always 1.
  5. Optionally, change the example to throw an Error to verify that the attempt number does increase when a step is retried after a JavaScript error.
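
For steps 3 and 5, a small helper along these lines can switch between the two failure modes from inside the step body. This is only a sketch: `simulateFailure`, `FailureMode`, and the `maxTicks` parameter are hypothetical names introduced here, and `maxTicks` exists purely so the stall can be bounded when trying the helper locally.

```typescript
// Hypothetical repro helper; not part of @aws/durable-execution-sdk-js.
type FailureMode = "sandbox-timeout" | "js-error";

/**
 * "js-error": rejects immediately, so the step fails with an ordinary Error.
 * "sandbox-timeout": awaits short timers in a loop, which keeps the step pending
 * without blocking the event loop. With the default maxTicks (Infinity) it never
 * settles, so the function would run until the Lambda timeout.
 */
async function simulateFailure(
  mode: FailureMode,
  intervalMs = 100,
  maxTicks = Infinity,
): Promise<never> {
  if (mode === "js-error") {
    throw new Error("deliberate step failure");
  }
  for (let tick = 0; tick < maxTicks; tick++) {
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  // Only reachable when maxTicks is finite (local experiments).
  throw new Error("stall ended before the function timed out");
}
```

Inside the step from the example above, `await simulateFailure("sandbox-timeout")` would replace the `while (true)` loop, and `await simulateFailure("js-error")` gives the comparison case where the attempt number does increment.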

SDK Version

1.0.2

Node.js Version

22.x

Is this a regression?

No

Last Working Version

No response

Additional Context

No response
