Node Retry Policies

puppeteeR has two independent retry mechanisms:

Mechanism	What it retries	Where it is configured
Agent-level (`max_retries`, `retry_wait`)	A single LLM API call inside `agent$chat()`	`agent()` / `Agent$new()`
Node-level (`retry_policy()`)	An entire node function execution	`add_node(..., retry = ...)`

Agent-level retries handle transient API errors automatically for every LLM call. Node-level retry policies are for nodes that are themselves unreliable: nodes that make an external HTTP call, run a database query, or perform any other fallible operation beyond calling an LLM.

Creating a retry policy

retry_policy() takes three arguments:

# Up to 3 attempts in total (1 initial attempt + 2 retries), waiting 1 second between attempts
rp <- retry_policy(max_attempts = 3L, wait_seconds = 1, backoff = 1)
rp
#> $max_attempts
#> [1] 3
#> 
#> $wait_seconds
#> [1] 1
#> 
#> $backoff
#> [1] 1
#> 
#> attr(,"class")
#> [1] "retry_policy"

Argument	Type	Default	Meaning
`max_attempts`	integer ≥ 2	`3L`	Total number of attempts including the first
`wait_seconds`	number ≥ 0	`1`	Seconds to wait before the first retry
`backoff`	number > 0	`1`	Multiplier applied to `wait_seconds` after each failure

With backoff = 1 the wait is constant. With backoff = 2 the wait doubles each time: 1 s → 2 s → 4 s → …
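
The schedule described above can be worked out in a few lines of base R. This is a minimal sketch, assuming the wait before retry k is wait_seconds * backoff^(k - 1); wait_schedule() is a hypothetical helper for illustration, not a puppeteeR function.

```r
# Hypothetical helper (not part of puppeteeR): the waits between attempts
# implied by a policy, assuming wait k = wait_seconds * backoff^(k - 1).
wait_schedule <- function(max_attempts = 3L, wait_seconds = 1, backoff = 1) {
  n_waits <- max_attempts - 1L   # one wait between each pair of attempts
  wait_seconds * backoff^(seq_len(n_waits) - 1L)
}

wait_schedule(max_attempts = 4L, wait_seconds = 1, backoff = 2)
#> [1] 1 2 4

wait_schedule(max_attempts = 3L, wait_seconds = 1, backoff = 1)
#> [1] 1 1

# Total time spent waiting if every attempt fails (node run time excluded)
sum(wait_schedule(max_attempts = 4L, wait_seconds = 1, backoff = 2))
#> [1] 7
```

Summing the schedule gives a quick upper bound on how much delay a policy can add to a run before the error finally surfaces.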

Basic usage

Pass a retry_policy to add_node() via the retry argument:

# Simulate a node that fails twice then succeeds
attempt_n <- 0L

flaky_node <- function(state, config) {
  attempt_n <<- attempt_n + 1L
  if (attempt_n < 3L) stop("transient error")
  list(result = paste0("succeeded on attempt ", attempt_n))
}

schema <- workflow_state(result = list(default = ""))

runner <- state_graph(schema) |>
  add_node(
    "fetch",
    flaky_node,
    retry = retry_policy(max_attempts = 3L, wait_seconds = 0)
  ) |>
  add_edge(START, "fetch") |>
  add_edge("fetch", END) |>
  compile()

final <- runner$invoke()
final$get("result")
#> [1] "succeeded on attempt 3"
attempt_n
#> [1] 3

The node failed twice and succeeded on the third attempt. From the caller’s perspective, invoke() returned normally — the retries were invisible.
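
To make the behaviour above concrete, here is a rough base-R sketch of what a retry loop of this kind might look like. It is not puppeteeR's actual implementation: with_retry() and its error message are assumptions made for illustration, and only the argument names mirror retry_policy().

```r
# Hypothetical sketch of a retry loop; NOT puppeteeR's implementation.
with_retry <- function(fn, max_attempts = 3L, wait_seconds = 1, backoff = 1) {
  wait <- wait_seconds
  for (attempt in seq_len(max_attempts)) {
    # Run the function once and record whether it succeeded
    result <- tryCatch(
      list(ok = TRUE, value = fn()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) return(result$value)
    if (attempt < max_attempts) {
      Sys.sleep(wait)          # pause before the next attempt
      wait <- wait * backoff   # scale the wait after each failure
    }
  }
  # All attempts failed: surface the last error together with the attempt count
  stop(
    sprintf(
      "failed after %d attempt(s): %s",
      max_attempts, conditionMessage(result$error)
    ),
    call. = FALSE
  )
}

# A function that fails twice, then succeeds
calls <- 0L
flaky <- function() {
  calls <<- calls + 1L
  if (calls < 3L) stop("transient error")
  paste0("succeeded on attempt ", calls)
}

with_retry(flaky, max_attempts = 3L, wait_seconds = 0)
#> [1] "succeeded on attempt 3"
```

The caller only ever sees the final value or the final error, which is why retries that eventually succeed are invisible from outside.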

Exhausting all attempts

When all attempts fail, the runner raises an error whose message names the node, states the attempt count, and includes the last underlying error:

always_fail <- function(state, config) stop("permanent failure")

schema2 <- workflow_state(x = list(default = 0L))

runner2 <- state_graph(schema2) |>
  add_node(
    "broken",
    always_fail,
    retry = retry_policy(max_attempts = 3L, wait_seconds = 0)
  ) |>
  add_edge(START, "broken") |>
  add_edge("broken", END) |>
  compile()

tryCatch(
  runner2$invoke(),
  error = function(e) message(conditionMessage(e))
)
#> Node "broken" failed after 3 attempt(s).
#> ✖ permanent failure
#> Caused by error in `node_fn()`:
#> ! permanent failure

Exponential backoff

For external API calls, exponential backoff reduces load on a struggling service. Set backoff = 2 to double the wait after each failure:

runner <- state_graph(workflow_state(data = list(default = ""))) |>
  add_node(
    "call_api",
    function(state, config) {
      response <- httr2::request("https://api.example.com/data") |>
        httr2::req_perform() |>
        httr2::resp_body_string()
      list(data = response)
    },
    retry = retry_policy(max_attempts = 4L, wait_seconds = 1, backoff = 2)
    # Waits: 1s → 2s → 4s between attempts
  ) |>
  add_edge(START, "call_api") |>
  add_edge("call_api", END) |>
  compile()

Node retry vs. agent-level retry

The two mechanisms target different failure modes and compose independently:

library(ellmer)

# Agent-level: retries the LLM API call if the HTTP request fails
researcher <- agent(
  "researcher",
  chat_anthropic(),
  max_retries = 3L,    # 4 total LLM call attempts
  retry_wait  = 5      # 5 seconds between LLM call retries
)

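# Node-level: retries the entire node function if it throws any error.
# `fetch_external_data()` is a placeholder for your own data-fetching function.
# The schema declares `query` because the node reads it with state$get("query").
runner <- state_graph(workflow_state(
  query  = list(default = ""),
  result = list(default = "")
)) |>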
  add_node(
    "research",
    function(state, config) {
      # This node does two things: fetch data from an external API, then call the LLM.
      # If either fails, the node-level policy retries the whole thing.
      raw <- fetch_external_data(state$get("query"))   # may fail with network errors
      result <- config$agents$researcher$chat(raw)     # may fail with API errors (caught by agent retry first)
      list(result = result)
    },
    retry = retry_policy(max_attempts = 2L, wait_seconds = 30)
  ) |>
  add_edge(START, "research") |>
  add_edge("research", END) |>
  compile(agents = list(researcher = researcher))

When the node runs:

fetch_external_data() fails → node-level retry fires after 30 s.
researcher$chat() fails → agent-level retry fires (up to 4 attempts, 5 s apart). If all agent retries fail, the error propagates out of $chat() and the node-level policy retries the whole node.

The two policies stack: agent-level retry guards individual LLM calls; node-level retry guards the entire operation including any surrounding code.
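
Because the policies stack, their worst cases multiply. The arithmetic below is a back-of-the-envelope sketch using the settings from the example above. It assumes that every LLM call fails, that fetch_external_data() succeeds, that the agent-level wait is constant, and that the node policy uses the default backoff of 1. The variable names are illustrative only.

```r
# Settings taken from the example above
agent_max_retries <- 3L   # agent-level: retries after the first LLM call
agent_retry_wait  <- 5    # seconds between LLM call retries
node_max_attempts <- 2L   # node-level: total attempts, including the first
node_wait_seconds <- 30   # seconds before the node is retried

# Each node attempt can trigger up to max_retries + 1 LLM calls
llm_calls_per_node_attempt <- agent_max_retries + 1L
max_llm_calls <- node_max_attempts * llm_calls_per_node_attempt
max_llm_calls
#> [1] 8

# Time spent waiting only (call durations excluded)
agent_wait_per_node_attempt <- agent_max_retries * agent_retry_wait
max_wait <- node_max_attempts * agent_wait_per_node_attempt +
  (node_max_attempts - 1L) * node_wait_seconds
max_wait
#> [1] 60
```

Keeping the node-level max_attempts small is a reasonable default when the agent already retries, since each extra node attempt repeats the full agent-level budget.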

Disabling retries for a node

Omit the retry argument (or pass NULL) to run a node without any retry logic. In that case a single failure immediately propagates as an error from invoke():

state_graph(schema) |>
  add_node("fast_fail", function(state, config) stop("oops"))  # retry = NULL (default)

Combining retry policy with checkpointing

Node-level retries happen within a single invoke() call — the checkpointer does not see individual retry attempts, only the final outcome of each node. This means:

A successful retry → checkpoint is saved as if the node succeeded first time.
All retries exhausted → the error propagates; no checkpoint is written for that node.

On the next invoke() with the same thread_id, the runner resumes from the last successful checkpoint and retries the failing node from scratch.

# `fetch_from_api()` and `my_agent` are placeholders for your own fetch function and agent
cp <- rds_checkpointer(path = "checkpoints/")

schema <- workflow_state(
  data   = list(default = ""),
  result = list(default = "")
)

runner <- state_graph(schema) |>
  add_node("fetch", function(state, config) {
    if (nzchar(state$get("data"))) return(list())   # idempotent: skip if already fetched
    list(data = fetch_from_api())                   # may fail; checkpoint saved after success
  }) |>
  add_node(
    "process",
    function(state, config) {
      list(result = config$agents$llm$chat(state$get("data")))
    },
    retry = retry_policy(max_attempts = 3L, wait_seconds = 5, backoff = 2)
  ) |>
  add_edge(START, "fetch") |>
  add_edge("fetch", "process") |>
  add_edge("process", END) |>
  compile(agents = list(llm = my_agent), checkpointer = cp)

tryCatch(
  runner$invoke(config = list(thread_id = "job-1")),
  error = function(e) message("Failed: ", conditionMessage(e))
)

# On re-run, fetch is skipped (idempotent guard) and process is retried fresh
runner$invoke(config = list(thread_id = "job-1"))

Validation

retry_policy() validates its arguments eagerly:

retry_policy(max_attempts = 1L)    # must be >= 2
#> Error in `retry_policy()`:
#> ! `max_attempts` must be an integer >= 2.

retry_policy(wait_seconds = -1)    # must be non-negative
#> Error in `retry_policy()`:
#> ! `wait_seconds` must be a non-negative number.

retry_policy(backoff = 0)          # must be positive
#> Error in `retry_policy()`:
#> ! `backoff` must be a positive number.

Passing a non-retry_policy object to add_node() is also rejected:

schema3 <- workflow_state(x = list(default = 0L))
state_graph(schema3) |>
  add_node("n", function(state, config) list(), retry = list(max_attempts = 3L))
#> Error in `check_is_retry_policy()`:
#> ! `retry` must be a <retry_policy> object or "NULL", not a list.
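
Because retry_policy() stops at the first invalid argument, it can be handy to pre-check values that come from a config file or environment variables and report every problem at once. The helper below is a hypothetical sketch that mirrors the documented constraints; check_retry_args() is not part of puppeteeR, and treating whole-valued doubles such as 3 as valid integers is an assumption.

```r
# Hypothetical pre-check mirroring the documented constraints; not puppeteeR code.
is_number <- function(x) is.numeric(x) && length(x) == 1L && !is.na(x)

check_retry_args <- function(max_attempts, wait_seconds, backoff) {
  problems <- character()
  # max_attempts: whole number >= 2 (assumption: 3 and 3L are both accepted here)
  if (!is_number(max_attempts) ||
      max_attempts != round(max_attempts) ||
      max_attempts < 2) {
    problems <- c(problems, "max_attempts must be an integer >= 2")
  }
  # wait_seconds: non-negative number
  if (!is_number(wait_seconds) || wait_seconds < 0) {
    problems <- c(problems, "wait_seconds must be a non-negative number")
  }
  # backoff: strictly positive number
  if (!is_number(backoff) || backoff <= 0) {
    problems <- c(problems, "backoff must be a positive number")
  }
  problems   # character(0) means all three values look valid
}

check_retry_args(max_attempts = 3L, wait_seconds = 1, backoff = 2)
#> character(0)

check_retry_args(max_attempts = 1L, wait_seconds = -1, backoff = 0)
#> [1] "max_attempts must be an integer >= 2"      
#> [2] "wait_seconds must be a non-negative number"
#> [3] "backoff must be a positive number"
```

Returning a vector of messages rather than throwing lets a configuration loader show all problems in a single pass; retry_policy() itself remains the authoritative check.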