Essay

Embodied Autonomy Requires Multi-Time-Scale Learning

A research manifesto: Multi-Level Machine Learning, (ML)², as a working name

Elvin Hajizada


The claim

The dominant trajectory of AI research treats learning and deployment as separate phases: pre-train on a vast corpus, freeze the weights, ship the artifact. For agents that live in static, predictable environments (chat windows, retrieval pipelines, code completion), this division has been spectacularly successful. For agents that have bodies, it has not. A drone inspecting a wind turbine, a manipulator placing an object on a cluttered shelf: neither system can assume that the world at deployment looks like the world at training. The mismatch is not a footnote. On a benchmark like Cityscapes → Foggy Cityscapes [1], [2], an oracle trained on the target distribution reaches 43.1 mAP; a source-only model collapses to 25.2 [3]. That gap of roughly eighteen points (17.9 mAP) is the everyday cost of treating learning as a one-time event.

This article argues that the next frontier in embodied AI is not larger pre-trained models but learning that occurs on multiple time scales simultaneously: millisecond-scale fast reactive updates, seconds-scale test-time adaptation, minutes-to-hours-scale continual learning, and the days-to-years-scale consolidation we usually call continual pre-training. Designing systems around a single time scale is the structural reason embodied autonomy keeps stalling at the demo-to-deployment gap. The jump from single-time-scale to multi-time-scale learning is qualitative, not incremental.

Why current paradigms leave a gap

Foundation models, and their robotic descendants (vision-language-action, or VLA, models), are necessary but not sufficient for embodied autonomy. The case in their favor is well-known and I will not relitigate it: large-scale pre-training yields priors that generalize far better than anything we knew how to build five years ago. The case against treating them as the whole answer is also straightforward. Pre-trained priors are static. The world a deployed agent inhabits is not. When the gap between training distribution and deployment distribution is small, the prior dominates and everything looks fine. When the gap grows (new lighting, new weather, a degraded sensor, a novel object, a slightly different gripper), the prior fails predictably and the agent has no recourse [4].

The thesis here is that we should stop accepting this as the default. Embodied systems should be designed from the start as multi-time-scale learners, with the pre-trained prior as one component rather than the whole system. In some settings (manipulation built on top of a VLA backbone [5]) this looks complementary: the VLA provides the prior, adaptation lives on top. In others (low-SWAP, that is size-, weight-, and power-constrained, perception at the edge, drones in degraded weather, robotics in tight power envelopes) the prior is heavy and the right architecture may be largely orthogonal: small representations that adapt aggressively, supported only sparingly by an off-board large model. The framework does not pick a single answer; it makes the research question explicit.

The wedge: minutes-to-hours continual learning

Figure 1. Four learning time scales for embodied agents. Test-time adaptation at the seconds scale is deployable today; pre-training and consolidation at the days-to-years scale is well-trodden; fast updates at the millisecond scale are an established primitive. Continual learning at minutes to hours is where deployable progress has lagged.

Among the four time scales, the most mature is test-time adaptation at the seconds scale. With BatchNorm-statistics updates and a growing ecosystem of TTA/TTT methods (canonically Tent [6], with extensions to non-stationary deployment streams [7]), deployable wins are achievable today, and this is where any near-term commercial story should anchor. But TTA is only the start.
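As a minimal sketch of the cheapest seconds-scale mechanism mentioned above, BatchNorm-statistics adaptation, the following pure-Python example re-estimates per-feature normalization statistics from unlabeled deployment batches with an exponential moving average. The class name `AdaptiveNorm`, the `momentum` value, and the toy data are illustrative assumptions. This is not Tent itself, which additionally minimizes prediction entropy with respect to the normalization layers' affine parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdaptiveNorm:
    """Per-feature normalizer whose statistics can track the test stream."""
    mean: List[float]
    var: List[float]
    momentum: float = 0.1  # assumed value; larger tracks shifts faster
    eps: float = 1e-5

    def adapt(self, batch: List[List[float]]) -> None:
        """Blend stored statistics with those of an unlabeled batch."""
        n = len(batch)
        for j in range(len(self.mean)):
            col = [row[j] for row in batch]
            b_mean = sum(col) / n
            b_var = sum((x - b_mean) ** 2 for x in col) / n
            self.mean[j] = (1 - self.momentum) * self.mean[j] + self.momentum * b_mean
            self.var[j] = (1 - self.momentum) * self.var[j] + self.momentum * b_var

    def normalize(self, row: List[float]) -> List[float]:
        return [(x - m) / (v + self.eps) ** 0.5
                for x, m, v in zip(row, self.mean, self.var)]


# Source statistics: zero mean, unit variance. Deployment data is shifted by +3.
norm = AdaptiveNorm(mean=[0.0], var=[1.0])
shifted_batch = [[2.0], [3.0], [4.0]]
before = abs(norm.normalize([3.0])[0])   # far from zero under source statistics
for _ in range(50):
    norm.adapt(shifted_batch)
after = abs(norm.normalize([3.0])[0])    # close to zero once statistics have adapted
```

No labels and no gradient steps are involved, which is part of why this family of methods is deployable today; the price is that it can only correct shifts that show up in feature statistics.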

The wedge for the next decade is the time scale above: continual learning at minutes to hours. This is the regime where an agent is online long enough to accumulate genuine new evidence, but not so long that we can afford to ship the data back to a training cluster. It is the regime where the open problems are sharpest (catastrophic forgetting, online evaluation under non-stationarity, replay design, when to consolidate and when to forget) and where, despite roughly a decade of effort since the modern continual-learning (CL) literature took shape with Learning without Forgetting (LwF) and Elastic Weight Consolidation (EWC), deployable progress has lagged both TTA at the seconds scale and offline retraining cycles at the days-to-weeks scale. If this article has a single main claim, it is that minutes-to-hours continual learning is the time scale we cannot afford to keep avoiding, and that progress here will unlock the rest of the stack.
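To make "replay design" concrete, here is a minimal sketch of one common baseline for online continual learning: a fixed-capacity replay memory filled by reservoir sampling, so that every item seen so far has the same chance of being retained without the stream ever being stored. The class name `ReservoirBuffer`, the capacity, and the integer "frames" are hypothetical choices for illustration, not a design taken from any cited work.

```python
import random
from typing import Generic, List, TypeVar

T = TypeVar("T")


class ReservoirBuffer(Generic[T]):
    """Fixed-capacity replay memory holding a uniform sample of the stream."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.items: List[T] = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, item: T) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item

    def sample(self, k: int) -> List[T]:
        """Draw a rehearsal mini-batch without replacement."""
        return self._rng.sample(self.items, min(k, len(self.items)))


# Stream 10,000 hypothetical frame ids through a 32-slot memory.
buffer: ReservoirBuffer[int] = ReservoirBuffer(capacity=32)
for frame_id in range(10_000):
    buffer.add(frame_id)
replay_batch = buffer.sample(8)
```

Memory stays constant however long the mission runs, which is the property that matters at the minutes-to-hours scale. What to store (raw inputs, features, or labels) and when to evict deliberately rather than uniformly are the open design questions.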

Two anchor settings

The wedge above is a claim about where progress will pay off most. Two settings make the strongest near-term case for it, for different structural reasons that map onto the dichotomy raised earlier.

Manipulation, anchored on VLA and world-model backbones. This is the complementary side of the dichotomy: a heavy pretrained prior is the right starting point, and adaptation lives on top of it. LIBERO [8] and its successors have become the de facto testbed for embodied AI in 2026. The promise of VLA models is real; their failure modes (novel object geometries, unfamiliar grippers, distribution shifts in scene clutter) are also real and reproducible, with recent benchmarks quantifying 30–50% single-axis accuracy degradation and over 75% under compound perturbations [9]. Test-time and continual adaptation on top of these backbones is currently underdeveloped relative to its commercial value, and the gap is closable with the right co-design of algorithms and hardware.

Aerial inspection of civil infrastructure. Energy grids, agriculture, pipelines, bridges, wind turbines, and telecom towers are all critical infrastructure that requires persistent inspection. This is the small-and-adaptive side of the dichotomy: a heavy pretrained prior is a poor fit on a SWAP-constrained platform, and lightweight representations that adapt aggressively on-device are likely the right architecture. These platforms are deployed in hours-long missions and routinely face the exact distribution shifts that break frozen perception: fog, dusk, dust, snow, novel defect classes, sensor drift. The market is large, growing, and ethically defensible. No publicly documented production drone-inspection system currently performs meaningful on-device adaptation; existing operators rely on offline retraining cycles and cloud-side model updates. The gap between what those stacks deliver and what these missions actually need is wide and quantifiable.

Two further settings extend the framework but are not where its case is best made first: autonomous-driving perception, where the strongest TTA literature lives, and humanoids, where, as of this writing, most of the funding and hype sit.

Hardware co-design as a first-class concern

Minutes-to-hours continual learning is hardware-bound, not just algorithm-bound. Standard GPU training pipelines are optimized for batched, offline updates; running gradient steps at deployment, on a 25-watt edge accelerator, hitting millisecond latencies, is a structurally different problem. Recent measurements confirm the gap is large: even simple last-layer continual updates on edge-GPU baselines run substantially slower and consume substantially more energy than what SWAP-constrained robotic platforms can absorb at deployment rates [10]. Efficient designs within the conventional-GPU regime do exist: HAMLET, for example, reaches segmentation accuracy comparable to CoTTA at ~13× lower compute and ~50× higher framerate by choosing what to update and when [11]. But these are improvements within the regime, not a change of regime. Hardware-algorithm co-design has to be a first-class concern, not an afterthought.

What deployable continual adaptation actually requires is a small, identifiable set of algorithmic primitives: (1) local credit assignment rather than end-to-end backprop, (2) spatiotemporally sparse computation in both activations and weight updates, (3) capacity-controlled mechanisms for lifelong growth such as neurogenesis, (4) plasticity regularization that controls which weights update, when they update, and by how much, and (5) modular representations that admit targeted, localized updates rather than diffuse ones. These primitives are biologically motivated, but the case for them is hardware-economic. They shorten credit-assignment depth, eliminate the need to store full activation maps for backward passes, admit asynchronous and event-driven execution, and map onto memory-and-compute layouts that avoid the cost of repeated long-range memory traffic.
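Three of these primitives (local credit assignment, sparse weight updates, and plasticity regularization) can be shown together in a few lines. The sketch below applies a delta-rule step to a single linear unit, touches only the `k` weights with the largest proposed change, and scales each change by a per-weight plasticity factor. The function name, the learning rate, and the toy numbers are assumptions made for illustration; this is not the learning rule used in [10].

```python
from typing import List


def local_sparse_update(
    weights: List[float],
    plasticity: List[float],
    x: List[float],
    target: float,
    lr: float = 0.1,
    k: int = 2,
) -> float:
    """One delta-rule step on a single linear unit; returns the pre-update error.

    Local: the update uses only this unit's input and its own output error.
    Sparse: only the k weights with the largest proposed change are touched.
    Gated: per-weight plasticity in [0, 1] scales (or freezes) each change.
    """
    y = sum(w * xi for w, xi in zip(weights, x))
    error = target - y
    proposed = [lr * error * xi * p for xi, p in zip(x, plasticity)]
    top_k = sorted(range(len(weights)),
                   key=lambda i: abs(proposed[i]), reverse=True)[:k]
    for i in top_k:
        weights[i] += proposed[i]
    return error


weights = [0.0, 0.0, 0.0, 0.0]
plasticity = [1.0, 1.0, 0.0, 1.0]   # weight 2 is frozen (protected knowledge)
x = [1.0, 0.5, 2.0, 0.1]
errors = [abs(local_sparse_update(weights, plasticity, x, target=1.0))
          for _ in range(30)]
```

In this toy run the error shrinks even though the weight with the largest input (index 2) is never allowed to move and only two weights are written per step. No activations from other layers are stored and no backward pass is needed, which is the hardware-economic point of the paragraph above.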

Concrete demonstrations of this co-design already exist. In my own work [10], an implementation built around these primitives and deployed on neuromorphic hardware (Loihi 2) matched or exceeded the accuracy of replay and non-replay continual-learning baselines on the same benchmark while delivering roughly two orders of magnitude lower per-update latency and four orders of magnitude lower per-update energy than the edge-GPU baseline. HAMLET-style efficiency gains within the conventional-GPU regime [11] are not a substitute for this change of regime; they are a complement to it. The conclusion is not that neuromorphic chips will win, nor is this program predicated on their winning. The conclusion is that the algorithmic regime (local, sparse, event-driven, low-precision) is what makes on-device continual learning feasible at scale, and the substrate is then chosen empirically from a widening field: near-memory dataflow accelerators (Tenstorrent, Axelera), in-memory compute fabrics, neuromorphic platforms (Loihi-class, SpiNNaker), or even edge GPUs with appropriately co-designed kernels and operator libraries.

Figure 2. Closed algorithmic regime over open substrate. The primitives needed to make on-device continual adaptation feasible are fixed by deployment economics. The hardware substrate that hosts them is chosen empirically from a widening field of edge accelerators.

Safety, drift, and the governance of online adaptation

A system that learns at deployment is a system whose behavior is no longer fully specified by its training pipeline. This is a real risk, and it is one of the reasons that mature industrial users (operators of safety-critical drones, automotive perception teams, medical robotics groups) have been slow to adopt online learning regardless of the algorithmic gains it could deliver. Governance has to be treated as a design constraint, not a compliance afterthought. Three primitives are necessary:

  1. Bounded adaptation envelopes: formal, auditable limits on which parameters can move, by how much, and over what time window, with versioned snapshots that allow rollback to a known-good state.
  2. Safe replay: the rehearsal pools that prevent catastrophic forgetting must themselves be auditable and resistant to adversarial contamination. A drifting replay buffer is a slow-motion failure mode.
  3. Asymmetric trust between perception and action: perception adaptation can in most settings be allowed to run with relatively loose constraints; policy adaptation should be gated by far stricter rules and, where stakes are high, by explicit human approval.
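
The first and third primitives can be sketched together: a wrapper that clips every requested update to a per-parameter drift bound around the last known-good snapshot, logs what was requested and what was applied, and supports rollback. Everything here (the class name `AdaptationEnvelope`, the parameter names, the bounds) is hypothetical, and the time-window limit from item 1 is left out for brevity.

```python
from typing import Dict, List


class AdaptationEnvelope:
    """Clips parameter drift to a per-parameter bound around a known-good snapshot."""

    def __init__(self, params: Dict[str, float], max_drift: Dict[str, float]) -> None:
        self.params = dict(params)
        self.max_drift = dict(max_drift)     # parameters absent here are frozen
        self.snapshots: List[Dict[str, float]] = [dict(params)]
        self.audit_log: List[str] = []

    def apply(self, name: str, delta: float) -> float:
        """Apply an update clipped to the envelope; return the delta actually applied."""
        bound = self.max_drift.get(name, 0.0)
        anchor = self.snapshots[-1][name]
        proposed = self.params[name] + delta
        clipped = max(anchor - bound, min(anchor + bound, proposed))
        applied = clipped - self.params[name]
        self.params[name] = clipped
        self.audit_log.append(f"{name}: requested {delta:+.3f}, applied {applied:+.3f}")
        return applied

    def commit(self) -> None:
        """Promote the current parameters to a new versioned snapshot."""
        self.snapshots.append(dict(self.params))

    def rollback(self) -> None:
        """Restore the most recent known-good snapshot."""
        self.params = dict(self.snapshots[-1])
        self.audit_log.append("rollback")


env = AdaptationEnvelope(
    params={"perception.gain": 1.0, "policy.gain": 1.0},
    max_drift={"perception.gain": 0.5},      # policy.gain has no entry: frozen
)
applied_perception = env.apply("perception.gain", 0.8)   # clipped to the bound
applied_policy = env.apply("policy.gain", 0.2)           # rejected, stays at snapshot
```

Giving the perception parameter a loose bound and the policy parameter none is one simple way to encode asymmetric trust; in a real system, widening the policy bound would be the step gated by human approval.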

No deployment story for online learning in safety-critical embodied systems is credible without these. They are also not yet standard practice, and that gap is itself a research opportunity.

What this framework does not yet solve

Four open problems are worth naming honestly.

The consolidation problem. We have no theory of when an online update should be promoted from volatile short-term memory to permanent weights, only heuristics about replay frequency and learning-rate schedules. The biological analogues (sleep-driven consolidation, fast-vs-slow synapse dynamics) are suggestive but not yet operational; published two-system fast/slow CL architectures (e.g., DualNet [12]) point in the right direction but stop short of the four-time-scale view.

Online evaluation under non-stationarity. When the world is changing, how do we know whether the model is improving or drifting? Held-out test sets fail by construction. The right answer probably involves continuously sampled probes and uncertainty-based triggers, but the field has not converged.
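
One plausible shape for an uncertainty-based trigger, offered as a sketch rather than an answer to the open problem: track the mean prediction entropy over a short window and fire when it exceeds a baseline measured in-distribution by some margin. The class name `DriftTrigger`, the margin, the window length, and the two example distributions are all assumptions for illustration.

```python
import math
from collections import deque
from typing import Deque, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class DriftTrigger:
    """Fires when mean recent entropy exceeds a calibrated baseline by a margin."""

    def __init__(self, baseline: float, margin: float = 0.2, window: int = 5) -> None:
        self.baseline = baseline            # mean entropy measured in-distribution
        self.margin = margin                # assumed tolerance, in nats
        self.recent: Deque[float] = deque(maxlen=window)

    def observe(self, probs: List[float]) -> bool:
        self.recent.append(entropy(probs))
        window_full = len(self.recent) == self.recent.maxlen
        mean_recent = sum(self.recent) / len(self.recent)
        return window_full and mean_recent > self.baseline + self.margin


confident = [0.97, 0.02, 0.01]   # typical in-distribution prediction
uncertain = [0.40, 0.35, 0.25]   # typical prediction after a shift
trigger = DriftTrigger(baseline=entropy(confident))
fired = [trigger.observe(confident) for _ in range(5)]
fired += [trigger.observe(uncertain) for _ in range(5)]
```

Entropy is only a proxy: it rises on genuinely hard inputs as well as on drift, and a model can be confidently wrong after a shift. That is one reason a trigger like this would need to be paired with the continuously sampled probes mentioned above.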

Credit-assignment depth. Last-layer updates are cheap and well-understood. Pushing updates deeper, when the output-layer fix is insufficient, without destabilizing what came before, remains open. It is where local learning rules, modular representations, and meta-learning for continual learning (e.g., OML [13]) have to meet, and none of those communities have closed the loop yet.

The human-in-the-loop interface. Most plausible near-term deployments involve a human supervisor approving or correcting on-device updates. We have essentially no good UX for this, and no good evaluation of how much human attention each layer of adaptation actually demands.

Closing

The shortest version of the thesis: real-world AI will need to learn all the time. The framework is general; the wedge is continual learning at the minutes-to-hours scale; the substrate is whatever edge accelerator the application supports.


References

[1] M. Cordts et al., “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in CVPR, 2016.

[2] C. Sakaridis, D. Dai, and L. Van Gool, “Semantic Foggy Scene Understanding with Synthetic Data,” IJCV, vol. 126, pp. 973–992, 2018.

[3] V. VS, P. Oza, and V. M. Patel, “Towards Online Domain Adaptive Object Detection,” in WACV, 2023, pp. 478–488.

[4] D. Hendrycks and T. Dietterich, “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations,” in ICLR, 2019.

[5] M. J. Kim et al., “OpenVLA: An Open-Source Vision-Language-Action Model,” in CoRL, 2024.

[6] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully Test-Time Adaptation by Entropy Minimization,” in ICLR, 2021.

[7] Q. Wang, O. Fink, L. Van Gool, and D. Dai, “Continual Test-Time Domain Adaptation (CoTTA),” in CVPR, 2022.

[8] B. Liu et al., “LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning,” in NeurIPS Datasets and Benchmarks Track, 2023.

[9] W. Pumacay, I. Singh, J. Duan, R. Krishna, J. Thomason, and D. Fox, “THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation,” in RSS, 2024.

[10] E. Hajizada et al., “Online Continual Learning on Intel Loihi 2 via a Co-designed Spiking Neural Network,” arXiv:2511.01553, 2025.

[11] M. Colomer, P. Z. Ramirez, M. Poggi, F. Tosi, S. Mattoccia, and L. Di Stefano, “To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation (HAMLET),” in ICCV, 2023.

[12] Q. Pham, C. Liu, and S. C. H. Hoi, “DualNet: Continual Learning, Fast and Slow,” in NeurIPS, 2021.

[13] K. Javed and M. White, “Meta-Learning Representations for Continual Learning,” in NeurIPS, 2019.
