Models — Elvin Hajizada

Embodied autonomy requires multi-time-scale learning

The dominant trajectory of AI treats learning and deployment as separate phases: pre-train on a vast corpus, freeze the weights, ship the artifact. For agents that live in static, predictable environments (chat windows, retrieval pipelines, code completion) this division has been spectacularly successful. For agents that have bodies, it has not. A drone inspecting a wind turbine, a manipulator placing an object on a cluttered shelf: neither can assume that the world at deployment looks like the world at training.

The next frontier in embodied AI is not a larger pre-trained model but learning that happens on many time scales at once, from milliseconds to years.

Designing systems around a single time scale is the structural reason embodied autonomy keeps stalling at the demo-to-deployment gap. The jump from single-time-scale to multi-time-scale learning is qualitative, not incremental.

Why current paradigms leave a gap

Foundation models, and their robotic descendants (vision-language-action models), are necessary but not sufficient for embodied autonomy. Large-scale pre-training yields priors that generalize far better than anything we knew how to build five years ago. But those priors are static, and the world a deployed agent inhabits is not. When the gap between training and deployment is small, the prior dominates and everything looks fine. When it grows (new lighting, new weather, a degraded sensor, a novel object, a slightly different gripper) the prior fails predictably, and the agent has no recourse.

So I design embodied systems from the start as multi-time-scale learners, with the pre-trained prior as one component rather than the whole system.

The four time scales

Useful learning in an embodied agent happens at four overlapping rates: millisecond-scale fast updates, seconds-scale test-time adaptation, minutes-to-hours-scale continual learning, and the days-to-years-scale consolidation we usually call continual pre-training. Three of the four are in reasonable shape: fast updates are an established primitive, test-time adaptation at the seconds scale is deployable today, and consolidation at the slow end is well-trodden. The minutes-to-hours scale is the one that is not.
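To make the four rates concrete, here is a minimal sketch of a scheduler that fires each learning loop at its own period on a simulated clock. Everything in it is an illustrative assumption rather than a description of a real system: the names `LearningLoop`, `make_loops`, and `simulate` are hypothetical, and the periods (10 ms, 5 s, 10 min, 1 day) are placeholders picked from inside the ranges named above. Each update is just counted, so the sketch shows only how the scales interleave, not what any of them learns.

```python
from dataclasses import dataclass


@dataclass
class LearningLoop:
    """One learning process with a fixed nominal period (hypothetical)."""
    name: str
    period_ms: int
    updates: int = 0

    def maybe_update(self, now_ms: int) -> bool:
        # Fire when the clock lands on a multiple of this loop's period.
        if now_ms % self.period_ms == 0:
            self.updates += 1  # a real system would run its learning rule here
            return True
        return False


def make_loops():
    # Placeholder periods, one per time scale named in the text.
    return [
        LearningLoop("fast_update", 10),                # milliseconds
        LearningLoop("test_time_adaptation", 5_000),    # seconds
        LearningLoop("continual_learning", 600_000),    # minutes to hours
        LearningLoop("consolidation", 86_400_000),      # days to years
    ]


def simulate(loops, duration_ms: int, tick_ms: int):
    """Advance a simulated clock and count how often each loop fires."""
    for loop in loops:
        if loop.period_ms % tick_ms != 0:
            raise ValueError(f"tick_ms must divide the period of {loop.name}")
    # Start at the first tick, not at t=0, so no loop fires before time passes.
    for now_ms in range(tick_ms, duration_ms + tick_ms, tick_ms):
        for loop in loops:
            loop.maybe_update(now_ms)
    return {loop.name: loop.updates for loop in loops}


if __name__ == "__main__":
    # One simulated hour of deployment at a 10 ms tick.
    print(simulate(make_loops(), duration_ms=3_600_000, tick_ms=10))
    # {'fast_update': 360000, 'test_time_adaptation': 720,
    #  'continual_learning': 6, 'consolidation': 0}
```

Under these placeholder periods, an hour of deployment contains hundreds of thousands of fast updates and only a handful of continual-learning steps, while consolidation never fires on-device. That is one way to see why the four scales are unlikely to share a single learning rule or a single compute budget.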

The wedge: minutes-to-hours continual learning

Continual learning at minutes to hours is where deployable progress has lagged. It is the regime where an agent is online long enough to accumulate genuine new evidence, but not so long that we can afford to ship the data back to a training cluster. It is also where the open problems are sharpest: catastrophic forgetting, online evaluation under non-stationarity, replay design, and deciding when to consolidate and when to forget. If this framework has a single load-bearing claim, it is that minutes-to-hours continual learning is the time scale we cannot keep punting on, and that progress here unlocks the rest of the stack.
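As one concrete reference point for the replay-design question, here is a minimal sketch of a fixed-size replay buffer filled by reservoir sampling (Algorithm R). This is a common baseline from the continual-learning literature, swapped in for illustration; it is not presented as the method this framework proposes. The class name `ReservoirReplay` and its parameters are hypothetical. The property it illustrates is that memory stays bounded while every observation seen so far has the same probability of being in the buffer, which suits an agent that cannot ship its data off-device.

```python
import random


class ReservoirReplay:
    """Fixed-capacity replay buffer using reservoir sampling (Algorithm R)."""

    def __init__(self, capacity: int, seed: int = 0):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self.seen = 0                    # total observations offered so far
        self.rng = random.Random(seed)   # seeded for reproducibility

    def add(self, item) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
            return
        # Buffer full: keep the new item with probability capacity / seen,
        # overwriting a uniformly chosen slot.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item

    def sample(self, k: int):
        """Draw up to k stored items without replacement for a replay step."""
        return self.rng.sample(self.items, min(k, len(self.items)))


if __name__ == "__main__":
    buffer = ReservoirReplay(capacity=8)
    for step in range(1000):             # stand-in for a stream of observations
        buffer.add(step)
    print(len(buffer.items), buffer.seen)
    print(buffer.sample(4))
```

Uniform retention is only a starting point. It treats old and new evidence alike, so on its own it says nothing about when to consolidate and when to forget.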

Two anchor settings

Aerial inspection of civil infrastructure is the small-and-adaptive case: a heavy pre-trained prior is a poor fit on a power-constrained drone, so lightweight representations that adapt aggressively on-device are likely the right architecture. Manipulation on VLA-class backbones is the complementary case: the heavy prior is the right starting point, and adaptation lives on top of it. The framework does not pick one answer; it makes explicit the question of how heavy the prior should be and where adaptation sits relative to it.

Hardware as a first-class concern

Adaptation under real latency and energy budgets is not a deployment afterthought; it shapes which algorithms are even admissible. Co-designing the learning rule with the substrate it runs on (neuromorphic or otherwise) is how the millisecond and seconds scales become affordable at the edge at all.

More models are on the way — this section will grow as I write the others up. The full research manifesto goes deeper into each claim.