I. Prior
v. 2025.08 · Concept origin, core philosophy, and design principles.
At the heart of foresight lies a question of connection. Every new observation is a landmark in a vast, mostly unmapped territory, and our task is to understand how it re-orients our entire map of what lies ahead.
In an interconnected system, when we observe an event concerning entity $A$ at time $t$, how should this new piece of evidence – together with what we already know – inform our belief about the future of a distant entity, $B$, at a later time $t+h$?
The connection is rarely a simple line of direct cause and effect or surface-level correlation. Rather, $A$ and $B$ are two nodes in a complex web of dependencies, their worldlines intertwined by causal chains, shared drivers, and feedback loops.
My old, insufficient map for this web was the simple intuition of 'context.' This project, Stratium, is driven by the conviction that we can be more precise. Here we will draw a better-calibrated map: a causal graph that serves as a principled map of dependencies, guiding how evidence from one corner of the world should reshape our probabilistic beliefs about another.
- 1. Foundation. The world behaves causally: every observable change arises from underlying mechanisms acting on prior conditions (plus "noise"). At the resolution of any finite model, these causal mechanisms appear as conditional probability distributions, because the model omits unobserved variables and aggregates finer-scale causes.
- 2. Observation. Many observed correlations are projections of a higher-dimensional causal process onto a smaller measured space: once latent variables are marginalized out, dependence appears among the observables. In practice, such correlations may arise from (i) causal paths (direct or mediated), (ii) shared drivers (backdoor/confounding), (iii) selection/collider effects, (iv) structural or measurement coupling, or (v) sampling artifacts (finite samples, nonstationarity).
- 3. Prediction in an interconnected system. Our prediction target is the conditional distribution over a future variable (or a set of future variables) given all information available to us at time $t$: $p(y_{t+h} \mid \mathcal{F}_t)$, where $\mathcal{F}_t$ denotes the information available at $t$. Practically, we forecast using a learned state $s_t$: a (deterministic) summary of the information available at time $t$ together with beliefs inferred from it (e.g., latent drivers). $s_t$ is chosen to preserve the predictive information in $\mathcal{F}_t$ and to maximize out-of-sample performance. The remaining design choice is where we fuse information to construct $s_t$ and produce forecasts. Two practical aggregation paradigms follow:
- 1. Late fusion ("ensemble of predictors"). Train many separate, task- or signal-specific predictors on (possibly overlapping) projections of $\mathcal{F}_t$, then combine their outputs (e.g., via a linear/nonlinear ensemble) near the end to form $s_t$, or to predict $y_{t+h}$ directly. Effectively, $s_t$ is the stacked outputs of the separate predictors plus a meta-learner. This approach attacks the prediction problem with flexible pattern recognition.
- Pros: (i) modular & fast to ship (easy to add/retire predictors; great for building incrementally); (ii) strong in stable regimes (with abundant data and well-behaved signals, a well-regularized ensemble can match or exceed the performance of structural methods).
- Risks: (i) dependence/double-counting (many observed signals share confounding drivers; without modeling shared latent drivers or proper de-correlation, the meta-learner can over-weight redundant signals); (ii) larger effective hypothesis class (experts + combiner raise variance, especially under regime shift).
- 2. Early fusion ("joint modeling"). Infer a joint global state $S_t$ (a "world state") by assimilating observations jointly under specified mechanisms and constraints, then apply target-specific readouts to $S_t$ to form $p(y_{t+h} \mid \mathcal{F}_t)$. $S_t$ is a shared/latent state updated by time-stamped observations along admissible mechanisms; observable and latent drivers are represented explicitly, preventing double-counting.
- Pros: (i) dependence control at the source (shared drivers are reconciled during assimilation, not at the output stage); (ii) joint latent-state inference (many true drivers are unobserved and need explicit inference; a "world state" lets you infer them inside one shared, temporally and mechanistically governed structure, updated along admissible edges); (iii) transport under regime shift (when mechanisms are correctly specified and overfitting is controlled, constraints aligned with the data-generating graph improve robustness when shifts alter marginal distributions more than the underlying mechanism; when mechanisms truly change, both paradigms require rapid schema/param adaptation); (iv) auditability (path and node attribution for explainable diagnosis).
- Risks: (i) schema misspecification (node/edge schema may be misspecified, especially when point-in-time updates fail to track real-world changes; this can suppress true signal or introduce spurious pathways); (ii) data sparsity (sparse nodes/relations can be underlearned without careful priors or param sharing; uneven observation density can skew learning); (iii) state dynamics handling (drift, decay, error accumulation, and identifiability issues can hurt calibration and interpretability).
These two paradigms are not binary but points on a spectrum defined by when and how information is fused and reconciled: early fusion (e.g., graph assimilation) ⟶ hybrid (e.g., graph → readout ensemble) ⟶ late fusion (e.g., independent experts → meta-learner).
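To make the contrast concrete, here is a minimal Python sketch of the two ends of the spectrum. The interfaces (the expert list, `assimilate`, `readout`, `meta_learner`) are hypothetical stand-ins, not Stratium components.

```python
# Minimal sketch contrasting the two fusion paradigms (illustrative only;
# all function and type names below are hypothetical stand-ins).
from typing import Callable, Dict, List

Obs = Dict[str, float]  # observations available at time t (a projection of F_t)

def late_fusion(obs: Obs,
                experts: List[Callable[[Obs], float]],
                meta_learner: Callable[[List[float]], float]) -> float:
    """Late fusion: s_t is the stacked expert outputs; a meta-learner maps them to a forecast."""
    stacked = [expert(obs) for expert in experts]  # each expert sees its own projection of F_t
    return meta_learner(stacked)

def early_fusion(obs: Obs,
                 prev_state: Dict[str, float],
                 assimilate: Callable[[Dict[str, float], Obs], Dict[str, float]],
                 readout: Callable[[Dict[str, float]], float]) -> float:
    """Early fusion: S_t is a shared state updated along admissible edges; readouts are target-specific."""
    state_t = assimilate(prev_state, obs)  # shared drivers reconciled here, at the source
    return readout(state_t)                # a target-specific readout then produces the forecast
```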
- 4. My (testable) prior: a causal graph for a partially observed world. In practice, forecasting hinges on the state $s_t$. The question is how to construct $s_t$ so that it preserves predictive information, controls dependence, and transports across regimes – maximizing OOS performance. My prior is that early fusion — constructing $s_t$ via a mechanism-constrained structure — will, when correctly specified and not overfit, deliver better calibration and transport than pure late-fusion ensembles. Intuition. The world is causal but partially observed; observables are projections mixing shared drivers, selection, and measurement constraints. Hence any robust prediction model must contend with latent variables and shared drivers. A world-state (early-fusion) model is architecturally designed for this: it provides a central structure to model shared drivers explicitly, infer latents, and expose paths for interpretability. By integrating evidence within a mechanism-constrained structure, such a model's information flow aligns more closely with the real system's channels for propagating influence. Target taxonomy & why hybridization. Consider the nature of our prediction targets. The primary axis determining a target's difficulty is not its complexity alone, but the observability of its direct causal drivers:
- Structurally observable targets. When key causal parents are at least partially visible with governed lags (e.g., policy decisions, governed corporate actions), a mechanism-aligned, correctly specified world state improves forecast calibration by explicitly modeling the data-generating process.
- Latent-dominated targets. For some events – especially reflexive targets like short-horizon returns – the most influential drivers (e.g., other players in the market) are unobserved. The outcome is an "equilibrium-compressed" result of these hidden states. A purely observed-node causal model will underperform because key drivers are latent. We hybridize early- and late-fusion: estimate a latent world state from the observed nodes/edges, then learn a set of calibrated readouts for latent-dominated targets (e.g., returns) and predict from their ensemble. Here, a robust world state can serve as the feature-generation engine for a specialized ensemble, combining structural integrity with predictive flexibility.
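A minimal sketch of that hybrid, under assumed interfaces (the `WorldState` alias, `feature_maps`, and `readout_heads` are illustrative names): the world state supplies graph-derived features, and an ensemble of readout heads produces the final forecast.

```python
# Hybrid sketch: an early-fusion world state feeds a late-fusion ensemble of
# calibrated readouts for a latent-dominated target (e.g., short-horizon returns).
# All names are illustrative assumptions.
from statistics import mean
from typing import Callable, Dict, List

WorldState = Dict[str, float]  # stand-in for S_t (node/edge/latent summaries)

def hybrid_forecast(state_t: WorldState,
                    feature_maps: List[Callable[[WorldState], List[float]]],
                    readout_heads: List[Callable[[List[float]], float]]) -> float:
    """The world state acts as the feature-generation engine; heads are combined by a simple ensemble."""
    head_forecasts = [head(feats(state_t))       # each head sees graph-derived features
                      for feats, head in zip(feature_maps, readout_heads)]
    return mean(head_forecasts)                  # equal-weight ensemble; a meta-learner could replace this
```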
Structural layer. In practice, the world state requires an explicit structure that governs how information flows — which variables can influence which others, and how (e.g., via what form, on what lag). We adopt a causal graph – a network of entities (nodes) and admissible relations (edges) – for this structural layer. A causal graph renders the system's constraints explicit and testable, acting both as a regularizer to reduce variance and as an inference engine to support joint latent-state estimation. The question hence becomes: how do we build a graph whose structure mirrors the true causal architecture of the world, thereby providing the correct grammar for the predictive engine's reasoning?
Assumptions & terminology.
- Time & process. Discrete time $t = 1, 2, \dots$. We approximate the data-generating process by a dynamic structural causal model. The graph is specified as a rolled time-slice schema and analyzed in its unrolled form (a copy per time step with only forward-in-time edges), so cycles may exist across time but not within a single time slice.
- Mechanisms. For any variable $X_t$, $X_t = f_X(\mathrm{Pa}(X_t), \varepsilon_{X,t})$, where $\mathrm{Pa}(X_t)$ are its true (lagged) causal parents (observed parents $\mathrm{Pa}^{o}(X_t)$ and latent parents $\mathrm{Pa}^{\ell}(X_t)$; i.e., $\mathrm{Pa}(X_t) = \mathrm{Pa}^{o}(X_t) \cup \mathrm{Pa}^{\ell}(X_t)$) and $\varepsilon_{X,t}$ is exogenous noise, independent of $\mathrm{Pa}(X_t)$. This induces the conditional distribution $p(X_t \mid \mathrm{Pa}(X_t))$.
- Admissible parents of $X_t$ are the (lagged) parents that, in the unrolled graph, d-separate $X_t$ from all other variables: by the causal Markov assumption (graphical separation ⇒ conditional independence), conditioning on them blocks every chain/fork path to $X_t$ and leaves collider paths unopened, so no other variable adds predictive information. Under causal Markov and correct temporal ordering, the minimal sufficient information for estimating $p(X_t \mid \mathcal{F}_{t-1})$ is $X_t$'s (lagged) parent set — the admissible parents in the data-generating graph, evaluated with data up to $t-1$.
- Schema vs state. The (versioned) causal graph schema is $G = (V, E)$, where $V$ is the collection of entities (nodes/vertices) and $E$ is the set of admissible relations (edges) among the nodes. The world state at time $t$ is $S_t = (\{x_{v,t}\}_{v \in V}, \{r_{e,t}\}_{e \in E}, z_t)$, i.e., node states $x_{v,t}$, relation/edge states $r_{e,t}$, and optional global latents $z_t$. $S_t$ represents our time-$t$ belief over both observed and latent node/edge variables. Later, we may refer to the grounded graph $(G, S_t)$ when emphasizing the schema populated with current values.
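As a concrete illustration, one possible encoding of the schema $G = (V, E)$ and the world state $S_t$ in plain Python. The field names and types are assumptions for exposition, not a fixed Stratium interface.

```python
# One possible encoding of the schema G = (V, E) and the world state S_t.
# Field names and types are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class Relation:
    source: str     # parent node id
    target: str     # child node id
    lag: int        # admissible lag in time steps (forward-in-time only)
    mechanism: str  # label for the mechanism family (e.g., a hypothetical "rate_passthrough")

@dataclass(frozen=True)
class Schema:
    version: str               # versioned ontology
    nodes: List[str]           # V: entities
    relations: List[Relation]  # E: admissible (lagged) relations

@dataclass
class WorldState:
    t: int                                              # time index
    node_states: Dict[str, float]                       # x_{v,t}: per-node belief summary
    edge_states: Dict[Tuple[str, str], float]           # r_{e,t}: per-relation belief summary
    global_latents: Optional[Dict[str, float]] = None   # z_t: optional shared latents
```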
Causal graph vs unstructured "context." A causal graph is a structured form of context. It encodes admissible dependencies and mechanisms — who may influence whom and on what timescale — and excludes impossible ones. This introduces an inductive bias that narrows the hypothesis class for $s_t$ to mechanistically and temporally admissible relations.
Graph resolution. A causal graph is not a claim that entities are ontologically ultimate, but a coordinate chart in a "conglomerate" world: a pragmatic partition of reality into variables and interactions at the resolution we can measure. We start coarse and operational because predictive stability comes from signals that are (i) measurable with stable labels, (ii) identifiable (high SNR, repeatable across time), and (iii) transportable under modest regime shifts. We will refine nodes/edges as we evolve the graph ontology.
Deference criterion. De-prefer the explicit graph if it is predictively and informationally subsumed – i.e., if it is strictly dominated on the prediction task by another architecture (e.g., an implicit-structure model) and adds no incremental predictive power or interpretability insights.
Let's now instantiate these principles in macro-finance – where asset returns offer a stringent evaluation of the method's calibration and potential.
Hence Stratium: a predictive causal graph system for the macro-finance domain. It is a working research project – structure is a testable prior; representation and mechanism are incrementally revamped; module retention is decided by out-of-sample validation.
Operationalizing the causal graph requires three ingredients:
- 1. To describe the graph at any static snapshot: an ontology of nodes (entities) and edges (admissible relations), versioned.
- 2. To make the graph "flow": a world state representation $S_t$, updated by timestamped events along those nodes and edges under the graph's information-propagation mechanism.
- 3. To make the graph trainable: a prediction/evaluation loop that trains the graph to forecast the next (partial) world state by (1) scoring the next edge activations (i.e., events) and selected node outcomes over fixed horizons and (2) evaluating against realizations, under strict chronological replay.
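A minimal sketch of ingredient 3, assuming a model object with `assimilate` and `predict` methods (hypothetical names): the timeline is replayed strictly forward, forecasts are issued at each step, and each forecast is scored with a proper scoring rule (Brier score here) once its horizon arrives.

```python
# Strict chronological replay with a proper scoring rule (Brier score).
# The model interface (.assimilate, .predict) is an assumption for illustration.
from typing import Dict, List, Tuple

def brier(p: float, y: int) -> float:
    """Proper scoring rule for a binary event forecast (lower is better)."""
    return (p - y) ** 2

def replay_evaluate(model,                             # assumed: .assimilate(events), .predict(edge, horizon) -> prob
                    timeline: List[Tuple[int, list]],  # [(t, events_t), ...] in strictly increasing t
                    scored_edges: List[str],
                    horizon: int,
                    realizations: Dict[Tuple[str, int], int]) -> float:
    """Walk the timeline forward only; a forecast made at t is scored when t + horizon arrives."""
    pending: List[Tuple[str, int, float]] = []         # (edge, due time, forecast probability)
    scores: List[float] = []
    for t, events_t in timeline:
        due = [f for f in pending if f[1] <= t]        # 1) score forecasts whose horizon has arrived
        pending = [f for f in pending if f[1] > t]
        scores += [brier(p, realizations.get((edge, due_t), 0)) for edge, due_t, p in due]
        model.assimilate(events_t)                     # 2) assimilate this step's events into the state
        pending += [(edge, t + horizon, model.predict(edge, horizon))
                    for edge in scored_edges]          # 3) issue new forecasts for t + horizon
    return sum(scores) / max(len(scores), 1)           # mean Brier score over resolved forecasts
```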
- LLMs assist in task-specific, auditable roles; each role applies multi-agent agreement selectively.
- Initial schema construction: We partition schema growth into micro-tasks (e.g., per-cohort nodes, per-pair relations). A proposer suggests ontology items (nodes, admissible relations, lags) with evidence; a critic challenges and validates.
- Point-in-time graph state update: For every time step $t$, a retriever selects relevant documents from designated data sources; an extractor performs constrained decoding (into a deterministic JSON schema) to produce timestamped, structured events; a validator double-checks the quality of the extracted events. Validated events are assimilated into $S_t$ under the graph's information-flow rules (see the sketch after this list).
- Forecasting heads remain non-LLM statistical models, in order to:
- produce calibrated probabilities/quantiles under proper scoring rules;
- let us impose and audit structure (lags, monotonicity, sparsity) and run clean ablations/attribution;
- preserve point-in-time discipline by avoiding LLMs’ parametric look-ahead risk;
- stay deterministic and lightweight at prediction time – stable and easy to version-control.
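For the point-in-time update role above, a minimal sketch of the extract-and-validate contract. The field set, the `edge` encoding, and the function name are illustrative assumptions: the extractor must emit events in an exact JSON shape, and the validator rejects anything malformed, off-schema, or stamped in the future.

```python
# Sketch of the point-in-time event contract referenced in the update role above.
# EVENT_FIELDS and the edge encoding are illustrative assumptions.
import json
from datetime import datetime, timezone
from typing import Optional, Set, Tuple

EVENT_FIELDS = {"timestamp", "source_doc", "edge", "event_type", "magnitude", "confidence"}

def validate_event(raw: str, admissible_edges: Set[Tuple[str, str]]) -> Optional[dict]:
    """Return the parsed event if it passes basic checks, else None (reject)."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None                                    # malformed output from the extractor
    if set(event) != EVENT_FIELDS:
        return None                                    # constrained decoding: exact field set required
    if tuple(event["edge"]) not in admissible_edges:
        return None                                    # only schema-admissible relations may fire
    try:
        ts = datetime.fromisoformat(event["timestamp"])
    except (TypeError, ValueError):
        return None                                    # unparseable timestamp
    if ts.tzinfo is None or ts > datetime.now(timezone.utc):
        return None                                    # enforce point-in-time discipline (no future stamps)
    return event                                       # ready to be assimilated into the world state
```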
Architecture blueprint in detail: representation, information flow, and training.