Week of May 10 – May 17, 2026
Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
Primary: Not provided in paper text
All Institutions: Not provided in paper text
EntityBench has the potential to become a foundational benchmark for multi-shot video generation, much as VBench became a standard for single-shot generation. By providing a standardized, large-scale dataset and metric suite, it could accelerate research in automated storytelling, previsualization, and long-form content creation. The "structural disagreement" it exposes between embedding-based and LLM-based consistency metrics should help guide future metric development, and EntityMem's results with explicit per-entity memory point to a way of improving long-range consistency that goes beyond implicit architectural solutions. The work directly addresses a critical challenge for making video generation models practically useful for narrative content.
In summary, the paper introduces EntityBench, a comprehensive benchmark for evaluating entity consistency in multi-shot video generation, paired with a three-pillar evaluation framework and a strong baseline method, EntityMem. It gives the field a much-needed standardized tool for diagnosing and improving long-range entity consistency, and it offers valuable insight into the limitations of current evaluation metrics.
The paper introduces EntityBench, a carefully constructed benchmark and evaluation framework, alongside EntityMem, a memory-augmented generation system. EntityBench's data construction starts from real narrative media and uses a multi-stage LLM-based refinement pipeline for entity extraction, linking, script refinement, and verification. This yields natural entity dynamics and complex narrative structures, overcoming limitations of purely LLM-generated prompts. The explicit per-shot entity schedules for characters, objects, and locations across easy/medium/hard tiers (up to 50 shots, 48-shot recurrence gaps) represent a significant step up in benchmark complexity and realism.
The three-pillar evaluation framework is rigorous. Pillar 1 assesses intra-shot quality using established metrics (e.g., VBench-inspired). Pillar 2 focuses on intra-shot prompt-following alignment, using GroundingDINO for entity localization and a multimodal LLM (Gemini-2.5-Pro) for per-entity fidelity and action fidelity, with type-specific criteria. Pillar 3 tackles cross-shot consistency using both DINOv2 embedding similarity and LLM pairwise judgment, and it introduces a "fidelity gate" that admits only accurately rendered entities into cross-shot scoring. This gate is a key methodological innovation: it prevents methods from achieving high consistency scores on incorrect or static renderings. The human validation study for LLM judgment further strengthens the credibility of the evaluation.
EntityMem, the proposed baseline system, is a well-designed, explicit approach to entity consistency. It operates in three stages: (1) VLM-based agents generate, select, and verify per-entity visual references (portraits, panoramas) before video generation, storing them in a persistent memory bank. (2) A Layout Agent composes keyframes based on narrative action, entity schedules, and previous shot context. (3) A memory-augmented video backbone retrieves these curated references, disentangling entity identity from scene context. Pre-generation verification and explicit memory management mark a clear departure from implicit consistency methods.
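To make the per-shot entity schedules and recurrence gaps described above concrete, here is a minimal sketch. The schedule format (a list of entity-name sets, one per shot), the entity names, and the function `recurrence_gaps` are all hypothetical. A gap is taken here to be the difference in shot index between consecutive appearances of the same entity; the paper's exact definition is not given in this summary.

```python
from collections import defaultdict


def recurrence_gaps(schedule):
    """Map each recurring entity to the gaps between its consecutive appearances.

    schedule: list indexed by shot; each item is a set of entity names
              (characters, objects, locations) scheduled for that shot.
    """
    last_seen = {}
    gaps = defaultdict(list)
    for shot_index, entities in enumerate(schedule):
        for entity in sorted(entities):
            if entity in last_seen:
                gaps[entity].append(shot_index - last_seen[entity])
            last_seen[entity] = shot_index
    return dict(gaps)


# Hypothetical four-shot episode.
schedule = [
    {"alice", "kitchen"},
    {"bob", "kitchen"},
    {"bob", "street"},
    {"alice", "kitchen", "red_mug"},
]
gaps = recurrence_gaps(schedule)
longest_gap = max(max(g) for g in gaps.values())
```

Entities that appear only once ("street" and "red_mug" here) have no gaps and are absent from the result, so only cross-shot entities contribute to consistency scoring in this sketch.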
The experimental evaluation is comprehensive, using the full EntityBench benchmark (2,491 shots across 140 episodes) to evaluate three representative open-source SOTA methods, HoloCine (holistic), CineTrans (holistic), and StoryMem (two-stage keyframe-then-animate), alongside the proposed EntityMem. The evaluation reports all 51 metrics across the three pillars, with fidelity-gate-corrected aggregation. Key findings:
1. **EntityMem's Dominance**: EntityMem outperforms the baselines on entity-centric prompt-following (Pillar 2), especially for characters (e.g., face_fidelity 0.740 vs. StoryMem's 0.452, character presence 0.967 vs. HoloCine's 0.882). This supports the explicit per-entity memory approach.
2. **Embedding vs. LLM Disagreement**: Pillar 3 reveals a "structural disagreement" between embedding-based metrics (DINOv2) and LLM identity judgment. While StoryMem leads on DINOv2 cosine similarity for faces and objects, EntityMem dominates LLM-judged identity metrics (e.g., llm_face_accuracy 0.406 vs. StoryMem's 0.226). High embedding similarity therefore does not always equate to human-perceived identity preservation, which underlines the value of LLM-based evaluation.
3. **Quality vs. Consistency Trade-off**: Holistic methods (CineTrans, HoloCine) often achieve higher visual quality (Pillar 1 VBench metrics) but struggle with entity consistency, demonstrating that these are distinct challenges. EntityMem prioritizes identifiable and prompt-aligned entities, even if it does not lead on all general quality metrics.
4. **Limitations of EntityMem**: EntityMem shows a regression in object fidelity compared to StoryMem, suggesting that its current memory management might be less effective for scene-bound objects than for character-attached items, or that its base model has integration challenges. This is a valuable diagnostic finding.
The scale of the evaluation and the depth of analysis across 51 metrics provide a robust and nuanced understanding of current multi-shot video generation capabilities and their limitations.
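The fidelity-gate-corrected aggregation mentioned above can be illustrated with a small sketch. It assumes one embedding vector and one fidelity score per appearance of an entity; the function names, the `threshold` value, and the choice of mean pairwise cosine similarity are illustrative assumptions, not the paper's actual formulation.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def gated_consistency(embeddings, fidelity, threshold=0.5):
    """Mean pairwise cosine similarity over appearances that pass the gate.

    embeddings: one vector per shot in which the entity appears
    fidelity:   per-appearance fidelity scores in [0, 1]
    threshold:  hypothetical gate cutoff (not taken from the paper)
    Returns None when fewer than two appearances pass the gate.
    """
    kept = [e for e, f in zip(embeddings, fidelity) if f >= threshold]
    if len(kept) < 2:
        return None
    sims = [cosine(a, b) for a, b in combinations(kept, 2)]
    return sum(sims) / len(sims)


# Three identical renderings, but only the first is judged faithful to the prompt.
static_wrong = gated_consistency([[1, 0], [1, 0], [1, 0]], [0.9, 0.1, 0.1])
# Two faithful appearances and one unfaithful outlier.
mostly_right = gated_consistency([[1, 0], [1, 0], [0, 1]], [0.9, 0.8, 0.1])
```

In the first call the three renderings are perfectly self-consistent, yet the gate leaves no pair to score, which is the kind of "consistently wrong" case the gate is meant to exclude.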
The paper demonstrates a strong commitment to reproducibility. The authors explicitly state that "Code and data are available at https://github.com/Catherine-R-He/EntityBench/." The appendix provides extensive details on benchmark statistics, evaluation metrics (formal definitions, grounding parameters, LLM prompts), and pipeline details for EntityMem. The human validation study for LLM judgment adds transparency to the evaluation process. The explicit mention of computational resources (two nodes with 8 NVIDIA L20 GPUs) provides context for the scale of experiments. The detailed breakdown of the data construction process, including LLM agents and tools used, further aids reproducibility.
1. **Object Fidelity Regression**: EntityMem shows a notable drop in object fidelity compared to StoryMem, indicating that its per-entity memory approach might not generalize equally well to all entity types or contexts (e.g., scene-bound objects vs. characters).
2. **Computational Cost**: The benchmark is large (2,491 shots), and the evaluation suite, especially with multimodal LLM judgments and multiple baselines, is computationally intensive. This might limit widespread adoption for rapid iteration, though it's necessary for rigorous evaluation.
3. **LLM Reliance**: While human-validated, the reliance on LLMs for critical judgment tasks (per-entity fidelity, cross-shot identity) introduces a dependency on LLM capabilities and potential biases, even with validation.
4. **Placeholder Venue**: The mention of "iclr2026_conference" suggests the paper is currently an arXiv preprint and has not yet undergone full peer review at a major conference, which is a minor limitation for a comprehensive evaluation.
5. **Institution Information**: The paper text does not contain author affiliations, which is unusual and prevents assessment of the institutional backing of the research.
Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, University of Hong Kong, Columbia University
VECA offers a building block for Vision Transformers that addresses quadratic computational scaling, making ViTs more practical for high-resolution imagery and real-time applications. Elastic inference, enabled by nested training, is particularly useful because it allows a dynamic trade-off between compute and accuracy across diverse hardware and latency constraints. The work also challenges the assumption that direct pairwise token interactions are necessary for rich visual representations, which may open new avenues for designing efficient attention mechanisms. Its results across classification, segmentation, and detection suggest broad applicability, potentially accelerating the adoption of ViTs in domains such as medical imaging, autonomous driving, and high-resolution video analysis, where efficiency is paramount.
In summary, VECA introduces an elastic core-periphery attention mechanism with complexity linear in the number of patches, and it reports competitive performance across diverse vision tasks at lower computational cost, with flexible compute-accuracy trade-offs. The architecture is well motivated and empirically strong, and it targets a critical scalability bottleneck in Vision Transformers.
The paper proposes Visual Elastic Core Attention (VECA), an innovative Vision Transformer architecture designed to overcome the quadratic scaling limitations of traditional self-attention. The core idea is to replace direct all-to-all patch interactions with an indirect communication mechanism mediated by a small, fixed set of learned "core" tokens. Specifically, the VECA block introduces Core-Periphery Attention (CPA), where patch tokens interact only with core tokens (Patch-to-Core attention), and core tokens interact with patch tokens (Core-to-Patch attention). Crucially, the core tokens are not derived from the input patches at each layer but are learned from scratch and propagated across layers, acting as a persistent communication interface. This design yields linear computational complexity $O(N \cdot C \cdot D)$ with respect to the number of patches $N$ (for fixed core count $C$ and dimension $D$), making it highly scalable for high-resolution images. A significant methodological contribution is the "nested training along the core axis," which allows the model to be trained with multiple core counts simultaneously. This enables elastic inference, where the number of active core tokens can be adjusted at test time to trade off compute for accuracy, a highly practical feature. The architecture is well-motivated and clearly described, building on ideas from Perceiver-like models but distinguishing itself by maintaining and iteratively updating the full set of input tokens, avoiding a bottleneck.
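The core-periphery idea described above can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: it uses a single head with no learned projections, and the update order, the residual additions, the toy sizes, and the reading of elastic inference as "use a prefix of the cores" are all assumptions of the sketch.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def add(xs, ys):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def core_periphery_block(patches, cores):
    """One block: cores read from patches, then patches read from cores.

    Only C x N and N x C score matrices are formed, never N x N.
    The ordering and the residual additions are assumptions of this sketch.
    """
    cores = add(cores, attend(cores, patches, patches))
    patches = add(patches, attend(patches, cores, cores))
    return patches, cores


random.seed(0)
N, C, D = 12, 4, 8  # hypothetical sizes: patches, cores, channels
patches = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
cores = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]

full_patches, full_cores = core_periphery_block(patches, cores)
# Elastic inference, assumed here to mean running with a prefix of the cores.
lite_patches, lite_cores = core_periphery_block(patches, cores[:2])
```

Both calls return all N patch tokens, which reflects the point made above that the full token set is maintained and updated rather than compressed into the cores. With a single core, every patch would receive the same update, which shows why the core count limits how much distinct information patches can exchange per block.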
The experimental evaluation is comprehensive and rigorous, covering standard vision tasks: ImageNet-1K classification, ADE20K semantic segmentation, and COCO object detection/instance segmentation. VECA models (Tiny, Small, Base) are compared against strong baselines including DeiT, Swin Transformer, ConvNeXt, PVT, CoAtNet, and Perceiver. For ImageNet-1K classification, VECA-Base achieves 83.6% top-1 accuracy, competitive with Swin-B (83.3%) and ConvNeXt-B (83.8%), while demonstrating superior throughput and often lower FLOPs, especially at higher resolutions. On dense prediction tasks, VECA-Base integrated into UperNet for ADE20K segmentation achieves 49.6 mIoU, on par with Swin-B (49.5) and close to ConvNeXt-B (49.9). For COCO object detection/instance segmentation with Mask R-CNN, VECA-Base achieves 49.0 box AP / 42.9 mask AP, again competitive with Swin-B (49.0/42.8) and ConvNeXt-B (49.6/43.1). The results consistently show that VECA is competitive with these baselines, though it trails ConvNeXt-B slightly on each reported metric, while improving computational efficiency and scalability. Ablation studies validate key design choices, including the impact of core count, core initialization, core propagation, and the effectiveness of nested training for elastic inference. Visualizations of core attention provide further insight into how cores learn to attend to different semantic regions.
The paper provides sufficient architectural details, training configurations, and hyperparameters in the main text and appendix for the core components of VECA. Standard datasets and established frameworks (UperNet, Mask R-CNN) are used for dense tasks. While no explicit code repository URL is provided in the paper, the level of detail suggests that a diligent researcher should be able to reproduce the main results. The use of common benchmarks and clear descriptions of the methodology contribute positively to reproducibility.
The paper does not explicitly list limitations. One potential limitation is that while the core tokens are learned and propagated, the fixed number of cores ($C$) might still represent a bottleneck for extremely complex scenes or tasks requiring very fine-grained global interactions, although the paper demonstrates strong performance across various tasks. The "no direct patch-to-patch interaction" claim, while empirically supported, implies an indirect interaction through the cores, which still allows for information flow across patches. The optimal choice of $C$ for different tasks and resolutions might require some tuning, although the nested training helps mitigate this by providing flexibility. While efficient, the overall complexity is still $O(NCD)$, which is linear in $N$ but still depends on $C$ and $D$.
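The dependence on $N$, $C$, and $D$ noted above can be made concrete with a rough cost count. The sketch below counts only the attention score and weighted-sum multiply-adds and ignores the linear projections, which cost the same order in both designs. The patch size of 16, the values of `C` and `D`, and the function names are hypothetical and not taken from the paper.

```python
def num_patches(resolution, patch_size=16):
    """Patch count for a square image (patch size 16 is an assumed default)."""
    return (resolution // patch_size) ** 2


def self_attention_cost(n, d):
    """Multiply-adds for the N x N score matrix plus the weighted sum of values."""
    return 2 * n * n * d


def core_attention_cost(n, c, d):
    """Multiply-adds for both directions of patch/core attention."""
    return 2 * (2 * n * c * d)


C, D = 64, 768  # hypothetical core count and width, not taken from the paper
for resolution in (224, 512, 1024):
    n = num_patches(resolution)
    ratio = self_attention_cost(n, D) / core_attention_cost(n, C, D)
    print(f"{resolution}px  N={n:5d}  self/core cost ratio = {ratio:.2f}")
```

Under these assumptions the ratio simplifies to $N / (2C)$, so the advantage grows linearly with the number of patches and shrinks as the core count grows.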
In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone-like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long-term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long-sequence modeling tasks.
This paper introduces the concept of spontaneous symmetry breaking and Goldstone modes as a mechanism for stable, coherent information propagation in deep neural networks, offering an alternative to traditional architectural stabilizers. The work presents a highly novel theoretical framework, drawing a deep analogy from physics to address the fundamental challenge of information flow in deep and recurrent networks. By proposing that Goldstone-like degrees of freedom can enable stable signal propagation without relying on established techniques like residual connections or normalization, it offers a potentially transformative new design principle for neural architectures. The analytical and empirical demonstrations, if robust, suggest a significant technical contribution that could lead to improved trainability, representational diversity, and long-term memory capabilities, thereby influencing how practitioners approach the design of deep learning models.
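To make "equivariant under a continuous symmetry" concrete, here is a toy check. The summary does not describe the paper's architecture, so the layer below is a hypothetical construction: a 2D layer that commutes with every rotation, meaning that rotating the input and then applying the layer gives the same result as applying the layer and then rotating the output. It illustrates the symmetry property only and does not model Goldstone modes themselves.

```python
import math


def rotate(v, theta):
    """Rotate a 2D vector by angle theta (an element of the group SO(2))."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def equivariant_layer(v, a=0.8, b=0.3):
    """Toy layer: a rotation-commuting linear map followed by a norm-based gain.

    The matrix [[a, -b], [b, a]] commutes with every 2D rotation, and the
    gain depends only on the vector's length, which rotations preserve.
    The weights a and b are arbitrary illustrative values.
    """
    x = a * v[0] - b * v[1]
    y = b * v[0] + a * v[1]
    norm = math.hypot(x, y)
    gain = math.tanh(norm) / norm if norm > 0 else 1.0
    return (gain * x, gain * y)


def equivariance_error(v, theta):
    """Distance between layer(rotate(v)) and rotate(layer(v))."""
    left = equivariant_layer(rotate(v, theta))
    right = rotate(equivariant_layer(v), theta)
    return math.hypot(left[0] - right[0], left[1] - right[1])


errors = [equivariance_error((1.5, -0.4), k * 0.37) for k in range(10)]
```

Because the rotation angle is a continuous parameter, the layer's outputs for a given input length form a whole circle of equally valid states, which is the kind of continuous family the abstract's physical analogy refers to.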
Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
Primary: Not provided in paper text
All Institutions: Not provided in paper text
EntityBench has the potential to become a foundational benchmark for multi-shot video generation, similar to how VBench became a standard for single-shot generation. By providing a standardized, large-scale, and rigorously evaluated dataset and metric suite, it will accelerate research in automated storytelling, previsualization, and long-form content creation. The insights into the "structural disagreement" between embedding-based and LLM-based consistency metrics are crucial for guiding future metric development. EntityMem's success in explicit per-entity memory management opens new avenues for improving long-range consistency in generative models, moving beyond implicit architectural solutions. This work directly addresses a critical challenge for making video generation models practically useful for narrative content. This paper introduces EntityBench, a comprehensive benchmark for evaluating entity consistency in multi-shot video generation, paired with a rigorous three-pillar evaluation framework and a strong baseline method, EntityMem. The work significantly advances the field by providing a much-needed standardized tool for diagnosing and improving long-range entity consistency, a critical challenge for generating coherent visual narratives, and offers valuable insights into the limitations of current evaluation metrics.
The paper introduces EntityBench, a meticulously constructed benchmark and evaluation framework, alongside EntityMem, a memory-augmented generation system. EntityBench's data construction is highly sophisticated, starting from real narrative media and employing a multi-stage LLM-based refinement pipeline for entity extraction, linking, script refinement, and verification. This ensures natural entity dynamics and complex narrative structures, overcoming limitations of purely LLM-generated prompts. The explicit per-shot entity schedules for characters, objects, and locations across easy/medium/hard tiers (up to 50 shots, 48-shot recurrence gaps) represent a significant leap in benchmark complexity and realism. The three-pillar evaluation framework is exceptionally rigorous. Pillar 1 assesses intra-shot quality using established metrics (e.g., VBench-inspired). Pillar 2 focuses on intra-shot prompt-following alignment, using GroundingDINO for entity localization and a multimodal LLM (Gemini-2.5-Pro) for per-entity fidelity and action fidelity, with type-specific criteria. Pillar 3 tackles cross-shot consistency using both DINOv2 embedding similarity and LLM pairwise judgment, crucially introducing a "fidelity gate" that only admits accurately rendered entities into cross-shot scoring. This gate is a critical methodological innovation, preventing methods from achieving high consistency scores on incorrect or static renderings. The human validation study for LLM judgment further strengthens the credibility of the evaluation. EntityMem, the proposed baseline system, is a well-designed, explicit approach to entity consistency. It operates in three stages: (1) VLM-based agents generate, select, and verify per-entity visual references (portraits, panoramas) before video generation, storing them in a persistent memory bank. (2) A Layout Agent composes keyframes based on narrative action, entity schedules, and previous shot context. 
(3) A memory-augmented video backbone retrieves these curated references, disentangling entity identity from scene context. This pre-generation verification and explicit memory management is a clear departure from implicit consistency methods.
The experimental evaluation is comprehensive, using the full EntityBench benchmark (2,491 shots across 140 episodes) to evaluate three representative open-source SOTA methods: HoloCine (holistic), CineTrans (holistic), and StoryMem (two-stage keyframe-then-animate), alongside their proposed EntityMem. The evaluation reports all 51 metrics across the three pillars, with fidelity-gate-corrected aggregation. Key findings are impactful: 1. **EntityMem's Dominance**: EntityMem significantly outperforms baselines on entity-centric prompt-following (Pillar 2), especially for characters (e.g., face_fidelity 0.740 vs. StoryMem's 0.452, character presence 0.967 vs. HoloCine's 0.882). This validates the explicit per-entity memory approach. 2. **Embedding vs. LLM Disagreement**: Pillar 3 reveals a crucial "structural disagreement" between embedding-based metrics (DINOv2) and LLM identity judgment. While StoryMem leads on DINOv2 cosine similarity for faces and objects, EntityMem dominates LLM-judged identity metrics (e.g., llm_face_accuracy 0.406 vs. StoryMem's 0.226). This highlights that high embedding similarity does not always equate to human-perceived identity preservation, emphasizing the value of LLM-based evaluation. 3. **Quality vs. Consistency Trade-off**: Holistic methods (CineTrans, HoloCine) often achieve higher visual quality (Pillar 1 VBench metrics) but struggle with entity consistency, demonstrating that these are distinct challenges. EntityMem prioritizes identifiable and prompt-aligned entities, even if not leading on all general quality metrics. 4. **Limitations of EntityMem**: EntityMem shows a regression in object fidelity compared to StoryMem, suggesting that its current memory management might be less effective for scene-bound objects vs. character-attached items, or that its base model has integration challenges. This is a valuable diagnostic finding. 
The scale of the evaluation and the depth of analysis across 51 metrics provide a robust and nuanced understanding of current multi-shot video generation capabilities and their limitations.
The paper demonstrates a strong commitment to reproducibility. The authors explicitly state that "Code and data are available at https://github.com/Catherine-R-He/EntityBench/." The appendix provides extensive details on benchmark statistics, evaluation metrics (formal definitions, grounding parameters, LLM prompts), and pipeline details for EntityMem. The human validation study for LLM judgment adds transparency to the evaluation process. The explicit mention of computational resources (two nodes with 8 NVIDIA L20 GPUs) provides context for the scale of experiments. The detailed breakdown of the data construction process, including LLM agents and tools used, further aids reproducibility.
1. **Object Fidelity Regression**: EntityMem shows a notable drop in object fidelity compared to StoryMem, indicating that its per-entity memory approach might not generalize equally well to all entity types or contexts (e.g., scene-bound objects vs. characters). 2. **Computational Cost**: The benchmark is large (2,491 shots), and the evaluation suite, especially with multimodal LLM judgments and multiple baselines, is computationally intensive. This might limit widespread adoption for rapid iteration, though it's necessary for rigorous evaluation. 3. **LLM Reliance**: While human-validated, the reliance on LLMs for critical judgment tasks (per-entity fidelity, cross-shot identity) introduces a dependency on LLM capabilities and potential biases, even with validation. 4. **Placeholder Venue**: The mention of "iclr2026_conference" suggests the paper is currently an arXiv preprint and has not yet undergone full peer review at a major conference, which is a minor limitation for a comprehensive evaluation. 5. **Institution Information**: The paper text does not contain author affiliations, which is unusual and prevents assessment of the institutional backing of the research.
Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, University of Hong Kong, Columbia University
VECA introduces an important building block for Vision Transformers that addresses the critical issue of quadratic computational scaling, making ViTs more practical for high-resolution imagery and real-time applications. The concept of elastic inference, enabled by nested training, is particularly impactful as it allows dynamic trade-offs between compute and accuracy, which is highly valuable for deployment on diverse hardware and latency constraints. This work challenges the fundamental assumption that direct pairwise token interactions are necessary for rich visual representations, potentially opening new avenues for designing efficient attention mechanisms. Its strong performance across classification, segmentation, and detection suggests broad applicability, potentially accelerating the adoption of ViTs in domains like medical imaging, autonomous driving, and high-resolution video analysis where efficiency is paramount. VECA introduces an elastic core-periphery attention mechanism that achieves linear complexity for Vision Transformers, demonstrating competitive performance across diverse vision tasks while significantly improving computational efficiency and enabling flexible compute-accuracy trade-offs. This paper presents a well-motivated and empirically strong architectural innovation that addresses a critical scalability bottleneck in Vision Transformers, making them more practical for high-resolution applications and offering a valuable elastic inference capability for real-world deployment.
The paper proposes Visual Elastic Core Attention (VECA), an innovative Vision Transformer architecture designed to overcome the quadratic scaling limitations of traditional self-attention. The core idea is to replace direct all-to-all patch interactions with an indirect communication mechanism mediated by a small, fixed set of learned "core" tokens. Specifically, the VECA block introduces Core-Periphery Attention (CPA), where patch tokens interact only with core tokens (Patch-to-Core attention), and core tokens interact with patch tokens (Core-to-Patch attention). Crucially, the core tokens are not derived from the input patches at each layer but are learned from scratch and propagated across layers, acting as a persistent communication interface. This design yields linear computational complexity $O(N \cdot C \cdot D)$ with respect to the number of patches $N$ (for fixed core count $C$ and dimension $D$), making it highly scalable for high-resolution images. A significant methodological contribution is the "nested training along the core axis," which allows the model to be trained with multiple core counts simultaneously. This enables elastic inference, where the number of active core tokens can be adjusted at test time to trade off compute for accuracy, a highly practical feature. The architecture is well-motivated and clearly described, building on ideas from Perceiver-like models but distinguishing itself by maintaining and iteratively updating the full set of input tokens, avoiding a bottleneck.
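To make the core-periphery idea concrete, the following is a minimal NumPy sketch of one such block, not the authors' implementation. Several things here are assumptions made for illustration: the function names, the `gather` and `broadcast` parameter keys, single-head attention, the order of the two attention steps, the residual connections, and the use of a prefix of the core set (`active_cores`) as a stand-in for elastic inference. The point it illustrates is only that every score matrix has shape $(C, N)$ or $(N, C)$, never $(N, N)$.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend to `keys_values`."""
    q = queries @ w_q                           # (Nq, D)
    k = keys_values @ w_k                       # (Nk, D)
    v = keys_values @ w_v                       # (Nk, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (Nq, Nk)
    return softmax(scores, axis=-1) @ v         # (Nq, D)


def core_periphery_block(patches, cores, params, active_cores=None):
    """One hypothetical block: cores read from patches, then patches read from cores.

    patches: (N, D) patch tokens; cores: (C, D) core tokens.
    active_cores: optionally keep only the first k cores (assumed form of
    elastic inference; the paper's exact scheme is not specified here).
    """
    if active_cores is not None:
        cores = cores[:active_cores]
    # Cores gather information from all patches: score matrix is (C, N).
    cores = cores + cross_attention(cores, patches, *params["gather"])
    # Patches read back from the updated cores: score matrix is (N, C).
    patches = patches + cross_attention(patches, cores, *params["broadcast"])
    # Both token sets are returned so they can be propagated to the next block.
    return patches, cores
```

Because patches only ever attend to the $C$ cores (and vice versa), doubling $N$ doubles the work in both steps, while reducing `active_cores` at inference time shrinks both score matrices proportionally.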
The experimental evaluation is comprehensive and rigorous, covering standard vision tasks: ImageNet-1K classification, ADE20K semantic segmentation, and COCO object detection/instance segmentation. VECA models (Tiny, Small, Base) are compared against strong baselines including DeiT, Swin Transformer, ConvNeXt, PVT, CoAtNet, and Perceiver. For ImageNet-1K classification, VECA-Base achieves 83.6% top-1 accuracy, competitive with Swin-B (83.3%) and ConvNeXt-B (83.8%), while demonstrating superior throughput and often lower FLOPs, especially when considering higher resolutions. On dense prediction tasks, VECA-Base integrated into UperNet for ADE20K segmentation achieves 49.6 mIoU, on par with Swin-B (49.5) and slightly below ConvNeXt-B (49.9). For COCO object detection/instance segmentation with Mask R-CNN, VECA-Base achieves 49.0 box AP / 42.9 mask AP, again competitive with Swin-B (49.0/42.8) and ConvNeXt-B (49.6/43.1). The results consistently show that VECA reaches accuracy competitive with these baselines, rather than clearly surpassing them, while significantly improving computational efficiency and scalability. Ablation studies thoroughly validate key design choices, including the impact of core count, core initialization, core propagation, and the effectiveness of nested training for elastic inference. Visualizations of core attention further provide insights into how cores learn to attend to different semantic regions.
The paper provides sufficient architectural details, training configurations, and hyperparameters in the main text and appendix for the core components of VECA. Standard datasets and established frameworks (UperNet, Mask R-CNN) are used for dense tasks. While no explicit code repository URL is provided in the paper, the level of detail suggests that a diligent researcher should be able to reproduce the main results. The use of common benchmarks and clear descriptions of the methodology contribute positively to reproducibility.
The paper does not explicitly list limitations. One potential limitation is that while the core tokens are learned and propagated, the fixed number of cores ($C$) might still represent a bottleneck for extremely complex scenes or tasks requiring very fine-grained global interactions, although the paper demonstrates strong performance across various tasks. The "no direct patch-to-patch interaction" claim, while empirically supported, implies an indirect interaction through the cores, which still allows for information flow across patches. The optimal choice of $C$ for different tasks and resolutions might require some tuning, although the nested training helps mitigate this by providing flexibility. While efficient, the overall complexity is still $O(NCD)$, which is linear in $N$ but still depends on $C$ and $D$.
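The $O(NCD)$ versus $O(N^2 D)$ distinction can be illustrated with a back-of-the-envelope count. The sketch below is an illustration under stated assumptions rather than a figure from the paper: it counts only the multiply-accumulates in the score product and the value-weighted sum (projections are linear in $N$ in both designs and are ignored), assumes two cross-attention steps per core-periphery block, and uses a hypothetical patch size of 16 and core count of 16.

```python
def self_attention_macs(n: int, d: int) -> int:
    """MACs for Q @ K^T plus softmax(scores) @ V in full self-attention."""
    return 2 * n * n * d


def core_attention_macs(n: int, c: int, d: int) -> int:
    """MACs for two cross-attention steps (cores<-patches and patches<-cores)."""
    return 2 * (2 * n * c * d)


def savings_ratio(n: int, c: int, d: int = 768) -> float:
    """How many times cheaper the core-periphery attention maps are."""
    return self_attention_macs(n, d) / core_attention_macs(n, c, d)


def num_patches(image_side: int, patch_side: int = 16) -> int:
    """Patch count for a square image; the patch size of 16 is an assumption."""
    return (image_side // patch_side) ** 2


if __name__ == "__main__":
    for side in (224, 512, 1024):
        n = num_patches(side)
        print(side, n, savings_ratio(n, c=16))
```

Under these assumptions the ratio simplifies to $N / (2C)$, so the advantage grows linearly with the number of patches, while the dependence on $C$ and $D$ noted above remains.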
Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal-agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level, where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC $0.43$ to $0.88$. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance; the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
Primary: Santander AI Lab
All Institutions: Santander AI Lab
This paper has significant broader impact potential, particularly for responsible AI, LLM governance, and model risk management in regulated industries like finance, healthcare, and legal. The identification of a "principal-agent failure" in LLM governance and the proposal of a concrete solution via "mechanical enforcement" is highly relevant. The finding of "governance-task decoupling" is a critical conceptual contribution, challenging the common reliance on task accuracy as a proxy for compliance in AI systems. This could lead to a paradigm shift in how regulated AI systems are designed, evaluated, and audited, promoting safer and more trustworthy deployments of LLMs in high-stakes environments. The proposed governance metrics could also become a standard for evaluating compliance beyond accuracy. This paper introduces a compelling framework for addressing the principal-agent problem in LLM governance within regulated financial systems. By proposing novel rationale-level governance metrics and a "mechanical enforcement" architecture, the authors demonstrate significant improvements in compliance and task accuracy, crucially highlighting a governance-task decoupling that underscores the insufficiency of accuracy alone for regulated AI.
Based solely on the abstract and section headers, the paper proposes a novel approach to LLM governance in regulated financial workflows. The core methodology involves introducing "five governance metrics" to quantify policy compliance at the rationale level, moving beyond mere task accuracy. The key innovation appears to be "mechanical enforcement," described as "four primitives operating outside the model's interpretive loop." This architectural separation aims to address the principal-agent failure inherent when an LLM interprets its own natural-language policies. While the abstract doesn't detail these primitives or metrics, the concept of external, mechanical constraints on LLM behavior for compliance is a promising direction. The focus on the "rationale level" for auditable decisions is also a critical methodological shift for regulated domains. The abstract mentions a "causal ablation" to confirm the necessity of each primitive, suggesting a rigorous experimental design for the proposed mechanisms.
The abstract reports experiments conducted in a "synthetic banking domain," which is a reasonable starting point for high-stakes financial applications. The comparison is between "text-only governance" (the baseline) and the proposed "mechanical enforcement." The results presented are quantitatively strong: mechanical enforcement reduces the rate of non-decision-relevant deferrals by 73%, more than doubles deferral information content, and significantly raises task accuracy (MCC from 0.43 to 0.88). A crucial finding is the "governance-task decoupling," where mechanical enforcement preserves governance quality even as task performance drops under structural stress, unlike text-only governance which degrades on both. This suggests that the proposed method effectively separates compliance from raw task performance, a vital insight for regulated AI. The mention of "comparable CDL" (likely a compliance-related metric) for LLM-generated rationales under both methods, with gains coming from removing clear-cut decisions from the model's control, points to a sophisticated understanding of where the improvements originate. Without the full paper, details on the synthetic domain, dataset construction, specific metrics, and statistical significance cannot be assessed.
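For readers less familiar with the Matthews correlation coefficient (MCC) used as the task-accuracy measure, the sketch below computes it from a binary confusion matrix. The confusion-matrix counts are hypothetical and chosen only so that the formula lands on the reported values of 0.43 and 0.88; they are not taken from the paper, and the paper's task may not be a balanced binary one.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal is empty (the usual convention for an
    undefined denominator).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical balanced test sets of 400 cases (200 per class).
    print(mcc(tp=143, fp=57, fn=57, tn=143))  # illustrative "text-only" case
    print(mcc(tp=188, fp=12, fn=12, tn=188))  # illustrative "mechanical" case
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level agreement, and it uses all four cells of the confusion matrix. That makes a move from 0.43 to 0.88 a substantial change even before the governance metrics are considered.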
The abstract does not provide any information regarding implementation details, code availability, or specific datasets beyond "synthetic banking domain." Therefore, based on the provided text, reproducibility cannot be assessed and is presumed to be low without further information.
The primary limitation explicitly stated is the use of a "synthetic banking domain." While useful for initial validation in a high-stakes area, the generalizability of the "four primitives" and the overall mechanical enforcement approach to real-world, complex financial scenarios or other regulated domains remains to be demonstrated. The abstract also doesn't elaborate on the complexity or overhead of implementing these mechanical enforcement primitives. Without the full paper, it's impossible to assess the specific constraints or assumptions made in the methodology or the potential trade-offs (e.g., computational cost, development effort) of implementing such a system.