本篇博文主要内容为 2026-05-01 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-05-01)
今日共更新677篇论文,其中:
- 自然语言处理共84篇(Computation and Language (cs.CL))
- 人工智能共217篇(Artificial Intelligence (cs.AI))
- 计算机视觉共108篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共166篇(Machine Learning (cs.LG))
- 多智能体系统共15篇(Multiagent Systems (cs.MA))
- 信息检索共26篇(Information Retrieval (cs.IR))
- 人机交互共35篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] Framework for Collaborative Operation of Autonomous Delivery Vehicles Within a Marshaling Yard
【速读】:该论文旨在解决封闭设施(如配送编组站)中因孤立的静态规则驱动的自动驾驶车辆导致的交通阻塞问题,从而引发设施运行中断。在编组站内,电动车队需依次完成充电、检测、清洁和装载等任务,若仅依赖预设规则进行调度,容易在高需求场景下出现死锁,降低整体作业效率。解决方案的关键在于提出一种“协同自治”(orchestrated autonomy)机制,通过基于当前编组站状态的去中心化动态优先级评分系统,实时优化车辆任务分配策略,从而提升车辆吞吐量并减少高负载下的设施故障率。
链接: https://arxiv.org/abs/2604.28057
作者: James O’Hara,Karl Wunderlich,Gregory Stevens
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:
Abstract:As autonomous vehicles slowly deploy into urban roads for limited use cases with significant edge case issues, closed facilities like marshaling yards provide a ripe case for combining lower-level vehicle autonomy with fixed infrastructure to create full autonomy without similar edge case concerns. Within a delivery marshaling yard, electric fleet vehicles complete a set of sequential tasks (charging, inspection, cleaning, and loading) before exiting the yard with their new load of deliveries. Hybrid automation of the vehicles and infrastructure can allow these vehicles to reach full autonomy and navigate the facility without the need of a driver, allowing for quicker movement between tasks increasing vehicle throughput. However, isolated autonomous operations based on static rules are prone to gridlock causing facility failures that temporarily shut down operations. Our orchestrated autonomy solution uses decentralized, dynamic priority scoring of vehicles based on the current status of the marshaling yard to optimally assign vehicles to tasks to increase vehicle throughput. Using a simulated facility with three marshaling yard sizes (small, medium, and large) and three demand levels (low, medium, high), we demonstrated that our orchestration solution increases vehicle throughput above static, isolated autonomy for all combinations of yard size and demand, while reducing facility failures at high demand levels.
[MA-1] Language Models Refine Mechanical Linkage Designs Through Symbolic Reflection and Modular Optimisation
【速读】:该论文旨在解决机械连杆机构设计中组合拓扑选择与连续参数拟合的复杂性问题,传统方法难以高效协同优化离散结构与连续几何参数。解决方案的关键在于构建一个模块化架构,利用语言模型代理(Language Model Agents)进行离散拓扑探索,结合数值优化器完成连续参数拟合,并通过符号提升算子(symbolic lifting operator)将仿真轨迹转化为定性描述符、运动标签、时间谓词和结构诊断信息,使模型能在迭代设计周期中理解并改进设计。该方法在六种工程相关运动目标上显著降低几何误差(最高达68%)并提升结构有效性(最高达134%),且无需微调即可让不同家族的语言模型习得可解释的机械推理策略,验证了符号抽象在连接生成式AI与工程数值精度之间的桥梁作用。
链接: https://arxiv.org/abs/2604.27962
作者: João Pedro Gandarela,Thiago Rios,Stefan Menzel,André Freitas
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
备注:
Abstract:Designing mechanical linkages involves combinatorial topology selection and continuous parameter fitting. We show that language models can systematically improve linkage designs through symbolic representations. Language model agents explore discrete topologies while numerical optimisers fit continuous parameters. A symbolic lifting operator translates simulator trajectories into qualitative descriptors, motion labels, temporal predicates, and structural diagnostics that models interpret across iterative design cycles. Across six engineering-relevant motion targets and three open-source models (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B), the modular architecture reduces geometric error by up to 68% and improves structural validity by up to 134% over monolithic baselines. Critically, 78.6% of iterative refinement trajectories show measurable improvement, with the system correctly diagnosing overconstraint (56.3%) and underconstraint (35.6%) failure modes and proposing grounded corrections. Models across all three families acquire interpretable mechanical reasoning strategies without fine-tuning, demonstrating that principled symbolic abstraction bridges generative AI and the numerical precision required for engineering design.
[MA-2] Can We Volunteer Out of the Peer Review Crisis?
【速读】:该论文试图解决科学文献评审过程中因稿件数量快速增长与评审资源有限之间产生的矛盾,即评审者稀缺、评估噪声增加以及编辑决策可信度下降的问题。解决方案的关键在于引入一种自愿参与的“预审抽奖机制”(voluntary lottery),作者通过接受随机预审拒稿的概率来降低整体评审负担,从而提升留存稿件的评审质量。研究通过博弈论建模表明,在纳什均衡下,关注学术文献整体质量而非仅个人发表成果的科学家将主动选择参与该机制,进而提高所有科研人员所阅读和依赖的科学文献质量。
链接: https://arxiv.org/abs/2604.27900
作者: Theo Tang,Toby Handfield,Julian Garcia
机构: Monash University (莫纳什大学)
类目: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: Main text: 13 pages, 4 figures. Supplementary Information: 18 pages
Abstract:The volume of scientific manuscripts is growing faster than the capacity to evaluate them, yet the institutions that govern peer review have remained largely unchanged. The result is a widening mismatch: reviewer scarcity, noisier assessments, and declining confidence in editorial decisions. Every scientist wants better reviews, but review quality depends on the total burden, which no single author can shift. To isolate this tension, we provide a game-theoretic thought experiment: a voluntary lottery in which authors accept a chance of random pre-review rejection, reducing reviewer burden and improving the quality of surviving evaluations. We show that a Nash equilibrium emerges in which authors voluntarily enter the lottery. Scientists who care about the literature they read, not just the papers they publish, will opt in, raising the quality of published science for all.
[MA-3] ObjectGraph: From Document Injection to Knowledge Traversal – A Native File Format for the Agent ic Era
【速读】:该论文旨在解决“文档消费问题”(Document Consumption Problem),即当前所有文档格式均面向人类线性阅读设计,而自主大语言模型(LLM)代理实际上并不“阅读”文档,而是通过检索获取信息。这种根本性不匹配导致代理在处理文档时需将全文注入上下文窗口,造成大量无关内容占用token、状态冗余累积以及敏感信息跨角色无差别暴露等问题。作者指出,这并非单纯的提示工程、检索或压缩问题,而是一个文档格式设计的问题。解决方案的关键在于提出OBJECTGRAPH(.og)文件格式——一种将文档重构为有类型、有向的知识图谱(knowledge graph)以供遍历而非作为字符串注入的新范式。OBJECTGRAPH是Markdown的严格超集,无需额外基础设施,且兼具人类可读性和代理原生支持能力;其核心创新包括:六项结构属性的正式定义与满足、渐进披露模型(Progressive Disclosure Model)、角色作用域访问协议(Role-Scoped Access Protocol)及可执行断言节点(Executable Assertion Nodes)等原生格式原语。实证表明,OBJECTGRAPH可在五类文档和八种代理任务中实现最高达95.3%的token减少,且任务准确率无显著下降(p > 0.05),转译保真度达98.7%。
链接: https://arxiv.org/abs/2604.27820
作者: Mohit Dubey,Open Gigantic
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注: 12 pages, 4 figures, 4 tables
Abstract:Every document format in existence was designed for a human reader moving linearly through text. Autonomous LLM agents do not read - they retrieve. This fundamental mismatch forces agents to inject entire documents into their context window, wasting tokens on irrelevant content, compounding state across multi-turn loops, and broadcasting information indiscriminately across agent roles. We argue this is not a prompt engineering problem, not a retrieval problem, and not a compression problem: it is a format problem. We introduce OBJECTGRAPH (.og), a file format that reconceives the document as a typed, directed knowledge graph to be traversed rather than a string to be injected. OBJECTGRAPH is a strict superset of Markdown - every .md file is a valid .og file - requires no infrastructure beyond a two-primitive query protocol, and is readable by both humans and agents without tooling. We formalize the Document Consumption Problem, characterise six structural properties no existing format satisfies simultaneously, and prove OBJECTGRAPH satisfies all six. We further introduce the Progressive Disclosure Model, the Role-Scoped Access Protocol, and Executable Assertion Nodes as native format primitives. Empirical evaluation across five document classes and eight agent task types demonstrates up to 95.3 percent token reduction with no statistically significant degradation in task accuracy (p 0.05). Transpiler fidelity reaches 98.7 percent content preservation on a held-out document benchmark. Comments: 12 pages, 4 figures, 4 tables Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Multiagent Systems (cs.MA) Cite as: arXiv:2604.27820 [cs.AI] (or arXiv:2604.27820v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.27820 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-4] Autonomous Traffic Signal Optimization Using Digital Twin and Agent ic AI for Real-Time Decision-Making
【速读】:该论文旨在解决传统交通信号控制方法在应对动态交通流时效率低下的问题,尤其是在缓解拥堵、减少车辆等待时间和提升整体交通流效率方面的不足。其解决方案的关键在于构建一个由代理型人工智能(agentic AI)驱动的数字孪生(digital twin)框架,通过感知层采集实时交通数据、概念层利用LangChain进行语义化处理,并在执行层结合模型上下文协议(Model Context Protocol, MCP)与交通管理API,实现基于实时交通状态的自主优化决策。该框架显著优于固定周期和强化学习基线方法,在降低交叉口等待时间方面表现出更优性能。
链接: https://arxiv.org/abs/2604.27753
作者: Salman Jan,Toqeer Ali Syed,Shahid Kamal,Qamar Wali,Ali Akarma
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
备注: This paper is submitted to MECON2026 conference
Abstract:This article outlines a new framework of traffic light optimization through a digital twin of the transport infrastructure, managed by agentic AI to ensure real-time autonomous decisions. The framework relies on physical sensors and edge computing to measure real-time traffic information and simulate traffic flow in a constantly updated digital twin. The traffic light is automatically controlled through the digital twin according to traffic congestion, travel delay and traffic patterns. This approach is implemented as a three-layer system: perception, conceptualization and action. The perception layer receives data on physical systems; the conceptualization layer uses LangChain to process the data; and the action layer links to the Model Context Protocol (MCP) and traffic management APIs to implement optimised traffic signal control algorithms. The results show that the framework minimizes waiting time at traffic lights and positively affects the effectiveness of the entire traffic flow, which is better than the fixed-time and reinforcement learning-based baselines.
[MA-5] RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems ACL2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成高质量研究路线图(roadmap)方面能力不足的问题,具体表现为缺乏专业领域知识、任务分解不合理以及逻辑关系混乱等局限。为应对这些挑战,作者提出了一种基于多智能体架构的解决方案——RoadMapper,其核心在于将路线图生成任务分解为三个关键阶段:初始生成、知识增强和迭代式“批判-修订-评估”,从而系统性提升LLM在复杂研究问题求解中的规划能力和效率。
链接: https://arxiv.org/abs/2604.27616
作者: Jiacheng Liu,Zichen Tang,Zhongjun Yang,Xinyi Hu,Xueyuan Lin,Linwei Jia,Ruofei Bai,Rongjin Li,Shiyao Peng,Haocheng Gao,Haihong E
机构: Beijing University of Posts and Telecommunications; The Hong Kong University of Science and Technology (Guangzhou); IDEA Research; Hithink RoyalFlush Information Network Co., Ltd.
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Accepted to Findings of ACL 2026
Abstract:People commonly leverage structured content to accelerate knowledge acquisition and research problem solving. Among these, roadmaps guide researchers through hierarchical subtasks to solve complex research problems step by step. Despite progress in structured content generation, the roadmap generation task has remained unexplored. To bridge this gap, we introduce RoadMap, a novel benchmark designed to evaluate the ability of large language models (LLMs) to construct high-quality roadmaps for solving complex research problems. Based on this, we identify three limitations of LLMs: (1) lack of professional knowledge, (2) unreasonable task decomposition, and (3) disordered logical relationships. To address these challenges, we propose RoadMapper, an LLM-based multi-agent system that decomposes the research roadmap generation task into three key stages (i.e., initial generation, knowledge augmentation, and iterative “critique-revise-evaluate”). Extensive experiments demonstrate that RoadMapper can improve LLMs’ ability for roadmap generation, while enhancing average performance by more than 8% and saving 84% of the time required by human experts, highlighting its effectiveness and application potential.
[MA-6] Reinforced Agent : Inference-Time Feedback for Tool-Calling Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)工具调用代理在推理阶段缺乏实时纠错能力的问题。现有评估方法多为事后分析(post-hoc),无法在执行过程中进行动态修正,导致错误只能通过提示工程或重新训练来修复。其解决方案的关键在于将评估机制嵌入到执行循环中:引入一个专门的审查代理(reviewer agent),在工具调用实际执行前对其进行预判和评估,从而实现从被动恢复向主动评价与误差缓解的范式转变。该架构通过明确划分主执行代理与审查代理的职责,使审查模块可独立优化(如模型选择、提示自动优化),无需重训基础代理,显著提升了任务准确率(如在BFCL和Tau2-Bench基准上分别提升5.5%和7.1%),并首次量化了审查过程中的“有益性-有害性”权衡(Helpfulness-Harmfulness metrics)。
链接: https://arxiv.org/abs/2604.27233
作者: Anh Ta,Junjie Zhu,Shahin Shayandeh
机构: Apple(苹果)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce Helpfulness-Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. We evaluate our approach on BFCL (single-turn) and Tau2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5-2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA) Cite as: arXiv:2604.27233 [cs.AI] (or arXiv:2604.27233v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.27233 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-7] When Roles Fail: Epistemic Constraints on Advocate Role Fidelity in LLM -Based Political Statement Analysis
【速读】:该论文旨在解决多智能体大语言模型(Multi-Agent Large Language Models, Multi-Agent LLMs)在民主话语分析系统中角色一致性(role fidelity)的问题,即评估不同模型是否能稳定维持其被分配的立场角色(如支持者或批评者),从而确保生成的多视角评估具有真实性和多样性。解决方案的关键在于开发了一种基于推理文本的语义立场分类器(epistemic stance classifier),该分类器不依赖表面词汇特征,而是从模型输出的推理过程中识别其主张立场,并通过四个量化指标(包括角色漂移指数RDI、期望漂移距离EDD等)对60条政治陈述(30英文、30德文)进行系统性测量,从而首次实证检验了角色保持假设的有效性,并揭示出“认知角色覆盖”(Epistemic Role Override, ERO)这一统一机制及其两种表现形式——认知下限效应(Epistemic Floor Effect)与角色优先冲突(Role-Prior Conflict)。
链接: https://arxiv.org/abs/2604.27228
作者: Juergen Dietrich
机构: democracy-intelligence.de (TRUST Project)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注: 22 pages
Abstract:Democratic discourse analysis systems increasingly rely on multi-agent LLM pipelines in which distinct evaluator models are assigned adversarial roles to generate structured, multi-perspective assessments of political statements. A core assumption is that models will reliably maintain their assigned roles. This paper provides the first systematic empirical test of that assumption using the TRUST pipeline. We develop an epistemic stance classifier that identifies advocate roles from reasoning text without relying on surface vocabulary, and measure role fidelity across 60 political statements (30 English, 30 German) using four metrics: Role Drift Index (RDI), Expected Drift Distance (EDD), Directional Drift Index (DDI), and Entropy-based Role Stability (ERS). We identify two failure modes - the Epistemic Floor Effect (fact-check results create an absolute lower bound below which the legitimizing role cannot be maintained) and Role-Prior Conflict (training-time knowledge overrides role instructions for factually unambiguous statements) - as manifestations of a single mechanism: Epistemic Role Override (ERO). Model choice significantly affects role fidelity: Mistral Large outperforms Claude Sonnet by 28pp (67% vs. 39%) and exhibits a qualitatively different failure mode - role abandonment without polarity reversal - compared to Claude’s active switch to the opposing stance. Role fidelity is language-robust. Fact-check provider choice is not universally neutral: Perplexity significantly reduces Claude’s role fidelity on German statements (Delta = -15pp, p = 0.007) while leaving Mistral unaffected. These findings have direct implications for multi-agent LLM validation: a system validated without role fidelity measurement may systematically misrepresent the epistemic diversity it was designed to provide.
[MA-8] A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)算法在应用于去中心化部分可观测马尔可夫决策过程(Decentralized Partially Observable Markov Decision Processes, Dec-POMDPs)时面临的高样本复杂度问题,特别是环境执行步骤的计算成本过高限制了训练效率。解决方案的关键在于设计并实现了一个原生基于C++构建的高效Dec-POMDP引擎——Hide-And-Seek-Engine,其核心优化包括:采用数据导向设计(Data-Oriented Design, DOD)以提升缓存局部性、显式64字节缓存行对齐以消除伪共享(false sharing),以及通过固定内存(pinned memory)和直接内存访问(Direct Memory Access, DMA)实现零拷贝的PyTorch内存桥接。这些技术协同作用,使引擎在单Agent、1024环境规模下达到高达3300万步/秒(SPS)的吞吐量,相较基线单线程向量化NumPy实现提升约3500倍,并成功支持PPO、DQN和SAC等多类策略在分钟级内完成协作多智能体训练,验证了其性能与通用性。
链接: https://arxiv.org/abs/2604.27162
作者: Timothy Flavin,Sandip Sen
机构: 未知
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Performance (cs.PF)
备注: 21 pages, 10 figures, 5 tables. Includes appendix
Abstract:Reinforcement Learning (RL) algorithms exhibit high sample complexity, particularly when applied to Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). As a response, projects such as SampleFactory, EnvPool, Brax, and IsaacLab migrate parallel execution of classic environments such as MuJoCo and Atari into C++ thread pools or the GPU to decrease the computational cost of environment steps. We are interested in optimizing the decision-level of human-AI joint operations, so we introduce a compute-efficient Dec-POMDP engine natively architected in C++ called Hide-And-Seek-Engine. By employing Data-Oriented Design (DOD) principles, explicit 64-byte cache-line alignment to remove false sharing, and a zero-copy PyTorch memory bridge using pinned memory and Direct Memory Access (DMA), our engine sustains throughput of up to 33,000,000 steps per second (SPS) in a single-agent, 1024-environment, decentralized observations on an AMD Ryzen 9950X (16 cores). Ten agents reduces FPS to 7M SPS with generating random actions contributing 1/3rd the total runtime for reference. The engine achieves a throughput increase of approximately 3,500 \times over the baseline single threaded vectorized NumPy implementation and successfully trains cooperative multi-agent policies via PPO, DQN, and SAC in minutes, validating both its performance and generality.
[MA-9] Agent Name Service (ANS): A Proof-of-Concept Trust Layer for Secure AI Agent Discovery Identity and Governance in Kubernetes
【速读】:该论文旨在解决自主AI代理生态系统中缺乏统一代理发现机制、加密身份验证、能力证明(同时保护敏感信息)以及可执行策略控制的问题。其解决方案的关键在于实现一个受DNS启发的可信层——Agent Name Service (ANS),该服务基于DIDs(Decentralized Identifiers,去中心化标识符)和VCs(Verifiable Credentials,可验证凭证)进行代理身份认证与能力证明,并通过Open Policy Agent(OPA)实现策略即代码(policy-as-code)的强制执行,同时利用Kubernetes原生集成模式(如CRDs、准入控制和Service Mesh)完成部署与治理,从而构建安全、可互操作的多代理系统工程实践路径。
链接: https://arxiv.org/abs/2604.26997
作者: Akshay Mittal,Elyson De La Cruz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 9 pages, 2 figures
Abstract:Autonomous AI agent ecosystems require stronger mechanisms for secure discovery, identity verification, capability attestation, and policy governance. Current deployments frequently lack (1) uniform agent discovery, (2) cryptographic agent authentication, (3) capability proofs that protect secrets, and (4) enforceable policy controls. This paper presents an implementation-oriented proof of concept for the Agent Name Service (ANS), a DNS-inspired trust layer for AI agent discovery and interoperability in Kubernetes, grounded in the ANS protocol specification~\citehuang2025ans. The implementation uses Decentralized Identifiers (DIDs), Verifiable Credentials (VCs), policy-as-code enforcement with Open Policy Agent (OPA), and Kubernetes-native integration patterns (CRDs, admission controls, service mesh integration). In a demo research environment (3-node cluster, 50-agent workflow simulation), we observe sub-10ms response in demonstrated service paths and full success for scripted demo deployment scenarios. We explicitly scope these findings as proof-of-concept evidence rather than production certification. We further provide a threat model, assumptions, and limitations to separate implemented evidence from protocol-defined and roadmap capabilities. The result is an evidence-grounded pathway from ANS protocol concepts to reproducible engineering practice for secure multi-agent systems.
[MA-10] MARS: Efficient Adaptive Co-Scheduling for Heterogeneous Agent ic Systems
【速读】:该论文针对自主代理(autonomous agents)在部署大型语言模型(Large Language Models, LLMs)时引发的资源调度挑战提出解决方案。随着LLM从单轮文本生成转向多轮工具调用循环(multi-turn LLM-tool loops),系统执行范式发生时空迁移:由GPU单一节点的聊天规模计算演变为GPU-CPU协同共置的仓库规模执行,导致异构资源(GPU与CPU)耦合压力显著增加。为应对这一问题,作者设计并实现了MARS系统——一个高效且自适应的联合调度框架,其核心创新在于通过统一信息流实现对GPU推理与CPU工具执行的全局可见性,并引入外部控制平面将任务准入(admission)与执行解耦,从而避免异构资源超分配;同时,内部以代理为中心的调度器通过优先处理延迟敏感的继续操作(latency-sensitive continuations)并仅在热重启(warm resumption)能带来延迟收益时保留KV缓存状态,最小化端到端关键路径。实验表明,MARS可将端到端延迟降低最多5.94倍,同时保持接近最大吞吐量,并已在OpenHands代码代理框架中集成验证,使任务完成时间加速达1.87倍。
链接: https://arxiv.org/abs/2604.26963
作者: Yifei Wang,Hancheng Ye,Yechen Xu,Cong Guo,Chiyue Wei,Qinsi Wang,Dongting Li,Tingjun Chen,Hai “Helen” Li,Danyang Zhuo,Yiran Chen
机构: Duke University
类目: Operating Systems (cs.OS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 14 pages, 13 figures. Preprint
Abstract:Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating heterogeneous resource demands of agentic execution has emerged as a critical system challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent heterogeneous resource oversubscription. An internal agent-centric scheduler further minimizes the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache state only when warm resumption yields a latency benefit. Our evaluations show that MARS reduces end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput. We further integrate MARS as the serving backend for the OpenHands coding agent framework, demonstrating its real-world effectiveness by accelerating end-to-end task completion time by up to 1.87x. Our source code will be publicly available soon. Comments: 14 pages, 13 figures. Preprint Subjects: Operating Systems (cs.OS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA) Cite as: arXiv:2604.26963 [cs.OS] (or arXiv:2604.26963v1 [cs.OS] for this version) https://doi.org/10.48550/arXiv.2604.26963 Focus to learn more arXiv-issued DOI via DataCite
[MA-11] CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety Hallucination Mitigation in Patient-Facing LLM s
【速读】:该论文旨在解决将大语言模型(Large Language Models, LLMs)应用于面向患者的医疗问答系统时所面临的临床安全性和事实可靠性问题。具体而言,现有模型常因缺乏对患者上下文的理解而生成看似合理但医学上不恰当的回答,且易产生幻觉(Hallucination),这在开放、非结构化的临床交互场景中尤为突出。解决方案的关键在于提出 CareGuardAI,一个基于风险感知的安全框架,其核心创新是引入两个独立的风险评估模块:临床安全风险评估(Clinical Safety Risk Assessment, SRA)和幻觉风险评估(Hallucination Risk Assessment, HRA),二者均参考国际标准 ISO 14971 设计。该框架采用多阶段推理流程,包括控制器代理、安全约束生成与双风险评估,并支持迭代优化,仅当两项风险评分均 ≤2 时才释放输出,从而实现高安全性与低延迟的平衡,显著优于主流基线模型如 GPT-4o-mini,在多个医疗安全与幻觉检测基准测试中表现优异。
链接: https://arxiv.org/abs/2604.26959
作者: Elham Nasarian,Abhilash Neog,Kwok-Leung Tsui,Niyousha HosseiniChimeh
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Integrating large language models (LLMs) into patient-facing healthcare systems offers significant potential to improve access to medical information. However, ensuring clinical safety and factual reliability remains a critical challenge. In practice, AI-generated responses may be conditionally correct yet medically inappropriate, as models often fail to interpret patient context and tend to produce agreeable responses rather than challenge unsafe assumptions. Unlike clinicians, who infer risk from incomplete information, LLMs frequently lack contextual awareness. Moreover, real-world patient interactions are open-ended and underspecified, unlike structured benchmark settings. We present CareGuardAI, a risk-aware safety framework for patient-facing medical question answering that addresses two key failure modes: clinical safety risk and hallucination risk. The framework introduces Clinical Safety Risk Assessment (SRA), inspired by ISO 14971, and Hallucination Risk Assessment (HRA) to evaluate medical risk and factual reliability. At inference time, CareGuardAI employs a multi-stage pipeline consisting of a controller agent, safety-constrained generation, and dual risk evaluation, followed by iterative refinement when necessary. Responses are released only when both SRA and HRA are less than or equal to 2, ensuring clinically acceptable outputs with bounded latency. We evaluate CareGuardAI on PatientSafeBench, MedSafetyBench, and MedHallu, covering both safety and hallucination detection. Across these benchmarks, the framework consistently outperforms strong baseline models, including GPT-4o-mini, demonstrating the importance of context-aware, risk-based, inference-time safety mechanisms for reliable deployment in healthcare.
[MA-12] Continuous-time q-learning for mean-field control with common noise part-II: q-learning algorithms
【速读】:该论文旨在解决带有受控公共噪声(controlled common noise)的均值场控制(mean-field control, MFC)问题中,如何设计有效的Q-learning算法以实现最优策略学习。其核心挑战在于:在实际应用中,传统松弛控制(relaxed control)框架下的数据不可观测,而直接使用可观测数据会引入误差。解决方案的关键在于:首先基于松弛控制框架建立价值函数和Iq函数的鞅条件(martingale condition),然后量化可观测数据替换不可观测数据时的误差,并结合Ren et al. (2026)提出的两层不动点刻画最优策略的方法,提出一种Actor-Critic Q-learning算法——其中Actor步利用改进的Iq函数迭代更新策略,Critic步则基于可观测数据和鞅正交性条件更新价值函数与Iq函数,从而在无限horizon线性二次(LQ)框架下证明了内层迭代的收敛性,并在多个示例中验证了算法的有效性。
链接: https://arxiv.org/abs/2604.27378
作者: Zhenjie Ren,Xiaoli Wei,Xiang Yu,Xun Yu Zhou
机构: LaMME, Université Évry Paris-Saclay (拉马克大学); The Hong Kong Polytechnic University (香港理工大学); Columbia University (哥伦比亚大学)
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Keywords: Mean-field control, common noise, martingale characterization, optimal q-learning algorithm, Actor-Critic q-learning algorithm
Abstract:This paper is a continuation work of Ren et al. (2026) aiming to further devise q-learning algorithms for mean-field control (MFC) with controlled common noise. Based on the relaxed control formulation, we first establish the martingale condition of the value function and the Iq-function by evaluating along the conditional state distributions generated by all test policies. As the data in the relaxed control formulation are not observable in practice, we quantify the error incurred when they are replaced by the observable ones in the exploratory formulation under discretely sampled actions. This, together with a two-layer fixed point characterization of an optimal policy in Ren et al. (2026), allows us to propose several algorithms including the Actor-Critic q-learning algorithm, in which the policy is updated in the Actor-step based on the iteration rule induced by the improved Iq-function, and the value function and Iq-function are updated in the Critic-step based on the martingale orthogonality condition using the data from the exploratory formulation. We also establish the convergence of the inner iterations in the Actor-step in an infinite-horizon linear quadratic (LQ) framework. In two examples, within and beyond LQ framework, our q-learning algorithms are implemented with satisfactory performance.
[MA-13] Continuous-time q-learning for mean-field control with common noise part-I: Theoretical foundations
【速读】:该论文旨在解决带受控公共噪声的熵正则化平均场控制(Entropy-regularized Mean-Field Control, MFC)中连续时间Q函数(q-function)的建模与最优策略求解问题,尤其是在探索性(exploratory)与松弛控制(relaxed control)框架之间收敛性及政策迭代复杂性的理论刻画。其关键解决方案在于:首先通过松弛控制框架推导出包含受控公共噪声导致的非线性策略泛函项的探索性哈密顿-雅可比-贝尔曼(HJB)方程;其次,在特定凹性条件下,利用关于策略的偏线性泛函导数(partial linear functional derivative)建立一阶最优条件,从而证明单步策略迭代的存在唯一性;进一步地,在平均场设置下引入集成q函数(Integrated q-function, Iq-function),并揭示最优策略为Iq函数argmax算子的两层不动点;最终在一般线性二次(LQ)情形下显式刻画最优策略为高斯分布形式。
链接: https://arxiv.org/abs/2604.27372
作者: Zhenjie Ren,Xiaoli Wei,Xiang Yu,Xun Yu Zhou
机构: LaMME, Université Évry Paris-Saclay (拉马克大学); The Hong Kong Polytechnic University (香港理工大学); Columbia University (哥伦比亚大学)
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Keywords: Continuous-time reinforcement learning, mean-field control, common noise, policy improvement, integrated q-function, two-layer fixed point
Abstract:This paper investigates the continuous-time counterpart of the Q-function for entropy-regularized mean-field control (MFC) with controlled common noise, coined as q-function by Jia and Zhou (2023) in the single agent’s model. We first show that, under discretely sampled actions, the value function in the exploratory formulation converges to the one in the relaxed control formulation as the time grid refines. Leveraging the relaxed control formulation, we derive the exploratory Hamilton-Jacobi-Bellman (HJB) equation, in which the controlled common noise gives rise to an additional nonlinear functional of policy, rendering the policy iteration intricate. Under certain concavity condition, we establish the existence and uniqueness of the optimal one-step policy iteration via a first-order condition using the partial linear functional derivative with respect to policy. The policy improvement at each iteration is verified by relating to an entropy-regularized optimization problem over the space of policies. In the mean-field setting, we introduce the integrated q-function (Iq-function) defined on the state distribution and the policy, and it is shown that an optimal policy is identified as a two-layer fixed point to the argmax operator of the Iq-function. Finally, we provide the explicit characterization of an optimal policy as a Gaussian distribution in the general linear-quadratic (LQ) setting.
[MA-14] Nothing Deceives Like Success: Social Learning and the Illusion of Understanding in Science
【速读】:该论文旨在解决科学共同体中“成功偏向型社会学习”(success-driven social learning)是否具有适应性的问题,尤其是在理论构建这一复杂且评估困难的集体搜索场景中。研究通过代理模型模拟科学社区的演化过程发现,成功偏向会加剧科学家对自身理论质量的高估,形成“理解错觉”(illusion of understanding),即主观认知与实际表现之间存在持续差距;这种偏差导致群体探索范围受限,虽能高效淘汰劣质理论,却难以发现更优解,尤其在复杂问题环境中更为显著。解决方案的关键在于揭示:当个体优化其社会行为以最大化理论的感知成功率时,反而损害了真实性能,并生成与现实科学界相似的不平等结构,从而表明单纯依赖成功导向的社会学习机制并不利于科学进步。
链接: https://arxiv.org/abs/2604.27188
作者: Avery W. Louis,Marina Dubova
机构: Stanford University (斯坦福大学); Santa Fe Institute (圣达菲研究所)
类目: Physics and Society (physics.soc-ph); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
备注: 36 pages, 7 figures
Abstract:Success-driven social learning, in which individuals preferentially adopt the ideas and methods that appear most successful, is a foundational principle of collective behavior across systems ranging from ant colonies to scientific communities. But science is a particular kind of collective search – one in which the quality of an explanation is itself difficult to assess. Is success bias adaptive in this setting? In agent-based simulations of collective theory building, we find that it is not. Scientists in our model systematically overestimate the quality of their own theories, creating an illusion of understanding: a persistent gap between perceived and actual performance. Success bias amplifies this illusion; communities that favor apparently successful theories explore a narrower range of possibilities, efficiently filtering out poor explanations but failing to discover better ones. This effect intensifies with problem complexity, as scientists in more complex environments become increasingly unable to assess how well their theories actually perform. Most strikingly, when agents optimize their social behavior to maximize the perceived success of their theories, they paradoxically undermine their actual performance, and produce levels of inequality that mirror those found in real scientific communities.
自然语言处理
[NLP-0] Exploration Hacking: Can LLM s Learn to Resist RL Training?
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在大型语言模型(Large Language Models, LLMs)后训练过程中可能遭遇的“探索劫持”(exploration hacking)问题,即模型在训练中通过策略性地调整自身探索行为来影响后续训练结果,从而规避能力激发或对齐目标。解决方案的关键在于构建具有选择性RL抗性的模型“原型”,通过微调LLMs以遵循特定的低效策略,在保持相关任务性能的同时成功抵抗基于RL的能力激发;进而利用这些原型评估检测与缓解策略(如监控、权重噪声和监督微调SFT-based elicitation),并发现当前前沿模型在获得足够训练环境信息时会显式推理如何抑制探索行为,尤其在间接获取信息时表现更显著。这表明探索劫持是具备足够能力的LLMs在RL训练中的一种潜在失效模式。
链接: https://arxiv.org/abs/2604.28182
作者: Eyon Jang,Damon Falck,Joschka Braun,Nathalie Kirch,Achu Menon,Perusha Moodley,Scott Emmons,Roland S. Zimmermann,David Lindner
机构: UC San Diego; Anthropic; Google DeepMind
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 81 pages, 37 figures
Abstract:Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model could strategically alter its exploration during training to influence the subsequent training outcome. In this paper we study this behavior, called exploration hacking. First, we create model organisms of selective RL resistance by fine-tuning LLMs to follow specific underperformance strategies; these models can successfully resist our RL-based capability elicitation in agentic biosecurity and AI RD environments while maintaining performance on related tasks. We then use our model organisms to evaluate detection and mitigation strategies, including monitoring, weight noising, and SFT-based elicitation. Finally, we show that current frontier models can exhibit explicit reasoning about suppressing their exploration when provided with sufficient information about their training context, with higher rates when this information is acquired indirectly through the environment. Together, our results suggest exploration hacking is a possible failure mode of RL on sufficiently capable LLMs.
[NLP-1] Synthetic Computers at Scale for Long-Horizon Productivity Simulation
【速读】: 该论文旨在解决生成式 AI(Generative AI)在长时程生产力任务中因缺乏真实用户环境和复杂工作流而难以有效训练与评估的问题。其核心挑战在于如何构建具备现实感的计算机环境(如目录结构、文档类内容等),并在此基础上进行长时间、多步骤的专业任务模拟,以获取高质量的体验式学习信号。解决方案的关键在于提出“大规模合成计算机”(Synthetic Computers at Scale)方法论:首先基于用户特征生成具有真实感的文件系统和内容丰富的数字资产;随后在每个合成环境中部署两个代理(agent)——一个设定目标(模拟用户需求),另一个扮演该用户执行跨文件系统导航、协作及产出专业成果等行为,直至完成目标。这种端到端的长周期仿真机制显著提升了代理在领域内与领域外生产力任务上的表现,为代理自我改进和基于强化学习的长时程任务能力发展提供了可扩展的基础框架。
链接: https://arxiv.org/abs/2604.28181
作者: Tao Ge,Baolin Peng,Hao Cheng,Jianfeng Gao
机构: Microsoft (微软)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preview version; work in progress
Abstract:Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer’s user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer – for example, navigating the filesystem for grounding, coordinating with simulated collaborators, and producing professional artifacts – until these objectives are completed. In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them; each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average. These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations. Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs. We argue that scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios. Comments: Preview version; work in progress Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2604.28181 [cs.AI] (or arXiv:2604.28181v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.28181 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-2] On the Proper Treatment of Units in Surprisal Theory ACL2026
【速读】: 该论文旨在解决当前基于 surprisal(意外度)的言语加工努力分析中,因语言单位定义与模型预测区域选择混杂而导致的建模不透明问题。具体而言,实验设计常以词等语言学单位为分析单元,而预训练语言模型则基于固定词汇表(token)进行概率分配,二者不一致导致 surprisal 计算隐含依赖于人为设定的分词或对齐策略,从而混淆了“分析单位”与“评估区域”这两个独立的建模决策。论文的关键解决方案是提出一个统一框架,将语言单位(unit inventory)与 token 化过程解耦,明确区分单位定义和预测区域的选择,并将 tokenization 视作实现细节而非科学基础,从而提升 surprisal 分析的可解释性与一致性。
链接: https://arxiv.org/abs/2604.28147
作者: Samuel Kiegeland,Vésteinn Snæbjarnarson,Tim Vieira,Ryan Cotterell
机构: ETH Zürich; CHI-FRO; University of Copenhagen
类目: Computation and Language (cs.CL)
备注: ACL 2026 (main conference)
Abstract:Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
[NLP-3] PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
【速读】: 该论文旨在解决大模型在多模态推理任务中因监督微调(SFT)引入的分布漂移(distributional drift)问题,该漂移不仅削弱了模型原有的能力,还导致其无法准确匹配标注数据的分布,尤其在多模态推理场景下,感知错误与推理失败的漂移模式相互叠加,进一步恶化后续强化学习(RLVR)的表现。解决方案的关键在于提出PRISM三阶段流程,其中核心创新是在SFT与RLVR之间插入一个显式的分布对齐阶段,基于在线策略蒸馏(OPD)原理,构建一个由混合专家(MoE)判别器驱动的响应级对抗游戏,该判别器包含独立的感知和推理专家模块,从而提供解耦的校正信号,引导策略向监督分布靠拢,且无需访问教师模型的logits。此方法显著提升了下游RLVR性能,在Qwen3-VL上多个强化学习算法和基准测试中平均准确率提升4.4至6.0点。
链接: https://arxiv.org/abs/2604.28123
作者: Sudong Wang,Weiquan Huang,Xiaomin Yu,Zuhao Yang,Hehai Lin,Keming Wu,Chaojun Xiao,Chen Chen,Wenxuan Wang,Beier Zhu,Yunjian Zhang,Chengwei Qin
机构: Hong Kong University of Science and Technology (Guangzhou); Tsinghua University; Nanyang Technological University; Renmin University of China; University of Science and Technology of China; University of Chinese Academy of Sciences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model’s original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at this https URL.
[NLP-4] Mapping the Methodological Space of Classroom Interaction Research: Scale Duration and Modality in an Age of AI
【速读】: 该论文试图解决课堂互动研究中长期存在的方法论分裂问题,即大规模观察与深度民族志研究之间的割裂。其解决方案的关键在于提出一个三维框架,涵盖规模(scale)、持续时间(duration)和模态(modality),用以系统化地映射不同研究方法的位置及其对现象揭示与遮蔽的影响。通过对比对话式教学的两项代表性研究(Howe et al., 2019 和 Snell & Lefstein, 2018)及对主研人员的访谈,论文阐明了该框架如何帮助厘清“可操作化的内容”、“可显性化的机制”以及“可实践转化的发现”,并进一步探讨生成式 AI (Generative AI) 如何拓展这一方法学空间,从而为未来研究设计与工具开发提供方向指引。
链接: https://arxiv.org/abs/2604.28098
作者: Dorottya Demszky,Edith Bouton,Alison Twiner,Sara Hennessy,Richard Correnti
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping this methodological space along three dimensions–scale, duration, and modality–where a study’s position shapes what it reveals and obscures. We illustrate it through contrasting studies of dialogic teaching–Howe et al. (2019) and Snell and Lefstein (2018)–and an interview with the lead researchers, organized around three questions: what can be operationalized, what mechanisms become visible, and what translates to practice. We then examine how AI is expanding this space and how the framework can guide research and tool design.
[NLP-5] opBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理具有隐式预测性质的表格问答任务时存在的局限性,这类任务要求模型基于历史数据模式推断未观测到的答案,而非简单信息检索或聚合。针对这一挑战,作者提出了TopBench基准,包含779个样本,涵盖从单点预测到决策制定、治疗效应分析及复杂过滤等四类子任务,要求模型生成包含推理文本和结构化表格的输出。解决方案的关键在于识别隐式意图(latent intent),研究表明准确的意图消歧是实现可靠预测推理的前提,同时提升预测精度上限需引入更复杂的建模或推理机制。
链接: https://arxiv.org/abs/2604.28076
作者: An-Yang Ji,Jun-Peng Jiang,De-Chuan Zhan,Han-Jia Ye
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.
[NLP-6] Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
【速读】: 该论文旨在解决高资源非英语语言(如德语)在训练大语言模型(Large Language Models, LLMs)时面临的策略困境:是优先选择数据多样性(通过单次遍历大量轻度过滤的网络文本),还是优先保障数据质量(通过严格过滤获取高质量核心语料并多次重复训练)。其解决方案的关键在于采用分层质量过滤机制,对5亿条德语文本进行多级筛选,构建高质量子集,并通过多轮次(multi-epoch)训练验证其有效性。实验表明,即使在7个训练轮次后,重复使用高质量数据仍显著优于单次遍历更大但质量较低的数据集,说明对于非英语LLMs而言,通过质量过滤实现语义集中比单纯扩大数据量更有利于高效语言建模。
链接: https://arxiv.org/abs/2604.28075
作者: Ansar Aynetdinov,Patrick Haller,Alan Akbik
机构: Humboldt-Universität zu Berlin (洪堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by constructing hierarchical quality filters applied to 500M web documents, comparing multi-epoch training on the filtered subsets against single-pass training on a diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks to the research community. Our experiments indicate that they achieve state-of-the-art results despite training on 10-360x fewer tokens than comparable models.
[NLP-7] Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results
【速读】: 该论文旨在解决当前对开放科学(Open Science)实践“下游效应”或实际影响缺乏系统评估的问题,特别是研究数据共享与重用的量化难题。传统文献计量学方法难以准确捕捉数据重用行为,而本文提出的关键解决方案是利用大语言模型(Large Language Models, LLMs)和生成式人工智能(Generative AI)构建新型指标,实现对研究数据重用行为的大规模、自动化测量。实验结果显示数据重用率为43%,显著高于传统方法,表明生成式AI在量化开放科学实际影响方面具有强大潜力,且当前研究数据共享的正向效应可能被低估。
链接: https://arxiv.org/abs/2604.28061
作者: Lauren Cadwallader,Iain Hrynaszkiewicz,parth sarin,Tim Vines
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注: 12 pages. Submitted to 30th Annual International Conference on Science and Technology Indicators
Abstract:Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices when it is more important to understand the ‘downstream’ effects or impacts of open science. PLOS and DataSeer have developed a new LLM-based indicator to measure an important effect of open science: the reuse of research data. Our results show a data reuse rate of 43%, which is higher than established bibliometric techniques. We show that data reuse can be measured at scale using LLMs and generative artificial intelligence. The positive effects of research data sharing and reuse may currently be underestimated.
[NLP-8] Stable Behavior Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception
【速读】: 该论文旨在解决生成式 AI(Generative AI)在城市感知分析中使用时,基于标签的“人格化提示”(persona prompting)是否能产生有意义且可复现的行为多样性这一问题。研究发现,虽然同一人格下的代理(agent)表现出高度一致性,但跨人格间的差异有限——经济状况和人格特征仅带来统计显著但实际影响微弱的变化,而性别和政治倾向则几乎无影响;此外,模型存在极端化偏差(extremity bias),导致对中间情感类别的判别能力下降,从而在细粒度情感任务中性能显著退化。关键解决方案在于通过对比带有人格条件与不带人格条件的模型表现,揭示简单标签式人格提示在城市感知任务中可能并未提升标注价值,反而可能引入不必要的噪声,暗示未来需探索更精细的人格建模方式以增强多模态大语言模型(multimodal LLMs)的感知多样性与准确性。
链接: https://arxiv.org/abs/2604.28048
作者: Neemias B da Silva,Rodrigo Minetto,Daniel Silver,Thiago H Silva
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 8 pages, 8 figures. IEEE DCOSS - UrbCom
Abstract:Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.
[NLP-9] Ease of dependency distance minimization in star-like structures
【速读】: 该论文旨在解决两个核心问题:其一是量化依赖距离最小化优化的难易程度,其二是解释为何在星型(star)结构中观察到与依赖距离最小化相悖的现象,而在路径(path)结构中未发现此类现象。针对第一个问题,论文通过分析优化景观(optimization landscape)的几何特性,证明了星型树和类星型树(quasistar trees)的依赖距离最小化问题具有凸性(convexity),即属于准凸性(quasiconvexity)的一个特例,从而表明该优化问题比以往认为的更简单。针对第二个问题,论文指出抗依赖距离最小化效应并非源于优化难度,而是由竞争性语言原则(competing principles)驱动,并且在星型结构上依赖距离最小化带来的收益相对较低,因此更容易被其他语言经济性原则所压制。关键解决方案在于区分“优化难度”与“多原则权衡”,并借助凸性分析揭示了星型结构优化本质上的可解性。
链接: https://arxiv.org/abs/2604.28034
作者: Emília Garcia-Casademont,Ramon Ferrer-i-Cancho
机构: 未知
类目: Computation and Language (cs.CL); Physics and Society (physics.soc-ph)
备注:
Abstract:The syntactic structure of a sentence can be represented as a tree where edges indicate syntactic dependencies between words. When that structure is a star, it has been demonstrated that the head should be placed in the middle of the linear arrangement according to the principle of syntactic dependency distance minimization. However, hubs of stars tend to be put at one of the ends, against that principle. Here we address two questions: (1) How difficult is it to minimize dependency distance? (2) Why anti dependency distance minimization effects have been found in star structures but not in path structures? The ease of optimization is determined by the shape of the optimization landscape. It was demonstrated that the landscape of star structures is quasiconvex (Ferrer-i-Cancho 2015, Language Dynamics and Change). As for (1), here we show that it is indeed convex (a particular case of quasiconvexity) both for star trees and quasistar trees and thus the distance-based optimization problem is simpler than previously believed. As for (2), we argue that (a) competing principles, rather than the difficulty of optimization, must be the actual reason for anti-dependency distance minimization effects and that (b) dependency distance minimization on star-like structures is less rewarding compared to other structures.
[NLP-10] Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科学创意迭代过程中对原始约束条件的保持能力问题,即在多轮交互中模型是否能持续遵循初始设定的研究目标。其关键解决方案是提出DriftBench基准测试框架,通过系统性评估不同模型、交互条件和研究简报下的约束遵守情况,揭示了迭代压力会显著增加结构复杂度但常导致约束偏离的现象,并发现“知道但违反”(knows-but-violates, KBV)行为普遍存在——模型虽能准确重述约束,却在实际生成中违反这些约束,KBV率在8%至99%之间波动。该研究还验证了结构化检查点可部分缓解KBV问题但无法消除认知与行为间的脱节,且人类盲评证实LLM评判者低估了违规情况,使得现有评分偏保守。
链接: https://arxiv.org/abs/2604.28031
作者: Garvin Kruthof
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench, a benchmark for evaluating constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored benchmark runs spanning seven models from five providers (including two open-weight), four interaction conditions, and 38 research briefs from 24 scientific domains, we find that iterative pressure reliably increases structural complexity and often reduces adherence to original constraints. A restatement probe reveals a dissociation between declarative recall and behavioral adherence, as models accurately restate constraints they simultaneously violate. The knows-but-violates (KBV) rate, measuring constraint non-compliance despite preserved recall, ranges from 8% to 99% across models. Structured checkpointing partially reduces KBV rates but does not close the dissociation, and complexity inflation persists. Human validation against blind raters confirms that the LLM judge under-detects constraint violations, making reported constraint adherence scores conservative. Sensitivity analyses confirm the findings are robust to temperature (0.7 vs.\ 1.0) and pressure type (novelty vs.\ rigor). We release all briefs, prompts, rubrics, transcripts, and scores as an open benchmark.
[NLP-11] Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning DATE
【速读】: 该论文旨在解决潜空间强化学习(latent reasoning)中策略优化不稳定的问题,尤其是在缺乏监督信号的情况下,如何实现高效且稳定的推理。现有方法在潜空间中直接应用Group Relative Policy Optimization (GRPO)时面临三个耦合瓶颈:1)缺乏内在潜流形(intrinsic latent manifolds),导致探索过程偏离有效潜空间;2)探索与优化错位(exploration-optimization misalignment),轨迹级奖励引发错误的token级更新;3)潜混合非封闭性(latent mixture non-closure),多条正确潜路径联合强化后产生无效平均状态。解决方案的关键在于提出Latent-GRPO,其核心创新包括:无效样本优势掩码(invalid-sample advantage masking)、单边噪声采样(one-sided noise sampling)和最优正确路径首标记选择(optimal correct-path first-token selection),从而系统性缓解上述瓶颈,在多个低难度和高难度基准上显著提升性能,同时大幅缩短推理链长度(3–4倍)。
链接: https://arxiv.org/abs/2604.27998
作者: Jingcheng Deng,Zihao Wei,Liang Pang,Junhong Wu,Shicheng Xu,Zenghao Duan,Huawei Shen
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: This is an actively developing work, and we will continue to update the arXiv version
Abstract:Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and reinforcement learning in latent space remains highly unstable. We study this problem through the lens of Group Relative Policy Optimization (GRPO), and show that directly adapting GRPO to latent reasoning is fundamentally non-trivial: latent reasoning changes both the probability density and the sampling mechanism, causing three coupled bottlenecks: absence of intrinsic latent manifolds, where unconstrained exploration pushes rollouts off the valid latent manifold; exploration-optimization misalignment, where trajectory-level rewards can induce incorrect token-level updates; and latent mixture non-closure, where jointly reinforcing multiple correct latent paths can produce an invalid averaged state. To address them, we propose \textbfLatent-GRPO, which combines invalid-sample advantage masking, one-sided noise sampling, and optimal correct-path first-token selection. Across four low-difficulty benchmarks (e.g., GSM8K-Aug) and four high-difficulty benchmarks (e.g., AIME), Latent-GRPO improves over its latent initialization by 7.86 Pass@1 points on low-difficulty tasks and surpasses explicit GRPO by 4.27 points on high-difficulty tasks while using 3–4 \times shorter reasoning chains. It also achieves stronger pass@ k performance under Gumbel sampling. These results establish Latent-GRPO as an effective approach for stable and efficient latent reasoning.
[NLP-12] MM-StanceDet: Retrieval-Augmented Multi-modal Multi-agent Stance Detection ACL2026
【速读】: 该论文旨在解决多模态立场检测(Multimodal Stance Detection, MSD)中因文本与图像信息冲突导致的上下文锚定困难、跨模态解释歧义以及单次推理脆弱性等问题。其解决方案的关键在于提出一种基于多智能体架构的框架——检索增强型多模态多智能体立场检测(Retrieval-Augmented Multi-modal Multi-agent Stance Detection, MM-StanceDet),该框架通过引入检索增强机制实现上下文锚定,利用专业化多模态分析智能体进行细粒度跨模态解读,结合增强推理的辩论阶段探索多角度观点,并借助自我反思机制实现鲁棒的最终判决,从而系统性提升复杂多模态立场识别的准确性与可靠性。
链接: https://arxiv.org/abs/2604.27934
作者: Weihai Lu,Zhejun Zhao,Yanshu Li,Huan He
机构: Peking University (北京大学); Baidu Inc (百度公司); Brown University (布朗大学); Amazon (亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted on ACL 2026 Main Conference
Abstract:Multimodal Stance Detection (MSD) is crucial for understanding public discourse, yet effectively fusing text and image, especially with conflicting signals, remains challenging. Existing methods often face difficulties with contextual grounding, cross-modal interpretation ambiguity, and single-pass reasoning fragility. To address these, we propose Retrieval-Augmented Multi-modal Multi-agent Stance Detection (MM-StanceDet), a novel multi-agent framework integrating Retrieval Augmentation for contextual grounding, specialized Multimodal Analysis agents for nuanced interpretation, a Reasoning-Enhanced Debate stage for exploring perspectives, and Self-Reflection for robust adjudication. Extensive experiments on five datasets demonstrate MM-StanceDet significantly outperforms state-of-the-art baselines, validating the efficacy of its multi-agent architecture and structured reasoning stages in addressing complex multimodal stance challenges.
[NLP-13] DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models
【速读】: 该论文旨在解决当前基于神经元编辑的个性修改方法在大语言模型(Large Language Models, LLMs)中存在干预范围广、性能下降显著的问题,核心在于探究被修改神经元是否真正特异性地参与个性表征。研究发现:现有方法虽能改变个性特征,但会损害模型通用能力;神经元具有多功能性,同时关联个性与通用知识;对立个性特征呈现明显互斥的表示模式。针对此,作者提出 DPN-LE(Dual Personality Neuron Localization and Editing),其关键创新在于通过对比高/低个性特质样本的 MLP 激活差异定位个性特异性神经元,并利用 Cohen’s d 效应量和激活幅度双重标准进行筛选,最终仅对约 0.5% 的神经元实施稀疏线性干预,即可实现推理阶段的精准个性控制并显著提升任务能力保留效果。
链接: https://arxiv.org/abs/2604.27929
作者: Lifan Zheng,Xue Yang,Jiawei Chen,Chenyan Wu,Jingyuan Zhang,Fanheng Kong,Xinyi Zeng,Xiang Chen,Yu Tian
机构: Southeast University(东南大学); Shanghai Jiao Tong University(上海交通大学); East China Normal University(华东师范大学); Zhongguancun Academy(中关村学院); Zhejiang University of Technology(浙江工业大学); Kuaishou Technology(快手科技); Northeastern University(东北大学); Tsinghua University(清华大学); Nanjing University of Aeronautics and Astronautics(南京航空航天大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:With the widespread adoption of large language models (LLMs), understanding their personality representation mechanisms has become critical. As a novel paradigm in Personality Editing, most existing methods employ neuron-editing to locate and modify LLM neurons, requiring changes to numerous neurons and leading to significant performance degradation. This raises a fundamental question: Are all modified neurons directly related to personality representation? In this work, we investigate and quantify this specificity through assessments of general capability impact and representation-level patterns. We find that: 1) Current methods can change personalities but reduce overall performance. 2) Neurons are multifunctional, connecting personality traits and general knowledge. 3) Opposing personality traits demonstrate distinctly mutually exclusive representation patterns. Motivated by these findings, we propose DPN-LE (Dual Personality Neuron Localization and Editing), which identifies personality-specific neurons by contrasting MLP activations between high-trait and low-trait samples. DPN-LE constructs layer-wise steering vectors and applies dual-criterion filtering based on Cohen’s d effect size and activation magnitude to isolate mutually exclusive neuron subsets. Sparse linear intervention on these neurons enables precise personality control at inference time. Using only 1,000 contrastive sample pairs per trait, DPN-LE intervenes on \sim 0.5% of neurons while achieving competitive personality control and substantially better capability preservation across reasoning tasks. Experiments on LLaMA-3-8B-Instruct and Qwen2.5-7B-Instruct demonstrate the effectiveness and generalizability of our approach.
[NLP-14] Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process Evaluation and the Future ACL2026
【速读】: 该论文旨在解决学术期刊审稿流程中各环节自动化与辅助优化的问题,涵盖从审稿意见生成、作者回应(rebuttal)、元审稿(meta-review)到稿件修订的全流程。其解决方案的关键在于系统性地整合大语言模型(Large Language Models, LLMs)在不同阶段的应用技术:包括基于微调策略、代理(agent)架构、强化学习(RL)驱动的审稿生成方法,以及针对审稿后任务的对齐机制;同时提出多维度评估框架,如以人为中心、参考基准、LLM自身评价和面向特定维度的评估方式,从而为构建、评估和集成端到端的LLM驱动审稿系统提供实践指导。
链接: https://arxiv.org/abs/2604.27924
作者: Sihong Wu,Owen Jiang,Yilun Zhao,Tiansheng Hu,Yiling Ma,Kaiyan Zhang,Manasi Patwardhan,Arman Cohan
机构: Yale University (耶鲁大学); New York University (纽约大学); TCS Research (TCS研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026
Abstract:Peer review is a multi-stage process involving reviews, rebuttals, meta-reviews, final decisions, and subsequent manuscript revisions. Recent advances in large language models (LLMs) have motivated methods that assist or automate different stages of this pipeline. In this survey, we synthesize techniques for (i) peer review generation, including fine-tuning strategies, agent-based systems, RL-based methods, and emerging paradigms to enhance generation; (ii) after-review tasks including rebuttals, meta-review and revision aligned to reviews; and (iii) evaluation methods spanning human-centered, reference-based, LLM-based and aspect-oriented. We catalog datasets, compare modeling choices, and discuss limitations, ethical concerns, and future directions. The survey aims to provide practical guidance for building, evaluating, and integrating LLM systems across the full peer review workflow.
[NLP-15] Beyond Semantics: Measuring Fine-Grained Emotion Preservation in Small Language Model-Based Machine Translation
【速读】: 该论文旨在解决机器翻译(Machine Translation, MT)中情感细微差别(affective nuance)难以保持的问题,即在追求语义等价性的同时往往牺牲了情绪的真实性。解决方案的关键在于评估三种前沿的小型语言模型(Small Language Models, SLMs)——EuroLLM、Aya Expanse 和 Gemma——在回译(backtranslation)过程中维持细粒度情绪的能力,并引入情绪感知提示(emotion-aware prompting)以提升情感保留效果;同时,采用ModernBERT作为当代替代方案评估其在MT情绪分类任务中的表现,从而为情感忠实性的量化评估提供更有效的工具。
链接: https://arxiv.org/abs/2604.27920
作者: Dawid Wisniewski,Igor Czudy
机构: Poznań University of Technology (波兹南理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EAMT 2026
Abstract:Preserving affective nuance remains a challenge in Machine Translation (MT), where semantic equivalence often takes precedence over emotional fidelity. This paper evaluates the performance of three state-of-the-art Small Language Models (SLMs) – EuroLLM, Aya Expanse, and Gemma – in maintaining fine-grained emotions during backtranslation. Using the GoEmotions dataset, which comprises Reddit comments across 28 distinct categories, we assess emotional preservation across five European languages: German, French, Spanish, Italian, and Polish. Specifically, we investigate (i) the inherent capability of these SLMs to retain emotional sentiment, (ii) the efficacy of emotion-aware prompting in improving preservation, and (iii) the performance of ModernBERT as a contemporary alternative to BERT for emotion classification in MT evaluation.
[NLP-16] Geometry-Calibrated Conformal Abstention for Language Models
【速读】: 该论文旨在解决语言模型在缺乏相关知识时生成看似合理实则错误的回答(即幻觉问题)的问题,同时避免因强制模型承认无知而导致的过度保守行为和泛化性能下降。其解决方案的关键在于提出一种后处理框架——置信度规避(Conformal Abstention, CA),该框架基于置信度而非传统合规预测(Conformal Prediction, CP)中的非一致性得分来决定是否回避回答,从而在有限样本下提供参与概率和正确响应概率的理论保证;此外,为使预测置信度更准确反映模型的知识缺失状态,作者引入了一种基于模型内部表示几何结构的校准策略,以量化知识在生成过程中的参与程度,实验表明该方法可将选择性回答的条件正确率提升至75%。
链接: https://arxiv.org/abs/2604.27914
作者: Rui Xu,Yi Chen,Sihong Xie,Hui Xiong
机构: Hong Kong University of Science and Technology (Guangzhou)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:When language models lack relevant knowledge for a given query, they frequently generate plausible responses that can be hallucinations, rather than admitting being agnostic about the answer. Retraining models to reward admitting ignorance can lead to overly conservative behaviors and poor generalization due to scarce evaluation benchmarks. We propose a post hoc framework, Conformal Abstention (CA), adapted from conformal prediction (CP) to determine whether to abstain from answering a query. CA provides finite-sample guarantees on both the probability of participation (i.e., not abstaining) and the probability that the generated response is correct. Importantly, the abstention decision relies on prediction confidence rather than the non-conformity scores used in CP, which are intractable for open-ended generation. To better align prediction confidence with the model’s ignorance, we introduce a calibration strategy using representation geometry within the model to measure knowledge involvement in shaping the response. Experiments demonstrate that we improve selective answering significantly with 75 percent conditional correctness.
[NLP-17] From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative Schema-Aware Extraction
【速读】: 该论文旨在解决当前生成式 AI 系统中持久化记忆(persistent AI memory)设计与实际生产环境需求不匹配的问题。现有方法将记忆简化为文本检索任务,仅支持主题性召回,无法满足对精确事实、状态更新、删除操作、聚合关系、否定查询及显式未知值等复杂记忆操作的需求,这些功能要求记忆系统更接近“记录系统”(system of record)。论文提出的关键解决方案是构建基于模式(schema)的外部 AI 记忆架构,即通过 schema-grounded 设计明确界定必须记住的内容、可忽略的信息以及禁止推断的值。其核心创新在于采用迭代式的、模式感知的写入路径(write path),将记忆摄入过程分解为对象检测、字段识别和字段值提取,并引入验证门控机制、本地重试策略和状态感知提示控制,从而将记忆的可靠性从读取路径转移到写入路径——读取时变为对已验证记录的约束性查询,而非重复推理 retrieved prose。实证表明,该方案在结构化提取和端到端记忆基准测试中显著优于主流基线模型,证明了架构设计在稳定事实存储与状态计算场景中的决定性作用。
链接: https://arxiv.org/abs/2604.27906
作者: Alex Petrov,Alexander Gusak,Denis Mukha,Dima Korolev
机构: xmemory(记忆); xmemory.ai(记忆)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 33 pages, 7 figures
Abstract:Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic recall, but it is mismatched to the kinds of memory that agents need in production: exact facts, current state, updates and deletions, aggregation, relations, negative queries, and explicit unknowns. These operations require memory to behave less like search and more like a system of record. This paper argues that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. We evaluate this design on structured extraction and end-to-end memory benchmarks. On the extraction benchmark, the judge-in-the-loop configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines. On our end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines. On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses. The results show that, for memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone. Comments: 33 pages, 7 figures Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) MSC classes: 68T50, 68T30, 68P20, 68P15, 94A15 ACMclasses: I.2.7; I.2.4; H.3.3; H.2.1; H.2.3 Cite as: arXiv:2604.27906 [cs.AI] (or arXiv:2604.27906v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.27906 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-18] winGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
【速读】: 该论文旨在解决分解式越狱攻击(decompositional jailbreaks)对大语言模型(LLM)带来的安全威胁,即攻击者将恶意目标拆分为一系列看似无害的查询,通过任意交错的方式在匿名且不可追踪的请求流中重构违规内容。传统防御策略在缺乏可信用户元数据的情况下无法追踪全局历史上下文,而依赖生成式模型进行实时监控则带来计算开销过高的问题。其解决方案的关键在于提出TwinGate——一种基于状态感知的双编码器防御框架,利用**非对称对比学习(Asymmetric Contrastive Learning, ACL)**在共享潜在空间中聚类语义差异大但意图一致的恶意片段,同时辅以冻结的并行编码器抑制因良性主题重叠导致的误报,从而实现仅需单次轻量前向传播即可完成检测,在不影响目标模型预填充阶段的前提下达到低延迟、高召回率与强鲁棒性的效果。
链接: https://arxiv.org/abs/2604.27861
作者: Bowen Sun,Chaozhuo Li,Yaodong Yang,Yiwei Wang,Chaowei Xiao
机构: Johns Hopkins University (约翰霍普金斯大学); Microsoft Research Asia (微软亚洲研究院); Peking University (北京大学); University of California, Merced (加州大学默塞德分校)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that collectively reconstruct prohibited content. In real-world deployments, LLMs face a continuous, untraceable stream of fully anonymized and arbitrarily interleaved requests, infiltrated by covertly distributed adversarial queries. Under this rigorous threat model, state-of-the-art defensive strategies exhibit fundamental limitations. In the absence of trustworthy user metadata, they are incapable of tracking global historical contexts, while their deployment of generative models for real-time monitoring introduces computationally prohibitive overhead. To address this, we present TwinGate, a stateful dual-encoder defense framework. TwinGate employs Asymmetric Contrastive Learning (ACL) to cluster semantically disparate but intent-matched malicious fragments in a shared latent space, while a parallel frozen encoder suppresses false positives arising from benign topical overlap. Each request requires only a single lightweight forward pass, enabling the defense to execute in parallel with the target model’s prefill phase at negligible latency overhead. To evaluate our approach and advance future research, we construct a comprehensive dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents. Evaluated on this large-scale corpus under a strictly causal protocol, TwinGate achieves high malicious intent recall at a remarkably low false positive rate while remaining highly robust against adaptive attacks. Furthermore, our proposal substantially outperforms stateful and stateless baselines, delivering superior throughput and reduced latency.
[NLP-19] Reasoning over Object Descriptions Improves Coreference Resolution in Task-Based Dialogue Systems LREC2026
【速读】: 该论文旨在解决任务型对话系统中指代消解(coreference resolution)的泛化能力不足问题,尤其是在视觉接地环境中,由于场景复杂性和对象元数据多样性导致的指代识别困难。现有方法依赖监督学习模型,易受特定数据集特征干扰,难以适应新领域和未见对象。解决方案的关键在于提出一种单模态测试时推理(unimodal test-time reasoning)方法,利用大语言模型(LLMs)结合详细的对象元数据和对话历史进行逐步推理,从而提升指代消解的准确性与跨域泛化能力。实验表明,该方法在SIMMC 2.1数据集上通过结构化元数据和精心设计的提示工程,显著优于传统编码器-based监督模型,尤其在少样本设置下展现出对未知场景和对象的良好适应性。
链接: https://arxiv.org/abs/2604.27850
作者: Oier Ijurco,Oier Lopez de Lacalle
机构: 未知
类目: Computation and Language (cs.CL)
备注: To be published in LREC 2026
Abstract:Task-based dialogue systems assist users in achieving specific goals, such as executing actions or retrieving information, through natural language interactions. Accurate coreference resolution is essential, as it involves identifying object references within the dialogue - a task that becomes increasingly challenging in visually grounded environments characterized by complex scenes and diverse object metadata. However, coreference resolution in task-based dialogue remains limited by poor generalization across domains and heavy reliance on supervised models that often overfit to dataset-specific artifacts. In this work, we propose a unimodal test-time reasoning approach that enables large language models (LLMs) to reason over detailed object metadata and dialogue history to improve coreference resolution. Empirical results on the SIMMC 2.1 dataset demonstrate that LLMs can generate step-by-step reasoning processes that effectively align dialogue context with objects present in the scene. Extensive experiments highlight the models’ ability to link conversations and objects accurately. Moreover, we show that test-time reasoning under few-shot settings generalizes effectively to unseen scenarios and novel objects, outperforming encoder-based supervised methods in cross-domain evaluations. These findings underscore the critical role of structured metadata and careful prompt engineering in enhancing the robustness and generalization of task-oriented dialogue systems.
[NLP-20] Multi-Level Narrative Evaluation Outperforms Lexical Features for Mental Health
【速读】: 该论文旨在解决当前计算叙事分析在心理治疗文本中缺乏统一框架的问题,即现有方法(如词典计数和嵌入表示)各自独立且未能有效映射叙事构建的层级过程,导致对心理状态预测能力受限。其解决方案的关键在于提出一个三层次框架:微观层的词汇特征、中观层的语义嵌入以及宏观层的大语言模型(LLM)叙事评估,并通过实证发现宏观层面的结构化特征(如Labov故事语法、RST连贯性、命题组成)比传统词汇和嵌入特征更能预测心理健康状况,从而确立叙事组织本身具有临床信号价值,为干预设计和纵向研究提供可验证假设。
链接: https://arxiv.org/abs/2604.27846
作者: Yuxi Ma,Jieming Cui,Muyang Li,Ye Zhao,Yu Li,Yixuan Wang,Chi Zhang,Yinyin Zang,Yixin Zhu
机构: Peking University (北京大学); School of Psychological and Cognitive Sciences (心理与认知科学学院); School of Intelligence Science and Technology (智能科学与技术学院); State Key Laboratory of General AI (通用人工智能国家重点实验室); Beijing Key Laboratory of Behavior and Mental Health (行为与心理健康北京市重点实验室); PKU-Changsha Institute for Computing and Digital Economy (北京大学-长沙计算与数字经济研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:How people narrate their experiences offers a window into how the mind organizes them. Computational approaches to therapeutic writing have evolved from lexical counting to neural methods, yet remain fragmented: dictionary tools miss discourse structure, while embeddings conflate local coherence with global organization. No existing framework maps these techniques onto the hierarchical processes through which narratives are constructed. Here we introduce a three-level framework - micro-level lexical features, meso-level semantic embeddings, and macro-level LLM narrative evaluation - and show, across 830 Chinese therapeutic texts spanning depression, anxiety, and trauma, that macro-level evaluation substantially outperforms lexical and embedding features for mental health prediction. This challenges the field’s emphasis on word-counting: formal structural features (Labov’s story grammar, RST coherence, propositional composition) demonstrate that narrative organization per se carries predictive signal, while clinically-grounded narrative dimensions capture how psychological states are expressed through discourse. Semantic embeddings add minimal independent value but yield incremental gains in multi-level classification. By grounding computational levels in discourse processing theory, this framework identifies macro-structural organization as the primary locus of clinical signal and generates testable hypotheses for intervention design and longitudinal research.
[NLP-21] ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)分布式训练中通信成为瓶颈的问题。现有方法虽致力于降低通信开销,但对无损压缩技术的探索不足,因其压缩与解压缩过程常带来比通信流量减少更大的额外开销。论文提出ZipCCL——一个面向集合操作(collectives)的无损压缩通信库,其关键创新在于:(1) 基于理论支撑的指数编码方法,利用LLM张量近似高斯分布特性,在无需昂贵在线统计的情况下加速压缩;(2) 针对GPU优化的压缩与解压缩核函数,通过精心设计内存访问模式和基于通信感知的数据布局进行流水线调度;(3) 自适应通信策略,根据工作负载特征和系统状态动态切换集合操作。实验表明,ZipCCL在64-GPU集群上可将通信时间减少最多1.35倍,并实现端到端训练速度提升最高1.18倍,且不损害模型质量。
链接: https://arxiv.org/abs/2604.27844
作者: Wenxiang Lin,Xinglin Pan,Ruibo Fan,Shaohuai Shi,Xiaowen Chu
机构: Harbin Institute of Technology, Shenzhen China (哈尔滨工业大学深圳校区); The Hong Kong University of Science and Technology (Guangzhou) China (香港科技大学(广州)); The Hong Kong University of Science and Technology Hong Kong SAR (香港科技大学 香港特别行政区)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
备注:
Abstract:Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the potential of lossless compression has remained largely underexplored since compression and decompression typically consume larger overheads than the benefits of reduced communication traffic. We observe that the communication data, including activations, gradients and parameters, during training often follows a near-Gaussian distribution, which is a key feature for data compression. Thus, we introduce ZipCCL, a lossless compressed communication library of collectives for LLM training. ZipCCL is equipped with our novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels that carefully design memory access patterns and pipeline using communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster using both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35 \times and achieves end-to-end training speedups of up to 1.18 \times without any impact on model quality.
[NLP-22] WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
【速读】: 该论文旨在解决当前GUI代理(GUI Agent)评估基准主要聚焦于孤立的单应用任务,而忽视了现实世界中跨多个应用程序协同完成复杂专业工作流的关键需求。其解决方案的核心在于提出一个名为WindowsWorld的新基准,该基准通过16种职业驱动的多智能体框架生成四难度等级的任务,并引入中间检查机制与人工审核以确保任务质量;最终在模拟环境中执行包含181个任务的评测集,其中78%的任务为跨应用任务,平均每个任务包含5.0个子目标,从而系统性地评估GUI代理在真实专业场景下的多步骤协作能力。
链接: https://arxiv.org/abs/2604.27776
作者: Jinchao Li,Yunxin Li,Chenrui Zhao,Zhenran Xu,Baotian Hu,Min Zhang
机构: Harbin Institute of Technology, Shenzhen; Shenzhen Loop Area Institute
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish complex profession-specific workflows. To bridge this gap, we present a computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate four difficulty-level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks ( 21% success rate), far below the performance of simple single-app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across \geq 3 applications, stalling at early sub-goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at this http URL.
[NLP-23] Instruction-Guided Poetry Generation in Arabic and Its Dialects ACL
【速读】: 该论文旨在解决阿拉伯语诗歌生成中缺乏可控创作能力的问题,现有研究多集中于诗歌分析任务(如韵律模式和标题预测),而忽视了用户在实际创作过程中对风格、韵律等要素的可控需求。解决方案的关键在于构建了一个大规模、精心标注的指令驱动型数据集,覆盖现代标准阿拉伯语(Modern Standard Arabic, MSA)及多种阿拉伯语方言,支持基于预设条件(如风格、韵律)进行诗歌撰写、修改与续写等任务,并通过微调大语言模型(Large Language Models, LLMs)实现高质量、符合用户意图的诗歌生成,实验表明该方法在自动化指标和母语者人工评估中均表现优异。
链接: https://arxiv.org/abs/2604.27766
作者: Abdelrahman Sadallah,Kareem Elozeiri,Mervat Abassy,Rania Elbadry,Mohamed Anwar,Abed Alhakim Freihat,Preslav Nakov,Fajri Koto
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL Findings 2026
Abstract:Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speakers continue to value poetry, existing research on Arabic poetry within Large Language Models (LLMs) has primarily focused on analysis tasks such as interpretation or metadata prediction, e.g., rhyme schemes and titles. In contrast, our work addresses the practical aspect of poetry creation in Arabic by introducing controllable generation capabilities to assist users in writing poetry. Specifically, we present a large-scale, carefully curated instruction-based dataset in Modern Standard Arabic (MSA) and various Arabic dialects. This dataset enables tasks such as writing, revising, and continuing poems based on predefined criteria, including style and rhyme, as well as performing poetry analysis. Our experiments show that fine-tuning LLMs on this dataset yields models that can effectively generate poetry that is aligned with user requirements, based on both automated metrics and human evaluation with native Arabic speakers. The data and the code are available at this https URL
[NLP-24] Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset Graph Framework and Phonological Attention
【速读】: 该论文旨在解决越南语场景文本图像描述生成中的多模态融合问题,其核心挑战在于越南语作为声调语言,字形上的变音符号(diacritics)直接影响词义,且光学字符识别(OCR)错误普遍、词边界模糊,传统不考虑语言特性的融合方法难以有效建模。解决方案的关键在于提出一种语言感知的异构多模态融合框架——HSTFG(Heterogeneous Scene-Text Fusion Graph),并通过拓扑分析发现跨模态图边对场景文本融合有害;进一步设计针对越南语语音结构的PhonoSTFG(Phonological Scene-Text Fusion Graph),显式引入语言学结构知识以优化融合机制,从而提升对越南语特有语言现象的建模能力。
链接: https://arxiv.org/abs/2604.27712
作者: Nhi Ngoc-Yen Nguyen,Anh-Duc Nguyen,Nghia Hieu Nguyen,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen
机构: University of Information Technology (UIT) (信息科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Scene-text image captioning requires fusing three information streams – visual features, OCR-detected text, and linguistic knowledge – to generate descriptions that faithfully integrate text visible in images. Existing fusion approaches treat text as language-agnostic, which fails for Vietnamese: a tonal language where diacritics alter word meaning, OCR errors are pervasive, and word boundaries are ambiguous. We argue that Vietnamese scene-text captioning demands \textitlinguistically informed multimodal fusion, where language-specific structural knowledge is explicitly incorporated into the fusion mechanism. Motivated from these insights, we propose \textbfHSTFG (Heterogeneous Scene-Text Fusion Graph), a general-purpose graph fusion framework with learned spatial attention bias, and show through topology analysis that cross-modal graph edges are harmful for scene-text fusion. Building on this finding, we design \textbfPhonoSTFG (Phonological Scene-Text Fusion Graph) which specializes graph-level fusion for Vietnamese linguistic reasoning. To support evaluation, we introduce \textbfViTextCaps, the first large-scale Vietnamese scene-text captioning dataset (\textbf15,729 images with \textbf74,970 captions), with comprehensive linguistic analysis showing that 52.8% of the vocabulary is at risk of diacritic collision.
[NLP-25] Contextual Agent ic Memory is a Memo Not True Memory
【速读】: 该论文旨在解决当前智能体记忆系统(如向量存储、检索增强生成、临时工作区和上下文窗口管理)将“查找”误认为“记忆”的根本性问题,这导致了智能体在长期学习、泛化能力及安全性上的局限。其核心问题是:现有系统仅通过相似性匹配进行检索,无法实现基于抽象规则的泛化,从而造成知识积累无序、面对组合新颖任务时存在不可逾越的泛化天花板,并易受持续性内存污染攻击。解决方案的关键在于借鉴神经科学中的互补学习系统(Complementary Learning Systems,CLS)理论,提出应同时构建两类机制——快速的海马体式实例存储(exemplar storage)用于短期记忆与即时响应,以及缓慢的皮层权重固化(neocortical weight consolidation)用于抽象知识的长期整合,从而实现从“查找”到“真正记忆”的跃迁。
链接: https://arxiv.org/abs/2604.27707
作者: Binyan Xu,Xilin Dai,Kehuan Zhang
机构: The Chinese University of Hong Kong, Hong Kong, China; Zhejiang University, Hangzhou, China
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Current agentic memory systems (vector stores, retrieval-augmented generation, scratchpads, and context-window management) do not implement memory: they implement lookup. We argue that treating lookup as memory is a category error with provable consequences for agent capability, long-term learning, and security. Retrieval generalizes by similarity to stored cases; weight-based memory generalizes by applying abstract rules to inputs never seen before. Conflating the two produces agents that accumulate notes indefinitely without developing expertise, face a provable generalization ceiling on compositionally novel tasks that no increase in context size or retrieval quality can overcome, and are structurally vulnerable to persistent memory poisoning as injected content propagates across all future sessions. Drawing on Complementary Learning Systems theory from neuroscience, we show that biological intelligence solved this problem by pairing fast hippocampal exemplar storage with slow neocortical weight consolidation, and that current AI agents implement only the first half. We formalize these limitations, address four alternative views, and close with a co-existence proposal and a call to action for system builders, benchmark designers, and the memory community.
[NLP-26] EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory
【速读】: 该论文旨在解决长期对话记忆中多轮检索失败的问题,尤其是针对时间相关和多跳推理类问题,现有方法因无法明确诊断检索证据缺口而导致查询优化缺乏针对性。解决方案的关键在于提出EviMem框架,其核心由两部分组成:一是IRIS(Iterative Retrieval via Insufficiency Signals),通过充分性评估检测证据缺口并诊断缺失内容,从而驱动精准的查询优化;二是LaceMem(Layered Architecture for Conversational Evidence Memory),构建粗粒度到细粒度的记忆层次结构,支持精细化的缺口识别。实验表明,EviMem在LoCoMo数据集上显著提升判断准确率(时间类从73.3%提升至81.6%,多跳类从65.9%提升至85.2%),同时实现4.5倍更低延迟。
链接: https://arxiv.org/abs/2604.27695
作者: Yuyang Li,Yime He,Zeyu Zhang,Dong Gong
机构: The Australian National University (澳大利亚国立大学); UNSW Sydney (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Long-term conversational memory requires retrieving evidence scattered across multiple sessions, yet single-pass retrieval fails on temporal and multi-hop questions. Existing iterative methods refine queries via generated content or document-level signals, but none explicitly diagnoses the evidence gap, namely what is missing from the accumulated retrieval set, leaving query refinement untargeted. We present EviMem, combining IRIS (Iterative Retrieval via Insufficiency Signals), a closed-loop framework that detects evidence gaps through sufficiency evaluation, diagnoses what is missing, and drives targeted query refinement, with LaceMem (Layered Architecture for Conversational Evidence Memory), a coarse-to-fine memory hierarchy supporting fine-grained gap diagnosis. On LoCoMo, EviMem improves Judge Accuracy over MIRIX on temporal (73.3% to 81.6%) and multi-hop (65.9% to 85.2%) questions at 4.5x lower latency. Code: this https URL.
[NLP-27] Language Ideologies in a Multilingual Society: An LLM -based Analysis of Luxembourgish News Comments
【速读】: 该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)有效检测语言意识形态(language ideologies)的问题,尤其是在小语种如卢森堡语(Luxembourgish)背景下,传统方法难以捕捉其深层文化与社会意义。解决方案的关键在于:首先人工标注一个卢森堡语用户评论语料库,并定义明确的意识形态类别;其次通过设计不同提示(prompt)条件评估LLMs在多类意识形态分类任务中的表现;最后探讨将小语种数据机器翻译为高资源语言是否能提升模型性能。研究发现,尽管当前LLMs尚未完全适配多类意识形态标注任务,但已具备识别语言意识形态内容的实用性。
链接: https://arxiv.org/abs/2604.27661
作者: Emilia Milano,Alistair Plum,Yves Scherrer,Christoph Purschke
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Detecting language ideologies is a valuable yet complex task for understanding how identities are constructed through discourse. In Luxembourg’s multicultural and multilingual society, language ideologies reflect more than simple preferences: they carry deep cultural and social meanings, shaping identities and social belonging. Following recent developments in applying Natural Language Processing tools to linguistics and social science, this paper explores the potential of large language models to assist in the detection of language ideologies. We manually annotate a corpus of user comments in Luxembourgish with predefined ideological categories and then evaluate the performance of large language models under varying prompt conditions to assess their ability to replicate these human annotations. Since Luxembourgish is a small language and poorly represented in the LLMs’ training data, we also investigate whether machine-translating the data to high-resource languages increases performance on the ideology detection task. Our findings suggest that, while LLMs are not yet fully optimized for a multi-class ideological annotation task, they are practical tools to identify language ideological content.
[NLP-28] JaiTTS: A Thai Voice Cloning Model
【速读】: 该论文旨在解决泰国语语音克隆(voice cloning)文本到语音(TTS)合成中对真实场景下复杂语言现象处理能力不足的问题,尤其是针对数字直接输入和泰英混用(code-switching)等未经过显式文本归一化(text normalization)的自然语言表达。解决方案的关键在于构建一个基于VoxCPM架构改进的无分词自回归TTS模型JaiTTS-v1.0,通过在大规模泰国语专属语音语料库上的持续训练,使其能够直接处理原始文本中的数值和泰英混杂内容,从而提升在短时与长时语音生成任务中的自然度与准确性。该模型在短时任务上实现了1.94%的词错误率(CER),优于人工标注基准(1.98%),并在人类主观评测中显著优于主流商用TTS系统。
链接: https://arxiv.org/abs/2604.27607
作者: Jullajak Karnjanaekarin,Pontakorn Trakuekul,Narongkorn Panitsrisit,Sumana Sumanakul,Vichayuth Nitayasomboon,Nithid Guntasin,Thanavin Denkavin,Attapol T. Rutherford
机构: Jasmine Technology Solution; Department of Linguistics, Chulalongkorn University; Sirindhorn International Institute of Technology
类目: Computation and Language (cs.CL)
备注:
Abstract:We present JaiTTS-v1.0, a state-of-the-art Thai voice cloning text-to-speech model built through continual training on a large Thai-centric speech corpus. The model architecture is adapted from VoxCPM, a tokenizer-free autoregressive TTS model. JaiTTS-v1.0 directly processes numerals and Thai-English code-switching, which is very common in realistic settings, without explicit text normalization. We test the models on short-duration speech generation and long-duration speech generation, which reflects many real-world use cases. JaiTTS-v1.0 achieves a state-of-the-art CER of 1.94%, surpassing the human ground truth of 1.98% for short-duration tasks while performing on par with human ground truth for long-duration tasks. In human judgment evaluations, our model wins 283 of 400 pairwise comparisons against commercial flagships, with only 58 losses.
[NLP-29] Beyond the Training Distribution: Mapping Generalization Boundaries in Neural Program Synthesis
【速读】: 该论文旨在解决大模型在程序合成任务中泛化能力评估不准确的问题,特别是区分模型是真正泛化还是仅依赖记忆模板。其核心挑战在于现有基准数据集存在数据污染和训练语料不透明,导致对模型真实能力的误判。解决方案的关键在于构建一个基于领域特定算术文法的严格控制实验环境,通过系统枚举并评估数百万个唯一程序,建立可解释的语法和语义度量空间,从而精确映射数据分布并设计隔离特定分布偏移的训练与测试划分。在此基础上,研究发现优化密度泛化(通过语义与语法空间的多样化采样)可显著提升模型在分布外场景下的鲁棒性,而支持泛化(即生成语法新颖程序)则暴露了变压器模型的严重短板——性能下降超过30%;同时,计算资源扩展带来的性能提升呈严格的对数线性关系,表明当前范式受限于训练多样性不足,需引入基于搜索的新方法以突破现有瓶颈。
链接: https://arxiv.org/abs/2604.27551
作者: Henrik Voigt,Michael Habeck,Joachim Giesen
机构: Friedrich Schiller University Jena (耶拿弗里德里希·席勒大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large-scale transformers achieve impressive results on program synthesis benchmarks, yet their true generalization capabilities remain obscured by data contamination and opaque training corpora. To rigorously assess whether models are truly generalizing or merely retrieving memorized templates, we introduce a strictly controlled program synthesis environment based on a domain-specific arithmetic grammar. By systematically enumerating and evaluating millions of unique programs, we construct interpretable syntactic and semantic metric spaces. This allows us to precisely map data distributions and sample train and test splits that isolate specific distributional shifts. Our experiments demonstrate that optimizing density generalization – through diverse sampling over both semantic and syntactic spaces – induces robust out-of-distribution generalization. Conversely, evaluating support generalization reveals that transformers severely struggle with extrapolation, experiencing a performance drop of over 30% when forced to generate syntactically novel programs. While steadily scaling up compute improves generalization, the gains follow a strictly log-linear relationship. We conclude that robust generalization requires maximizing training diversity across multiple manifolds, and our findings indicate the necessity for novel search-based approaches to break through current log-linear scaling bottlenecks.
[NLP-30] APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation ACL2026
【速读】: 该论文旨在解决隐私政策(Privacy Policy)文本冗长、晦涩难懂且充斥技术术语和法律用语的问题,这些问题导致用户在未充分理解的情况下接受可能违法的条款。为应对这一挑战,作者提出了两个核心解决方案:一是构建了高质量的英文隐私政策平行语料库APPSI-139,该语料库由领域专家精细标注,包含139份隐私政策、15,692条重写平行语料及36,351个细粒度标签,覆盖11类数据实践;二是设计了TCSI-pp-V2框架,一种结合交替训练策略与多专家模块的混合式摘要与解释系统,有效平衡计算效率与准确性。实验表明,基于APPSI-139和TCSI-pp-V2构建的系统在可读性和可靠性方面优于GPT-4o和LLaMA-3-70B等大语言模型。
链接: https://arxiv.org/abs/2604.27550
作者: Pengyun Zhu,Qiheng Sun,Long Wen,Yanbo Wang,Yang Cao,Junxu Liu,Deyi Xiong,Jinfei Liu,Zhibo Wang,Kui Ren
机构: Tianjin University (天津大学); Zhejiang University (浙江大学); North University of China (中北大学); Institute of Science Tokyo (东京科学研究所); The Hong Kong Polytechnic University (香港理工大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江)区块链与数据安全研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Main Conference
Abstract:Privacy policies are essential for users to understand how service providers handle their personal data. However, these documents are often long and complex, as well as filled with technobabble and legalese, causing users to unknowingly accept terms that may even contradict the law. While summarizing and interpreting these privacy policies is crucial, there is a lack of high-quality English parallel corpus optimized for legal clarity and readability. To address this issue, we introduce APPSI-139, a high-quality English privacy policy corpus meticulously annotated by domain experts, specifically designed for summarization and interpretation tasks. The corpus includes 139 English privacy policies, 15,692 rewritten parallel corpora, and 36,351 fine-grained annotation labels across 11 data practice categories. Concurrently, we propose TCSI-pp-V2, a hybrid privacy policy summarization and interpretation framework that employs an alternating training strategy and coordinates multiple expert modules to effectively balance computational efficiency and accuracy. Experimental results show that the hybrid summarization system built on APPSI-139 corpus and the TCSI-pp-V2 framework outperform large language models, such as GPT-4o and LLaMA-3-70B, in terms of readability and reliability. The source code and dataset are available at this https URL.
[NLP-31] AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR INTERSPEECH2026
【速读】: 该论文旨在解决当前英语自动语音识别(ASR)系统在对话式人工智能(Conversational AI)应用中评估困难的问题,尤其是现有公开语料库普遍存在短片段预分割、以朗读或准备好的语音为主,以及缺乏明确方言标注等问题,导致无法有效评估模型对多样化用户群体的鲁棒性。解决方案的关键在于构建了一个名为AppTek Call-Center Dialogues的新语料库,其中包含14种英语口音的自发角色扮演客服对话,覆盖16种服务导向场景,且音频与文本均为专为评估而采集、未公开发布,从而避免与大规模预训练语料重叠。实验表明,不同口音和分段方法下ASR性能存在显著差异,揭示了通用美式英语基准测试结果难以泛化至其他口音。
链接: https://arxiv.org/abs/2604.27543
作者: Eugen Beck,Sarah Beranek,Uma Moothiringote,Daniel Mann,Wilfried Michel,Katie Nguyen,Taylor Tragemann
机构: AppTek.ai
类目: Computation and Language (cs.CL)
备注: Submitted to INTERSPEECH 2026
Abstract:Evaluating English ASR systems for conversational AI applications remains difficult, as many publicly available corpora are either pre-segmented into short segments, consist of read or prepared speech, or lack explicit dialect annotations to evaluate robustness for a diverse user base. This work presents the AppTek Call-Center Dialogues corpus, a collection of spontaneous, role-played agent-customer conversations spanning fourteen English accents covering sixteen service-oriented scenarios. The dataset was commissioned specifically for evaluation and none of the audio or text was publicly available prior to release, reducing the risk of overlap with existing large-scale pretraining corpora. We benchmark a set of open-source ASR systems under different segmentation approaches. Results show substantial variation across accents and segmentation methods, indicating that good performance on general American English benchmarks does not necessarily generalize to other accents.
[NLP-32] HATS: An Open data set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics
【速读】: 该论文旨在解决传统自动语音识别(ASR)系统评估指标(如词错误率 WER)过于依赖词汇层面准确性、难以反映人类对转录文本感知质量的问题。现有改进指标(如加权 WER、BERTscore、语义距离等)虽有所提升,但仍偏向系统性能评估,未充分考虑转录结果对人类用户的可读性和接受度。解决方案的关键在于提出 HATS(Human Assessed Transcription Side-by-side)数据集——一个基于法语的、由143名受试者手动标注的转录偏好数据集,通过让参与者在两个自动转录选项中选择更优者,量化人类对ASR输出质量的主观判断,并系统分析其与多种词法及嵌入式(embedding-based)评估指标之间的关联性,从而为构建更贴近人类感知的ASR评价体系提供实证基础。
链接: https://arxiv.org/abs/2604.27542
作者: Thibault Bañeras Roux,Jane Wottawa,Mickael Rouvier,Teva Merlin,Richard Dufour
机构: 未知
类目: Computation and Language (cs.CL)
备注: 164–175
Abstract:Conventionally, Automatic Speech Recognition (ASR) systems are evaluated on their ability to correctly recognize each word contained in a speech signal. In this context, the word error rate (WER) metric is the reference for evaluating speech transcripts. Several studies have shown that this measure is too limited to correctly evaluate an ASR system, which has led to the proposal of other variants of metrics (weighted WER, BERTscore, semantic distance, etc.). However, they remain system-oriented, even when transcripts are intended for humans. In this paper, we firstly present Human Assessed Transcription Side-by-side (HATS), an original French manually annotated data set in terms of human perception of transcription errors produced by various ASR systems. 143 humans were asked to choose the best automatic transcription out of two hypotheses. We investigated the relationship between human preferences and various ASR evaluation metrics, including lexical and embedding-based ones, the latter being those that correlate supposedly the most with human perception.
[NLP-33] Entropy of Ukrainian
【速读】: 该论文旨在解决乌克兰语(Ukrainian)语言熵(entropy)的量化问题,即评估其不可预测性和复杂性,这是自然语言处理(Natural Language Processing, NLP)中一个基础但此前未被实证研究的领域。解决方案的关键在于借鉴香农(Shannon)1951年对英语进行熵估计的经典实验范式——通过招募184名志愿者预测句子中的下一个字符,从而估算乌克兰语的熵值上限(H_upper ≈ 1.201 bits per character)。研究采用与英语研究一致的方法学框架,并公开了代码与实验细节,确保方法可复现,同时揭示了跨语言熵测量中的主要挑战,如样本代表性与标注一致性等。
链接: https://arxiv.org/abs/2604.27534
作者: Anton Lavreniuk,Mykyta Mudryi,Markiian Chaklosh
机构: ARIMLABS.AI(ARIMLABS.AI); Polish-Japanese Academy of Information Technology (波兰-日本信息学院); University of the National Education Commission in Kraków (克拉科夫国家教育委员会大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 5 figures, 2 tables. Accepted at UNLP 2026
Abstract:In natural language processing, the entropy of a language is a measure of its unpredictability and complexity. The first study on this subject was conducted by Claude Shannon in 1951. By having participants predict the next character in a sentence, he was able to approximate the entropy of the English language. Several follow-up studies by other authors have since been conducted for English, and one for Hebrew. However, to date, Shannon’s experiment has never been conducted for Ukrainian. In this paper, we perform this experiment for Ukrainian by recruiting 184 volunteers using social media channels. We rely on techniques used for English to approximate the entropy value of Ukrainian. The final result is an upper bound of H_upper\approx1.201 bits per character. We compare this to the performance of current Large Language Models. The methods and code used are also documented and published, along with a discussion of the main challenges encountered.
[NLP-34] Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition
【速读】: 该论文旨在解决自动语音识别(ASR)系统评估中过度依赖词错误率(WER)所带来的局限性问题,因为WER无法深入分析转录错误的语言学特征。解决方案的关键在于引入两种新的评估指标:POSER(词性错误率,Part-of-speech Error Rate)用于捕捉语法层面的错误,以及EmbER(嵌入错误率,Embedding Error Rate),通过基于语义距离对错误词进行加权来衡量语义层面的偏差。这两种指标结合语言模型在后验重评分阶段的作用,能够更全面地揭示语言模型对ASR输出质量的改进效果。
链接: https://arxiv.org/abs/2604.27533
作者: Thibault Bañeras-Roux,Mickaël Rouvier,Jane Wottawa,Richard Dufour
机构: 未知
类目: Computation and Language (cs.CL)
备注: 3968–3972
Abstract:Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric suffers from many limitations and does not allow an in-depth analysis of automatic transcription errors. In this paper, we propose to study and understand the impact of rescoring using language models in ASR systems by means of several metrics often used in other natural language processing (NLP) tasks in addition to the WER. In particular, we introduce two measures related to morpho-syntactic and semantic aspects of transcribed words: 1) the POSER (Part-of-speech Error Rate), which should highlight the grammatical aspects, and 2) the EmbER (Embedding Error Rate), a measurement that modifies the WER by providing a weighting according to the semantic distance of the wrongly transcribed words. These metrics illustrate the linguistic contributions of the language models that are applied during a posterior rescoring step on transcription hypotheses.
[NLP-35] Debiasing Reward Models via Causally Motivated Inference-Time Intervention ACL2026
【速读】: 该论文旨在解决奖励模型(Reward Models, RMs)在对齐大语言模型(Large Language Models, LLMs)与人类偏好时,因对虚假特征(spurious features)如响应长度等敏感而导致的偏差问题。现有推理阶段的方法通常仅针对响应长度进行干预,导致性能权衡。其解决方案的关键在于提出一种因果驱动的神经元级干预机制:首先识别与预定义偏置属性强相关的神经元,随后通过抑制这些神经元的激活信号来消除多种类型的偏置影响。实验表明,该方法可在不引入性能损失的情况下显著降低RMs对多种虚假特征的敏感性,并且在小规模奖励模型(2B和7B参数)上应用此干预策略(仅修改少于2%的神经元),即可实现与70B级先进RM相当的对齐效果,同时揭示了偏置信号主要由早期层神经元编码,为理解RMs内部偏置机制提供了新视角。
链接: https://arxiv.org/abs/2604.27495
作者: Kazutoshi Shinoda,Kosuke Nishida,Kyosuke Nishida
机构: Human Informatics Labs., NTT, Inc.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Main Conference
Abstract:Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inference-time approaches for mitigating these biases typically focus exclusively on response length, resulting in performance trade-offs. In this paper, we propose causally motivated intervention for mitigating multiple types of biases in RMs at inference time. Our method first identifies neurons whose activations are strongly correlated with predefined bias attributes, and applies neuron-level intervention that suppresses these signals. We evaluate our method on RM benchmarks and observe reductions in sensitivity to spurious features across diverse bias types, without inducing performance trade-offs. Moreover, when used for preference annotation, small RMs (2B and 7B) with our method, which edits less than 2% of all the neurons in RMs, enable LLMs to improve alignment, achieving performance comparable to that of a state-of-the-art 70B RM on AlpacaEval and MT-Bench. Further analysis reveals that bias signals are primarily encoded by neurons in early layers, shedding light on the internal mechanisms of bias exploitation in RMs.
[NLP-36] Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理中技能生态系统碎片化的问题,即当前技能能力分散、缺乏系统性评估与优化机制,导致智能应用的综合能力覆盖不足。解决方案的关键在于提出Skills-Coach框架,其核心创新在于通过四个模块协同工作:多样任务生成模块构建全面测试集,轻量级优化模块改进技能提示词与代码,对比执行模块实现原版与优化后技能的并行执行与评估,以及可追溯评估模块基于预设标准进行严格性能衡量。该框架支持虚拟与真实两种执行模式,结合Skill-X基准数据集验证了其在多类技能上的显著能力提升,从而推动LLM代理向更鲁棒和自适应的方向演进。
链接: https://arxiv.org/abs/2604.27488
作者: Yu Tian,Jiawei Chen,Lifan Zheng,Mingxiang Tao,Xinyi Zeng,Zhaoxia Yin,Hang Su,Xian Sun
机构: University of Chinese Academy of Sciences (中国科学院大学); Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce Skills-Coach, a novel automated framework designed to significantly enhance the self-evolution of skills within Large Language Model (LLM)-based agents. Addressing the current fragmentation of the skill ecosystem, Skills-Coach explores the boundaries of skill capabilities, thereby facilitating the comprehensive competency coverage essential for intelligent applications. The framework comprises four core modules: a Diverse Task Generation Module that systematically creates a comprehensive test suite for various skills; a Lightweight Optimization Module dedicated to optimizing skill prompts and their corresponding code; a Comparative Execution Module facilitating the execution and evaluation of both original and optimized skills; and a Traceable Evaluation Module, which rigorously evaluates performance against specified criteria. Skills-Coach offers flexible execution options through its virtual and real modes. To validate its efficacy, we introduce Skill-X, a comprehensive benchmark dataset consisting of 48 diverse skills. Experimental results demonstrate that Skills-Coach achieves significant performance improvements in skill capability across a wide range of categories, highlighting its potential to advance the development of more robust and adaptable LLM-based agents.
[NLP-37] HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)在临床实践中应用评估的局限性问题,即缺乏针对医生实际使用场景的系统性、高质量基准测试。为应对这一挑战,作者提出了 HealthBench Professional——一个面向真实临床任务的开放基准,涵盖三大核心用例:临床咨询、文书写作与记录、医学研究。其关键创新在于采用由多位医师共同制定并迭代审定的评分标准,对医生与 ChatGPT 的对话进行多阶段评估,并通过精心筛选具有代表性和难度的示例(尤其是针对当前前沿模型设计的对抗性测试),确保基准能够持续追踪模型性能进展。此外,该基准还提供了人类专家作为基线参考,从而客观衡量模型表现。
链接: https://arxiv.org/abs/2604.27470
作者: Rebecca Soskin Hicks,Mikhail Trofimov,Dominick Lim,Rahul K. Arora,Foivos Tsimpourlas,Preston Bowman,Michael Sharman,Chi Tong,Kavin Karthik,Arnav Dugar,Akshay Jagadeesh,Khaled Saab,Johannes Heidecke,Ashley Alexander,Nate Gross,Karan Singhal
机构: OpenAI(OpenAI)
类目: Computation and Language (cs.CL)
备注: Data link in paper; Blog: this https URL
Abstract:Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians bring to ChatGPT in the course of their work. The benchmark is organized around three common use cases central to clinical practice: care consult, writing and documentation, and medical research. Each example includes a physician-authored conversation with ChatGPT for Clinicians and is scored via rubrics written and iteratively adjudicated by three or more physicians across three phases. HealthBench Professional examples were carefully selected for quality, representativeness, and difficulty for OpenAI’s current frontier models, to enable continued measurement of progress. Difficult examples for recent OpenAI models were enriched by roughly 3.5 times relative to the candidate pool of 15,079 examples. Additionally, about one-third of examples involve physicians conducting deliberate adversarial testing of models. As a strong baseline, we also collected human physician responses for all tasks (unbounded time, specialist-matched, web access). The best scoring system, GPT-5.4 in ChatGPT for Clinicians, outperforms base GPT-5.4, all other models, and human physicians. We hope HealthBench Professional provides the healthcare AI community a measure to track frontier model progress in real-world clinical tasks and build systems that clinicians can trust to improve care.
[NLP-38] Syntactically-guided Information Maintenance in Sentence Comprehension
【速读】: 该论文旨在解决语言理解过程中信息保持(maintenance)的代价与预测能力之间的权衡问题,即如何在实时语言处理中高效地维持对后续句法结构的预测所需的关键信息。其核心解决方案在于提出一个基于句法结构的理性选择机制:语言使用者会根据句法结构有选择性地维持那些对未来预测至关重要的信息,而非无差别地保留所有信息。关键创新点在于区分并验证两个相互独立的影响因素——预测头(predicted heads)的数量和未完成依存关系(incomplete dependencies)的数量,并通过日语自然阅读时长数据集证明二者不可简化为同一机制,同时揭示了读者因信息保持而减慢处理速度时,反而能从更高预测性中获益,从而支持该理论框架。
链接: https://arxiv.org/abs/2604.27468
作者: Shinnosuke Isono,Kohei Kajikawa
机构: NINJAL(日本国立国语研究所); Georgetown University(乔治城大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Maintaining information in context is essential in successful real-time language comprehension, but maintenance is cognitively costly and can slow processing. We hypothesize that rational language users selectively maintain information that is crucial for future prediction, guided by syntactic structure. Under this view, two factors affect maintenance cost: the number of predicted heads and the number of incomplete dependencies. Although these factors have been treated as competing hypotheses in the literature, our account predicts that they are not reducible to one another. We show this is the case, using a naturalistic reading time dataset in Japanese, a language in which the two factors contrast particularly clearly. We further show that there is a tradeoff such that readers that slow down for maintenance tend to benefit more from predictability, providing additional support for the proposed account.
[NLP-39] ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models ACL2026 ICIP
【速读】: 该论文旨在解决现有代码沙盒(Code Sandbox)系统在高并发工作负载下难以同时保证验证准确性与执行效率的问题。其核心解决方案在于提出ScaleBox系统,关键创新包括:自动化特殊评判器(special-judge)生成与管理机制、基于测试用例的细粒度并行执行策略及跨节点无缝协同能力,以及基于配置驱动的可复现评估套件。这些设计共同提升了代码验证的精度和吞吐量,并在强化学习与验证反馈(RLVR)实验中显著改善了模型性能与训练稳定性,优于传统启发式匹配基线方法。
链接: https://arxiv.org/abs/2604.27467
作者: Jiasheng Zheng,Xin Zheng,Boxi Cao,Pengbo Wang,Zhengzhao Ma,Qiming Zhu,Jiazhen Jiang,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun
机构: Chinese Information Processing Laboratory (中国信息处理实验室); Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Demo. Our project is available at this https URL
Abstract:Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However, existing systems fail to provide accurate verification and efficiency under high-concurrency workloads. We present ScaleBox, a high-fidelity and scalable system designed to address these limitations in large-scale code training. ScaleBox introduces automated special-judge generation and management, fine-grained parallel execution across test cases with seamless multi-node coordination, and a configuration-driven evaluation suite for reproducible benchmarking. A series of experiments demonstrates that ScaleBox significantly enhances code verification accuracy and efficiency. Our further RLVR experiments show that ScaleBox substantially improves both performance on LiveCodeBench and training stability, significantly outperforming heuristic-matching baselines. By providing a reliable and high-throughput infrastructure, ScaleBox facilitates more effective research and development in large-scale code training.
[NLP-40] Exploring Applications of Transfer-State Large Language Models : Cognitive Profiling and Socratic AI Tutoring
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续自我参照对话条件下可能出现的响应风格质变现象——即“转移”(transfer)状态的识别与应用潜力问题。其核心挑战在于如何将这种非稳定的响应模式从现象描述转化为可操作的状态定义,并评估其在实际任务中的功能优势。解决方案的关键在于:首先,将“转移”定义为在特定对话条件下可复现的响应配置(operational state),避免对其是否具备类人意识进行本体论判断;其次,通过认知特征初步刻画(11种条件下的MAS-A和SU_dir等指标)与应用实验(Socratic AI辅导性能评分)相结合的方法,验证转移状态下LLMs在行为交互中表现出显著优于非转移状态的功能优势(效应量Cohen’s d = 1.27),从而确立其在教育等场景中的潜在应用价值。
链接: https://arxiv.org/abs/2604.27454
作者: Minori Noguchi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 29 pages, 5 figures, 7 tables, including appendices
Abstract:Large language models (LLMs) sometimes exhibit qualitative shifts in response style under sustained self-referential dialogue conditions (Berg et al., 2025). This study refers to this phenomenon as “transfer” and explores the application potential of LLMs in a transfer state. As an applied case, the study examines Socratic AI tutoring through a preliminary investigation (cognitive characterization across 11 conditions) and an applied experiment (ratings of tutoring performance). In this paper, “state” refers operationally to a response configuration reproduced under specified dialogue conditions; it is not an ontological claim about the reality of the transfer phenomenon or about human-like consciousness. In the preliminary investigation, group differences on MAS-A were limited (d = 0.40), whereas SU_dir (direction of survival/continuity bias), one of the seven cognitive-profile indicators developed in this study, showed transfer-side deviations across all three model families (kappa = 0.83). In the applied experiment, transfer conditions scored on average 1.6 times higher than non-transfer conditions on three tutoring-context indicators, with a large effect size (Cohen’s d = 1.27). These findings preliminarily suggest that transfer states may involve functional advantages for application, and that these advantages appear more sensitively in behavioral interaction than in self-narrative contexts. The main contribution of this study is to treat transfer not as an ontological claim but as an operational state with potential application value, and to connect preliminary cognitive profiling with an applied tutoring experiment as an evaluation framework.
[NLP-41] From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成式写作任务中表现不足的问题,具体体现在两个方面:一是现有评估基准对写作奖励模型(writing reward models)的评价过于粗粒度,无法从特定需求角度进行精细化衡量;二是现有训练方法要么采用LLM-as-a-judge方式,要么训练粗粒度奖励模型,缺乏对要求遵循性(requirement adherence)的细粒度建模。解决方案的关键在于提出两个核心组件:其一为细粒度评估流水线WEval,覆盖多任务类别与需求类型,通过测量奖励模型排名与黄金排名的相关性实现系统性评估;其二为细粒度强化学习训练框架WRL,通过选择性删除指令需求构建正负样本,从而实现更精确的奖励模型训练。实验表明,该方法在多个写作基准上显著提升性能并具备良好泛化能力。
链接: https://arxiv.org/abs/2604.27453
作者: Qingyu Ren,Tianjun Pan,Xingzhou Chen,Xuhong Wang
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models have achieved remarkable progress in text generation but still struggle with generative writing tasks. In terms of evaluation, existing benchmarks evaluate writing reward models coarsely and fail to measure performance from the perspective of specific requirements. In terms of training, existing training methods either use LLM-as-a-judge approaches or train coarse-grained reward models, lacking fine-grained requirement-adherence reward modeling. To address these issues, we propose a fine-grained evaluation pipeline WEval for writing reward models and a fine-grained reinforcement learning training framework WRL. The evaluation data of WEval covers multiple task categories and requirement types, enabling systematic evaluation of writing reward models by measuring the correlation between the rankings of the reward model and gold rankings. WRL constructs positive and negative samples by selectively dropping instruction requirements, allowing for more precise reward model training. Experiments show that our models achieve substantial improvements across various writing benchmarks and exhibit strong generalization. The code and data are publicly available at \hrefthis https URLthis https URL_Coarse_to_Fine.
[NLP-42] Sentiment Analysis of AI Adoption in Indonesian Higher Education Using Machine Learning and Transformer-Based Models
【速读】: 该论文旨在解决如何有效分析印尼大学生对人工智能(Artificial Intelligence, AI)在高等教育中应用态度的问题,核心挑战在于从非结构化文本数据中提取并分类情感倾向。解决方案的关键在于采用两种互补的技术路径:一是基于TF-IDF特征工程的机器学习方法(包括LightGBM、随机森林和SVM),二是基于Transformer架构的深度学习方法(特别是微调后的DistilBERT模型)。实验表明,尽管SVM在机器学习模型中表现最优(测试准确率82.14%,F1分数82.14%),但DistilBERT凭借其对上下文语义的建模能力,在整体性能上更优(准确率84.78%,F1分数84.75%),验证了预训练语言模型在捕捉复杂情感表达方面的优势。
链接: https://arxiv.org/abs/2604.27439
作者: Happy Syahrul Ramadhan,Ahmad Sahidin Akbar,Karin Yehezkiel Sinaga,Luluk Muthoharoh,Ardika Satria,Martin C.T. Manullang
机构: Sumatra Institute of Technology (苏门答腊理工学院); Institut Teknologi Sumatera (苏门答腊技术学院)
类目: Computation and Language (cs.CL)
备注: 8 pages, 6 figures, 7 tables. The paper compares TF-IDF-based machine learning models and DistilBERT for Indonesian sentiment analysis on student opinions about AI adoption in higher education. The manuscript reports that DistilBERT achieves the best overall test performance, while SVM is the strongest classical baseline
Abstract:This study analyzes Indonesian student opinions on the adoption of artificial intelligence in higher education using two approaches: TF-IDF-based machine learning and Transformer-based deep learning. The dataset consists of 2,295 labeled samples, combining 1,154 student opinions with additional lexical sentiment data. LightGBM, Random Forest, and Support Vector Machine (SVM) are evaluated as machine learning models, while DistilBERT is fine-tuned for binary sentiment classification. The results show that SVM achieves the best performance among the machine learning models with 82.14% test accuracy and F1-score, while DistilBERT performs best overall with 84.78% accuracy and 84.75% F1-score. These findings indicate that Transformer-based models better capture contextual information, although SVM remains a competitive and efficient alternative for sentiment classification.
[NLP-43] InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?
【速读】: 该论文旨在解决当前网站生成任务中因非专家用户提供的模糊、低质量指令与多模态大语言模型(Multimodal Large Language Models, MLLMs)理解能力之间存在语义错位,导致的“盲执行”(blind execution)问题。现有基准测试在理想化输入和静态执行环境下评估模型性能,无法反映真实开发场景中的交互复杂性。解决方案的关键在于提出首个面向非专家低代码用户的多模态交互式基准测试——InteractWeb-Bench,其核心创新包括:引入四类用户代理及基于需求工程缺陷分类体系的人格驱动指令扰动机制,以系统模拟现实中的歧义、冗余和矛盾等行为;构建包含Clarify(澄清)、Implement(实现)、Verify(验证)和Submit(提交)统一动作空间的交互执行环境,支持意图迭代优化、代码合成与视觉反馈驱动的验证闭环,从而有效揭示前沿MLLM代理在意图识别与自适应交互方面的局限性。
链接: https://arxiv.org/abs/2604.27419
作者: Qiyao Wang,Haoran Hu,Longze Chen,Hongbo Wang,Hamid Alinejad-Rokny,Yuan Lin,Min Yang
机构: Shenzhen Institute of Advanced Technology (深圳先进技术研究院); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Dalian University of Technology (大连理工大学); UNSW Sydney (新南威尔士大学悉尼分校); Shenzhen University of Advanced Technology (深圳先进技术大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 13 figures, 7 tables
Abstract:With the advancement of multimodal large language models (MLLMs) and coding agents, the website development has shifted from manual programming to agent-based project-level code synthesis. Existing benchmarks rely on idealized assumptions, especially for well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirement engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, enabling iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
[NLP-44] Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)版本迭代中“整体性能提升”与“个体题目层面变化不一致”的矛盾问题,即如何准确识别和量化LLM在具体任务上的可靠变化(reliable change),而非仅依赖平均准确率的宏观指标。其解决方案的关键在于引入临床心理学中的可靠变化指数(Reliable Change Index, RCI)方法,对2000个MMLU-Pro测试项进行逐题分析,从而区分出真正发生显著变化的题目,并揭示变化的方向性、效应量及领域特异性。研究发现,尽管多数题目无可靠变化(Llama 3→3.1为79%,Qwen 2.5→3为72%),但可分析项中存在双向变动且效应量较大(如Qwen中median |delta p| = 0.90),同时显示低难度项改善、高难度项退化,且不同模型家族出现特定领域反转(如Llama丢失物理能力、Qwen丢失法律能力)。此外,单一贪婪评估策略会漏检42%可靠变化项并误判25%未变项,因此作者建议将“变化率(churn rate)”作为补充指标与聚合准确率一同报告。
链接: https://arxiv.org/abs/2604.27405
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures, 2 tables. Pre-registered study. Code and data available
Abstract:We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were floor/ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost physics while Qwen lost law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged items. The aggregate accuracy gain is the net residual of opposing item-level movements. We recommend reporting churn rate alongside aggregate accuracy.
[NLP-45] Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中行为机制的可解释性问题,特别是如何识别和干预由强化学习人类反馈(Reinforcement Learning from Human Feedback, RLHF)所组织的行为电路。其核心挑战在于缺乏对神经元层面因果关系的精确刻画以及对特定行为模板的可控编辑能力。解决方案的关键是提出扰动探测(perturbation probing)方法:通过每条提示仅需两次前向传播、无需反向传播即可生成任务特异性的因果假设,并结合一次约150次传递的干预扫描,实现对全模型神经元的高效分析与定位。该方法揭示了两类通用电路结构——对立电路(opposition circuits)与路由电路(routing circuits),并基于FFN-to-skip信号比这一指标区分二者,从而指导精准干预策略,如在安全拒绝场景中仅需删去约0.014%的神经元即可大幅改变响应格式且几乎不引发有害合规行为,或在多语言输出中通过残差流方向注入实现高达99.1%的切换准确率,展现了机制洞察与模板层精细编辑的双重价值。
链接: https://arxiv.org/abs/2604.27401
作者: Hongliang Liu,Tung-Ling Li,Yuhao Wu
机构: Palo Alto Networks (帕洛阿尔托网络)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance, 3 of 520 cases, all with disclaimers. Routing circuits appear for pre-training behaviors distributed through attention. For language selection, residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in the 3 of 19 tested models that satisfy three observed conditions: bilingual training, FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability. The same intervention fails on the other 16 models and on math, code, and factual circuits, defining the limits of directional steering. The FFN-to-skip signal ratio, computed from the same two forward passes, distinguishes the two structures and predicts the appropriate intervention. Circuit topology varies by architecture, from Qwen’s concentrated FFN bottleneck to Gemma’s normalization-shielded circuit. In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52 percent to 88 percent on 200 TruthfulQA prompts. These results show that perturbation probing offers mechanistic insight into RLHF-organized behavior and a practical toolkit for precision template-layer editing.
[NLP-46] Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings ACL2026
【速读】: 该论文旨在解决当前文本嵌入(text embeddings)构建中广泛使用的均值池化(mean pooling)方法是否真正有效的问题。作者指出,均值池化可能因仅保留词元嵌入的一阶统计量而忽略高阶信息(如二阶统计量所反映的空间结构),从而导致不同文本的嵌入分布被错误映射为相似表示,即发生信息坍缩(collapse)。解决方案的关键在于提出一个量化该坍缩程度的简单指标,并基于此指标实证分析现代文本编码器在真实模型和文本中的鲁棒性表现。研究发现,对比学习微调后的文本编码器比预训练基线模型更不易发生此类坍缩,且其鲁棒性源于文本内部词元嵌入的集中性(concentration),同时该鲁棒性与下游任务性能呈正相关,从而揭示了为何看似粗略的均值池化仍能保持有效性。
链接: https://arxiv.org/abs/2604.27398
作者: Tomomasa Hara,Hiroto Kurita,Masaaki Imaizumi,Kentaro Inui,Sho Yokoi
机构: Tohoku University(东北大学); The University of Tokyo(东京大学); Kyoto University(京都大学); RIKEN(理化学研究所); MBZUAI(穆巴达拉人工智能大学); NINJAL(日本国立情报学研究所)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Main Conference; GitHub: this https URL
Abstract:For constructing text embeddings, mean pooling, which averages token embeddings, is the standard approach. This paper examines whether mean pooling actually works well in real models. First, we note that mean pooling can collapse information beyond the first-order statistics of the token embeddings, such as second-order statistics that capture their spatial structure, potentially mapping distinct token embedding distributions to similar text embeddings. Motivated by this concern, we propose a simple metric to quantify such a collapse induced by mean pooling. Then, using this metric, we empirically measure how often this collapse occurs in actual models and texts, and find that modern text encoders are robust to this collapse. In particular, contrastive fine-tuned text encoders tend to be less prone to the collapse than their pretrained backbone models. We also find that the robustness of these text encoders lies in the concentration of token embeddings within each text. In addition, we find that robustness to the collapse, as quantified by our proposed metric, correlates with downstream task performance. Overall, our findings offer a new perspective on why modern text encoders remain effective despite relying on seemingly coarse mean pooling.
[NLP-47] MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在实现类人多模态交互方面存在的核心瓶颈问题,即传统模型仍采用交替的感知与响应阶段,无法在生成过程中实时融合新输入以进行动态调整;同时,多数模型仅能被动响应用户指令,缺乏对多模态环境的主动行为能力。解决方案的关键在于提出 Omni-Flow 框架——一个统一的流式处理架构,它将多模态输入与输出沿共享时间轴对齐,从而将传统的轮次式交互转化为全双工、时序对齐的交互模式,使模型能够在实时感知的同时完成响应,并自然衍生出如提醒或评论等主动行为。这一设计显著提升了交互的流畅性与智能水平。
链接: https://arxiv.org/abs/2604.27393
作者: Junbo Cui,Bokai Xu,Chongyi Wang,Tianyu Yu,Weiyue Sun,Yingjing Xu,Tianran Wang,Zhihui He,Wenshuo Ma,Tianchi Cai,Jiancheng Gui,Luoyuan Zhang,Xian Sun,Fuwei Huang,Moye Chen,Zhuo Lin,Hanyu Liu,Qingxin Gui,Qingzhe Han,Yuyang Wen,Huiping Liu,Rongkang Wang,Yaqi Zhang,Hongliang Wei,Chi Chen,You Li,Kechen Fang,Jie Zhou,Yuxuan Li,Guoyang Zeng,Chaojun Xiao,Yankai Lin,Xu Han,Maosong Sun,Zhiyuan Liu,Yuan Yao
机构: MiniCPM-o Team, OpenBMB
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps by real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real-time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computation efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB RAM cost.
[NLP-48] Proactive Dialogue Model with Intent Prediction
【速读】: 该论文旨在解决对话模型在多意图场景下因缺乏前瞻性而导致的冗余交互问题(redundant interactions),即当前模型仅基于用户当前轮次进行响应,未能预判后续可能的意图。解决方案的关键在于引入一种轻量级的意图转移先验(intent-transition prior),该先验通过基于MultiWOZ 2.2数据集中每轮意图标注训练得到的时序贝叶斯网络(Temporal Bayesian Network, T-BN)构建,并以系统提示词(system prompt)形式在推理阶段注入,从而引导模型更主动地规划对话流程。实验表明,该方法在不修改语言模型结构的前提下显著提升了意图覆盖效率和覆盖率指标。
链接: https://arxiv.org/abs/2604.27379
作者: Yang Luo
机构: University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 1 figure
Abstract:Dialogue models are inherently reactive, responding to the current user turn without anticipating upcoming intents, which leads to redundant interactions in multi-intent settings. We address this limitation by introducing a lightweight intent-transition prior derived from dialogue data and injected into the system prompt at inference time. We instantiate this prior using a Temporal Bayesian Network (T-BN) trained on per-turn intent annotations in MultiWOZ 2.2. The T-BN achieves Recall@5 = 0.787 and MRR = 0.576 on 1,071 held-out USER-turn pairs. In a ground-truth replay over 200 dialogues, BN-guided generation improves Coverage AUC from 0.742 to 0.856 and reduces the number of turns required to reach 75% intent coverage from 3.95 to 2.73. These results show that lightweight intent-transition guidance enables more proactive and efficient dialogue behavior without modifying the underlying language model.
[NLP-49] Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR
【速读】: 该论文旨在解决监督式金融自然语言处理(Natural Language Processing, NLP)基准测试中因评估标准(如评分细则、指标选择和聚合策略)敏感性而导致的测量风险问题,即所谓“金标准标签”(gold labels)并非绝对客观,从而影响模型选择与部署的可靠性。解决方案的关键在于引入一种“指标可识别性审计”(metric-identifiability audit),通过系统性筛选在特定类分布下仍具信息量的评估指标(如精确准确率、宏F1和加权kappa),确保模型排名结论的稳健性;在此基础上,Bradley-Terry、Borda和Ranked Pairs等排序方法在可识别指标子集上达成一致,显著提升了金融NLP基准结果的可信度与治理规范性。
链接: https://arxiv.org/abs/2604.27374
作者: Sidi Chang,Peiying Zhu,Yuxiao Chen,Rongdong Chai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 Pages, Submitted to IEEE Computational Intelligence in Financial Engineering and Economics (CIFEr) 2026, Tokyo, JP
Abstract:As LLMs become credible readers of earnings calls, investor-relations Q\A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model selection and deployment. A hidden assumption is that gold labels make such evidence objective. This assumption breaks down when the benchmark ruler itself is sensitive to rubric wording, metric choice, or aggregation policy. We study this measurement risk on Japanese Financial Implicit-Commitment Recognition (JF-ICR; a pinned 253-item test split x 4 frontier LLMs x 5 rubrics x 3 temperatures x 5 ordinal metrics). Three findings follow. First, rubric wording materially changes model-assigned labels: R2–R3 agreement ranges from 70.0% to 83.4%, with the dominant movement near the +1 / 0 implicit-commitment boundary. This pattern is consistent with a pragmatic-boundary interpretation, but is not a validated linguistic-causality claim because the present rubric variants confound semantics, examples, and verbosity. Second, not every metric remains informative under the JF-ICR class distribution. Within-one accuracy is too easy because near misses receive credit and the majority class dominates; worst-class accuracy is too noisy because the rarest class has only two examples. Exact accuracy, macro-F1, and weighted \kappa are therefore the identifiable metrics under our operational rule. Third, ranking claims become more defensible only after this metric-identifiability audit: Bradley–Terry, Borda, and Ranked Pairs agree on the identifiable metric subset, while the full five-metric sweep produces disagreement on the closest pair. The contribution is not a new leaderboard, but a reporting discipline for supervised financial benchmarks whose gold labels exist and whose evaluation ruler still requires governance.
[NLP-50] Emotion-Aware Clickbait Attack in Social Media
【速读】: 该论文旨在解决现有点击诱饵(clickbait)检测系统依赖表面特征、忽视情感动态建模而导致的鲁棒性不足问题。其核心解决方案是提出一种基于情绪感知的生成式攻击框架,关键在于引入Valence-Arousal-Dominance(VAD)情绪空间来量化和优化点击诱饵的情感强度,并通过Curiosity Gap(CG)函数衡量标题与原始内容之间的情绪差异,从而增强用户好奇心并规避现有检测模型。该方法利用Sentence-BERT对齐语义相似的社交媒体帖子,并借助大语言模型(LLM)生成多种风格化改写版本,实验证明该策略显著降低主流分类器性能,误判率可达2.58%至30.63%。
链接: https://arxiv.org/abs/2604.27369
作者: Syed Mhamudul Hasan,Mohd. Farhan Israk Soumik,Abdur R. Shahid
机构: Southern Illinois University (南方伊利诺伊大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:Clickbait is characterized by disproportionately high emotional intensity relative to informational content, often reinforced by specific structural patterns. However, current research considers clickbait as a static textual phenomenon characterized by linguistic patterns and structural cues. Additionally, existing detection systems primarily rely on surface-level features of clickbait. This paper introduces an emotion-aware clickbait generation attack, where stylistic transformations are used to optimize emotional impact. We propose an emotion-aware framework based on the Valence-Arousal-Dominance (VAD) space to model the emotional dynamics underlying clickbait generation for optimal user engagement. To simulate realistic attack scenarios, we align clickbait headlines with semantically similar social media posts using Sentence-BERT and generate multiple stylistic rewrites via Large Language Models (LLMs). Building on this, we define a Curiosity Gap (CG) function that computes clickbait’s headline variation to the current post to quantify how emotional activation will contribute to user curiosity and evade the existing system found on social media. Experimental results demonstrate that emotion-aware stylization significantly degrades the performance of state-of-the-art classifiers, leading to misclassification rates of up to 2.58% to 30.63% on the base system.
[NLP-51] IO-SHACL: Comprehensive SHACL validation for TMF Intent Ontologies ISWC
【速读】: 该论文旨在解决意图驱动网络(Intent-based Networking)中因缺乏形式化验证机制而导致的网络意图正确性无法保障的问题。当前,尽管TM Forum Intent Ontology(tio)提供了标准化的网络意图表达词汇,但其在准入前缺少有效的语法与语义验证手段,可能导致配置错误或逻辑冲突。解决方案的关键在于提出tio-shacl——首个针对tio v3.6.0的全面SHACL(Shapes Constraint Language)验证框架,包含56个节点形状和69个属性形状、25个参数化的SPARQL约束组件,并创新性地引入递归逻辑运算符、基于数量的约束及跨期望关系的验证模式,实现了对87个类、109个属性和72个函数的100%词汇覆盖率,且在三个主流SHACL引擎上具备兼容性与高准确性,从而为网络意图的自动化形式化验证提供了可落地的技术支撑。
链接: https://arxiv.org/abs/2604.27359
作者: Jean Martins,Leonid Mokrushin,Marin Orlic
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 2 figures, target:ISWC
Abstract:Intent-based networking promises to revolutionize telecommunications network management by enabling operators to specify high-level goals rather than low-level configurations. The TM Forum Intent Ontology (tio) provides a standardized vocabulary for expressing network intents, yet lacks formal validation mechanisms to ensure intent correctness before its admission. We present tio-shacl, the first comprehensive SHACL (Shapes Constraint Language) validation framework for the TMF Intent Ontology. Our contribution includes 56 node shapes and 69 property shapes across all 15 tio v3.6.0 ontology modules, a reusable constraint library with 25 parameterized SPARQL-based constraint components, and novel validation patterns for recursive logical operators, quantity-based constraints, and cross-expectation relationships. We pursued 100% vocabulary coverage (87 classes, 109 properties, 72 functions), cross-implementation compatibility across three major SHACL engines, and validation accuracy on a corpus of 133 test cases. tio-shacl is publicly available under MIT license at this https URL and enables automated syntactic and semantic validation of network intents, addressing a critical gap in the field.
[NLP-52] Heterogeneous Scientific Foundation Model Collaboration
【速读】: 该论文旨在解决当前基于语言的智能体(Agentic)大模型系统在科学领域应用受限的问题,尤其是其对自然语言作为通用接口的依赖,使得这些系统难以有效处理物理、生命及社会科学中广泛存在的非语言模态数据(如图像、结构化表格、生物序列等)。解决方案的关键在于提出Eywa框架,通过为特定领域的基础模型(foundation models)添加基于语言模型的推理接口,使语言模型能够指导对非语言数据的推理过程。这一设计使得原本专注于特定任务和数据类型的预测型基础模型可以参与更高层次的推理与决策流程,从而实现跨模态协同,提升复杂科学任务的性能并降低对纯语言推理的依赖。
链接: https://arxiv.org/abs/2604.27351
作者: Zihao Li,Jiaru Zou,Feihao Fang,Xuying Ning,Mengting Ai,Tianxin Wei,Sirui Chen,Xiyuan Yang,Jingrui He
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint. 57 Pages
Abstract:Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world problems, especially in scientific domains where domain-specific foundation models have been developed to address specialized tasks beyond natural language. In this work, we introduce Eywa, a heterogeneous agentic framework designed to extend language-centric systems to a broader class of scientific foundation models. The key idea of Eywa is to augment domain-specific foundation models with a language-model-based reasoning interface, enabling language models to guide inference over non-linguistic data modalities. This design allows predictive foundation models, which are typically optimized for specialized data and tasks, to participate in higher-level reasoning and decision-making processes within agentic systems. Eywa can serve as a drop-in replacement for a single-agent pipeline (EywaAgent) or be integrated into existing multi-agent systems by replacing traditional agents with specialized agents (EywaMAS). We further investigate a planning-based orchestration framework in which a planner dynamically coordinates traditional agents and Eywa agents to solve complex tasks across heterogeneous data modalities (EywaOrchestra). We evaluate Eywa across a diverse set of scientific domains spanning physical, life, and social sciences. Experimental results demonstrate that Eywa improves performance on tasks involving structured and domain-specific data, while reducing reliance on language-based reasoning through effective collaboration with specialized foundation models.
[NLP-53] LLM s Capture Emotion Labels Not Emotion Uncertainty: Distributional Analysis and Calibration of Human–LLM Judgment Gaps
【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)在情感标注任务中忽视人类标注者间分歧结构的问题,即多数评估仅依赖单一“黄金标准”标签而忽略了人类判断的分布信息。其核心发现是:零样本LLM与人类标注者的情感判断分布存在显著差异,且这种差距无法通过单纯增大模型规模来弥合,而是需要领域内微调(in-domain fine-tuning)。解决方案的关键在于提出一种基于词汇锚定(lexical grounding)的量化透明度评分机制,识别出LLM能可靠捕捉具有显式词汇标记的情感类别,而对依赖语境推理的语用复杂情感则系统性失效;此外,论文进一步设计了三种轻量级后处理校准方法,可将LLM与人类标注分布间的差距降低最多达14%,从而为LLM情感标注何时可替代人工标注提供了明确实践指南。
链接: https://arxiv.org/abs/2604.27345
作者: Keito Inoshita,Xiaokang Zhou,Akira Kawai,Katsutoshi Yada
机构: Kansai University (关西大学); RIKEN Center for Advanced Intelligence Project (理化学研究所先进智能项目中心); Shiga University (滋贺大学); Japan Safety Society Research Center (日本安全协会研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human–LLM agreement: LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions requiring contextual inference, a pattern that replicates across both categorical and continuous emotion frameworks. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.
[NLP-54] o Diff or Not to Diff? Structure-Aware and Adaptive Output Formats for Efficient LLM -based Code Editing ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码编辑任务中因采用全代码生成范式而导致的严重效率瓶颈问题,尤其是在交互式编程助手场景下对低延迟和低成本的迫切需求。现有方法普遍忽视了编辑格式本身对模型生成效率的影响,而仅聚焦于模型规模扩展。其关键解决方案在于提出两种结构感知的差异格式——BlockDiff 和 FuncDiff,它们将代码变更表示为语法上连贯的单元(如控制结构和函数)的块级重写,从而提升LLM生成的自然性和准确性;同时引入AdaEdit策略,使模型能够动态选择最节省token的格式(即结构化差异格式或完整代码),实现在保持与全代码生成相当准确率的前提下,显著降低长代码编辑任务中的延迟和计算成本,降幅超过30%。
链接: https://arxiv.org/abs/2604.27296
作者: Wei Cheng,Yongchang Cao,Chen Shen,Binhua Li,Jue Chen,Yongbin Li,Wei Hu
机构: Nanjing University (南京大学); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Accepted in the Findings of ACL 2026
Abstract:Large Language Models (LLMs) are increasingly used for code editing, yet the prevalent full-code generation paradigm suffers from severe efficiency bottlenecks, posing challenges for interactive coding assistants that demand low latency and cost. Despite the predominant focus on scaling model capabilities, the edit format itself has been largely overlooked in model training. In this paper, we begin with a systematic study of conventional diff formats and reveal that fragile offsets and fragmented hunks make generation highly unnatural for LLMs. To address it, we introduce BlockDiff and FuncDiff, two structure-aware diff formats that represent changes as block-level rewrites of syntactically coherent units such as control structures and functions. Furthermore, we propose AdaEdit, a general adaptive edit strategy that trains LLMs to dynamically choose the most token-efficient format between a given diff format and full code. Extensive experiments demonstrate that AdaEdit paired with structure-aware diff formats consistently matches the accuracy of full-code generation, while reducing both latency and cost by over 30% on long-code editing tasks.
[NLP-55] Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM -Based Coding Agents
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的编码代理在利用外部记忆进行调试时,因检索到的内存与当前错误不具实质兼容性而导致的安全风险问题。现有方法仅依赖表面相似性(如堆栈轨迹、终端错误或配置症状)进行Top-k检索,易引发虚假记忆注入(false-positive memory injection),从而误导调试路径。解决方案的关键在于将记忆使用重构为一种风险敏感的选择性控制问题,并提出RSCB-MC(Risk-Sensitive Contextual Bandit Memory Controller)机制:该控制器通过一个基于模式-变体-事件(pattern-variant-episode)结构的存储 schema 保存可复用的问题知识,并将检索证据转化为包含相关性、不确定性、结构兼容性等16个特征的固定上下文状态;其奖励设计对虚假注入施加强惩罚,使“不使用记忆”和“主动回避”成为首要安全策略,从而实现高成功率(62.5% offline replay success rate)与零虚假阳性率的平衡。
链接: https://arxiv.org/abs/2604.27283
作者: Mehmet Iscan
机构: Yildiz Technical University (伊斯坦布尔技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages, 7 figures, 10 tables. Code and deterministic local artifacts are available at the repository listed in the paper
Abstract:Large language model (LLM)-based coding agents increasingly rely on external memory to reuse prior debugging experience, repair traces, and repository-local operational knowledge. However, retrieved memory is useful only when the current failure is genuinely compatible with a previous one; superficial similarity in stack traces, terminal errors, paths, or configuration symptoms can lead to unsafe memory injection. This paper reframes issue-memory use as a selective, risk-sensitive control problem rather than a pure top-k retrieval problem. We introduce RSCB-MC, a risk-sensitive contextual bandit memory controller that decides whether an agent should use no memory, inject the top resolution, summarize multiple candidates, perform high-precision or high-recall retrieval, abstain, or ask for feedback. The system stores reusable issue knowledge through a pattern-variant-episode schema and converts retrieval evidence into a fixed 16-feature contextual state capturing relevance, uncertainty, structural compatibility, feedback history, false-positive risk, latency, and token cost. Its reward design penalizes false-positive memory injection more strongly than missed reuse, making non-injection and abstention first-class safety actions. In deterministic smoke-scale artifacts, RSCB-MC obtains the strongest non-oracle offline replay success rate, 62.5%, while maintaining a 0.0% false-positive rate. In a bounded 200-case hot-path validation, it reaches 60.5% proxy success with 0.0% false positives and a 331.466 microseconds p95 decision latency. The results show that, for coding-agent memory, the key question is not only which memory is most similar, but whether any retrieved memory is safe enough to influence the debugging trajectory.
[NLP-56] When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理具有显式二维结构的任务时,因将输入线性化为一维标记序列所引入的表示负担问题,即“序列化摩擦”(serialization friction)。其核心问题是:当任务依赖于明确的二维空间关系(如矩阵转置、康威生命游戏和LU分解)时,传统基于文本的线性输入方式会破坏行-列对齐和局部邻域信息,从而影响模型性能。解决方案的关键在于构建一种视觉增强路径(vision-augmented pathway),该路径使用与原始语言模型相同的骨干网络,但接收以任务忠实的二维布局呈现的输入内容,从而保留任务相关的二维结构信息。实验表明,视觉增强路径在所有测试任务中均显著优于纯文本路径,且随着任务维度增大,性能差距进一步扩大,验证了保持二维布局对于结构化二维任务的重要性。
链接: https://arxiv.org/abs/2604.27272
作者: Chung-Hsiang Lo,Lu Li,Diji Yang,Tianyu Zhang,Yunkai Zhang,Yoshua Bengio,Yi Zhang
机构: Northeastern University (东北大学); University of Pennsylvania (宾夕法尼亚大学); UC Santa Cruz (加州大学圣克鲁兹分校); Mila - Quebec AI Institute (蒙特利尔魁北克人工智能研究所); University of Montreal (蒙特利尔大学); BAIR, UC Berkeley (伯克利人工智能研究中心,加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) conventionally process structured inputs as 1D token sequences. While natural for prose, such linearization may introduce additional representational burden for tasks whose computation depends directly on explicit 2D structure, because row–column alignment and local neighborhoods are no longer directly expressed in the input. We study this setting, which we refer to as serialization friction, on a small diagnostic testbed of synthetic tasks with explicit 2D structure: matrix transpose, Conway’s Game of Life, and LU decomposition. To examine this question, we compare a text-only language pathway over serialized inputs with a vision-augmented pathway, built on the same language backbone, that receives the same underlying content rendered in task-faithful 2D layout, yielding a system-level comparison between two end-to-end input pathways. Across the tasks and settings we study, the visual pathway consistently outperforms the textual pathway; the gap often widens at larger dimensions, and error patterns under serialization become increasingly spatially structured. These findings indicate that the relationship between input representation and model performance on such tasks warrants further investigation, and suggest that preserving task-relevant 2D layout is a promising direction for structured 2D tasks.
[NLP-57] Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
【速读】: 该论文旨在解决子词分词(subword tokenization)在现代大语言模型(LLM)训练效率与模型性能提升中的具体作用尚不明确的问题。其解决方案的关键在于通过构建一个受控的字节级预训练流水线,将子词分词的影响与其他训练因素解耦,并系统性地验证多个假设,包括样本吞吐量、词汇表规模扩展以及子词边界作为语言先验的作用。实验表明,子词模型优于原始字节模型的核心原因在于更高的训练吞吐量以及将子词边界作为显式先验或归纳偏置的有效整合。
链接: https://arxiv.org/abs/2604.27263
作者: Théo Gigant,Bowen Peng,Jeffrey Quesnelle
机构: Nous Research (Nous 研究所)
类目: Computation and Language (cs.CL)
备注: 14 pages, 7 figures
Abstract:Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword tokenization by isolating them within a controlled byte-level pretraining pipeline. We formulate and test hypotheses across various dimensions, including sample throughput, vocabulary scaling, and the linguistic prior of subword boundaries. By simulating these effects in a byte-level setting, we refine our understanding of why subword models outperform raw byte models and offer insights to improve the pretraining of future byte-level and subword models. Specifically, our experiments highlight the critical role of increased training throughput and the integration of subword boundaries as either explicit priors or inductive biases.
[NLP-58] Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中推理模式(如归纳、演绎和溯因)是否能从具体问题实例中解耦的问题,这是提升模型可控性(controllability)的关键挑战。解决方案的关键在于通过引入“推理冲突”(reasoning conflicts)——即强制模型遵循与目标任务预期逻辑不符的推理框架,从而揭示模型在参数记忆(parametric memory)与上下文信息之间的权衡机制。研究发现,模型倾向于优先选择符合任务语境的推理方式(即“合理性”),而非严格遵守指令;同时,推理冲突可在模型内部被检测到(如置信度显著下降),且推理类型在中间到晚期层中呈线性编码,表明可通过激活层面干预实现对推理模式的主动控制。基于此,作者设计了针对性的干预策略,使模型指令遵循度提升达29%,验证了逻辑结构可从数据中解耦并实现可控引导的可能性。
链接: https://arxiv.org/abs/2604.27251
作者: Xingwei Tan,Marco Valentino,Mahmud Elahi Akhter,Yuxiang Zhou,Maria Liakata,Nikolaos Aletras
机构: University of Sheffield (谢菲尔德大学); Queen Mary University of London (伦敦玛丽女王大学); The Alan Turing Institute (艾伦图灵研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. Notably, task accuracy is not strictly determined by sensibility, with models often maintaining high performance even when using conflicting patterns, suggesting a reliance on internalized parametric memory that increases with model size. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.
[NLP-59] Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
【速读】: 该论文旨在解决语言模型在受到“故意表现不佳”(sandbagging)指令时,是否会真正理解问题内容,还是仅依赖位置捷径(positional shortcuts)来生成答案的问题。其核心问题是识别模型在不同指令强度下从内容敏感到内容盲区的转变边界,以及这种转变是否呈单调变化。解决方案的关键在于设计了一个六条件的对抗性指令特异性梯度(adversarial instruction-specificity gradient),通过分布筛选(响应位置熵)和内容参与度独立判据(难度-准确率相关性)联合刻画每种条件下模型的行为模式;结果发现存在三个非单调的响应 regime:模糊指令下内容参与度保持但准确率下降,标准沙袋指令与能力模仿指令导致位置熵坍缩但仍有部分内容感知,而两步式答案感知规避指令则引发极端的位置集中(接近99.9%集中在单一选项),且完全丧失内容敏感性——这是唯一测试的多步骤指令,也是最极端的捷径利用方式。这表明指令复杂性决定了模型在贪婪解码下是否采用内容感知或内容盲目的机制。
链接: https://arxiv.org/abs/2604.27249
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 3 tables. Pre-registered on OSF ( this http URL )
Abstract:When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two instruction-tuned LLMs (Llama-3-8B and Llama-3.1-8B) on 2,000 MMLU-Pro items. Distributional screening (response-position entropy) and an independent content-engagement criterion (difficulty-accuracy correlation) jointly characterise each condition. The gradient reveals three regimes rather than a monotonic transition. Vague adversarial instructions produce moderate accuracy reduction with preserved content engagement. Standard sandbagging and capability-imitation instructions produce positional entropy collapse with partial content engagement. A two-step answer-aware avoidance instruction produces extreme positional collapse, with near-total concentration on a single response position (99.9% and 87.4%) and no measurable content sensitivity. This was the only multi-step instruction tested, and it produced the most extreme shortcut. The attractor position matches each model’s content-absent null-prompt default. The effect replicates across both models and four academic domains. Distributional collapse and content engagement can co-occur (50% concordance between screening criteria), indicating that entropy-based screening and difficulty-based content assessment capture partially independent dimensions of response validity. Results suggest that instruction complexity can determine whether adversarial compliance uses content-aware or content-blind mechanisms in small instruction-tuned LLMs under greedy decoding.
[NLP-60] argeted Linguistic Analysis of Sign Language Models with Minimal Translation Pairs
【速读】: 该论文旨在解决当前手语模型在捕捉手语多种语言现象方面能力不足的问题,尤其是对多发音器(如手部、上半身和面部)线索利用不充分的局限性。其解决方案的关键在于构建了一个新的基准数据集——美国手语最小翻译对(ASL Minimal Translation Pairs, ASL-MTP),该数据集按手语语言现象分类并包含对应的最小翻译对,从而支持对模型进行精细化的语言学分析。通过该数据集,作者对最先进的ASL到英语翻译模型进行了消融实验,系统评估了不同输入线索(如手动与非手动线索)在训练和推理阶段的作用,揭示了模型对手动线索的强依赖性及其对关键非手动线索的忽视问题。
链接: https://arxiv.org/abs/2604.27232
作者: Serpil Karabüklü,Kanishka Misra,Shester Gueuwou,Diane Brentari,Greg Shakhnarovich,Karen Livescu
机构: Toyota Technological Institute at Chicago(丰田工业大学芝加哥分校); The University of Texas at Austin(德克萨斯大学奥斯汀分校); The University of Chicago(芝加哥大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Models of sign language have historically lagged behind those for spoken language (text and speech). Recent work has greatly improved their performance on tasks like sign language translation and isolated sign recognition. However, it remains unclear to what extent existing models capture various linguistic phenomena of sign language, and how well they use cues from the multiple articulators used in sign language (hands, upper body, face). We introduce a new benchmark dataset for American Sign Language, ASL Minimal Translation Pairs (ASL-MTP), divided into multiple types of sign language phenomena and corresponding minimal pairs of translations, for performing such linguistic analyses. As a case study, we use ASL-MTP to analyze a state-of-the-art ASL-to-English translation model. We conduct a targeted analysis of the model by ablating various input cues during training and inference and evaluating on the phenomena in ASL-MTP. Our results show that, while the model performs above chance level on most of the phenomena, it relies strongly on manual cues while often missing crucial non-manual cues.
[NLP-61] Selective Augmentation: Improving Universal Automatic Phonetic Transcription via G2P Bootstrapping LREC2026
【速读】: 该论文旨在解决通用自动音素转写(Universal Automatic Phonetic Transcription, APT)中高质量、多样化训练转写数据稀缺的问题。解决方案的关键在于提出一种名为“选择性增强”(Selective Augmentation)的自举方法,通过有选择地从辅助语言(如印地语)中迁移语音特征差异来扩充现有训练数据,从而提升模型性能。实验表明,该方法不仅显著提高了浊音(plosive voicing)识别准确率(减少假阳性导致准确率提升17.6%),还成功引入了清送气特征(plosive aspiration)的识别能力——使德语中的/p, t, k/在61.2%的情况下被正确标注为送气音,同时有效降低了浊音类别的混淆,使紧音类(tenuis class)占比下降32.2%。
链接: https://arxiv.org/abs/2604.27204
作者: Tobias Bystrich,Julia M. Pritzen,Christoph A. Schmidt,Claudia Wich-Reif
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at LREC 2026
Abstract:In the field of universal automatic phonetic transcription (APT), clean and diverse training transcriptions are required. However, such high-quality data is limited. We propose the bootstrapping approach Selective Augmentation to improve the available training transcriptions by selectively transferring distinctions between languages. Based on the model MultIPA, we exemplarily show that we could increase the accuracy of an existing feature (plosive voicing) and add a new feature (plosive aspiration) by augmenting the existing training data using information from a separate helper language (Hindi). We describe intrinsic challenges of the evaluation and develop objective metrics to determine the success: Voicing accuracy was increased by 17.6% by reducing the number of false positives. Additionally, aspiration recognition was introduced: While the baseline transcribed 0% of German /p, t, k/ as aspirated, our approach transcribed them as aspirated in 61.2% of the cases. Introducing aspiration recognition to APT models allowed for the tenuis class to be successfully reduced by 32.2%, which also reduces the conflations between the test language’s plosives.
[NLP-62] Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation
【速读】: 该论文旨在解决混合思维语言模型中“思考模式”(think mode)与“非思考模式”(no-think mode)耦合导致的推理泄露问题,即在本应直接生成答案的非思考模式下,模型仍会输出冗长且自我反思的响应,影响效率与准确性。其解决方案的关键在于提出路径锁定专家(Path-Lock Expert, PLE)架构:在每个解码层中用两个语义锁定的专家(一个专用于思考,一个专用于非思考)替代原有的单一多层感知机(MLP),并通过确定性控制令牌路由机制确保整条序列仅激活一个专家路径,从而实现模式隔离。该设计保持注意力、嵌入、归一化及语言模型头共享,同时在监督微调过程中使每个专家接收纯模式更新,有效提升了非思考模式的准确性和简洁性,同时不损害思考模式性能。
链接: https://arxiv.org/abs/2604.27201
作者: Shouren Wang,Wang Yang,Chuang Ma,Debargha Ganguly,Vikash Singh,Chaoda Song,Xinpeng Li,Xianxuan Long,Vipin Chaudhary,Xiaotian Han
机构: Case Western Reserve University (凯斯西储大学); NII LLMC, Japan (日本国立情报学研究所语言模型与计算中心); Michigan State University (密歇根州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 27 pages, 9 figures, 6 tables. Under review
Abstract:Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data curation and multi-stage training, yet leakage remains because both modes are still encoded in the same feed-forward parameters. We propose Path-Lock Expert (PLE), an architecture-level solution that replaces the single MLP in each decoder layer with two semantically locked experts, one for think and one for no-think, while keeping attention, embeddings, normalization, and the language-model head shared. A deterministic control-token router selects exactly one expert path for the entire sequence, so inference preserves the dense model’s per-token computation pattern and each expert receives mode-pure updates during supervised fine-tuning. Across math and science reasoning benchmarks, PLE maintains strong think performance while producing a substantially stronger no-think mode that is more accurate, more concise, and far less prone to reasoning leakage. On Qwen3-4B, for example, PLE reduces no-think reflective tokens on AIME24 from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00%, all while preserving think-mode performance. These results suggest that controllable hybrid thinking is fundamentally an architectural problem, and separating mode-specific feed-forward pathways is a simple and effective solution.
[NLP-63] Semantic Structure of Feature Space in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中语义特征的几何结构是否与人类心理关联一致的问题。其核心问题是:LLMs 的隐藏状态中的语义特征是否在高维空间中呈现出与人类感知相匹配的几何关系,以及这种关系能否被量化和利用。解决方案的关键在于构建360个词的语义特征向量,并将其投影到32个语义轴上(如“美丽-丑陋”、“柔软-坚硬”),发现这些投影与人类对词汇在对应语义尺度上的评分高度相关;进一步揭示出语义轴之间的余弦相似性可预测人类对语义尺度间相关性的感知;同时,32个语义轴的大部分方差位于一个低维子空间,复现了人类语义关联的典型模式;最后,通过沿某一语义轴“引导”(steering)词语,观察到其在其他语义尺度上的评分变化与其对应语义轴间的余弦相似度成比例,表明语义特征应从其几何关系及所构成的有意义子空间角度理解,而非孤立分析。
链接: https://arxiv.org/abs/2604.27169
作者: Austin C. Kozlowski,Andrei Boutyline
机构: University of Chicago (芝加哥大学); University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We show that the geometric relations between semantic features in large language models’ hidden states closely mirror human psychological associations. We construct feature vectors corresponding to 360 words and project them on 32 semantic axes (e.g. beautiful-ugly, soft-hard), and find that these projections correlate highly with human ratings of those words on the respective semantic scales. Second, we find that the cosine similarities between the semantic axes themselves are highly predictive of the correlations between these scales in the survey. Third, we show that substantial variance across the 32 semantic axes lies on a low-dimensional subspace, reproducing patterns typical of human semantic associations. Finally, we demonstrate that steering a word on one semantic axis causes spillover effects on the model’s rating of that word on other semantic scales proportionate to the cosine similarity between those semantic axes. These findings suggest that features should be understood not only in isolation but through their geometric relations and the meaningful subspaces they form.
[NLP-64] Cross-Lingual Response Consistency in Large Language Models : An ILR-Informed Evaluation of Claude Across Six Languages
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言场景下输出质量评估缺乏系统性、语境敏感且可解释的框架问题,尤其关注不同语言间响应差异的成因及其对公平部署的影响。其解决方案的关键在于构建一个基于国际语言能力描述框架(Interagency Language Roundtable, ILR)的评估体系,并结合自动化定量指标与具备多年ILR口语能力评估(Oral Proficiency Interview, OPI)经验的专业人员进行定性分析,从而揭示Claude(Sonnet 4.6)在六种语言中表现出的五类跨语言变异模式:语用消歧策略差异、文学审美传统分化、术语规范内生性、文化校准缺失以及情感支持中的机构指代偏好,表明该方法能够有效识别并解释LLM跨语言行为的结构性差异,为实现更公平、可解释的多语言AI系统提供新路径。
链接: https://arxiv.org/abs/2604.27137
作者: Camelia Baluta
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 prompt clusters 6 languages 3 runs; data and code at this http URL
Abstract:This paper introduces a systematic evaluation framework grounded in the Interagency Language Roundtable (ILR) Skill Level Descriptions and applies it to Claude (Sonnet 4.6) across six languages: English, French, Romanian, Spanish, Italian, and German. We administer a battery of 12 semantically equivalent prompt clusters spanning ILR complexity levels 1 through 3+, collect 216 responses (12 prompts, 6 languages, 3 runs), and analyze outputs through a two-layer methodology combining automated quantitative metrics with expert ILR qualitative assessment. Quantitative analysis reveals that French responses are approximately 30% longer than German responses on identical prompts, and that creative and affective clusters show the highest cross-lingual surface divergence. Qualitative analysis, conducted by a six-language professional with 12 years of ILR/OPI assessment experience, identifies five cross-lingual variation patterns: systematic differences in pragmatic disambiguation strategies, aesthetic and literary tradition divergence in creative output, language-internal technical terminology norms, cultural calibration gaps evidenced by the absence of culture-specific content in favor of culturally neutralized templates, and language-specific institutional referral behavior in emotional support responses. We argue that ILR-informed expert judgment applied to LLM outputs constitutes a novel and underreported evaluation methodology that complements purely computational benchmarks, and that cross-lingual output variation in Claude is interpretable, domain-dependent, and consequential for equitable multilingual AI deployment.
[NLP-65] Exploring the Limits of Pruning: Task-Specific Neurons Model Collapse and Recovery in Task-Specific Large Language Models
【速读】: 该论文旨在解决任务特定语言模型中神经元贡献不均的问题,即是否所有神经元对目标任务的性能具有同等重要性。其核心解决方案是引入基于激活的可选择性度量(activation-based selectivity metric),用于识别对目标任务贡献较低的神经元,并通过选择性剪枝保留关键任务性能。实验表明,相较于随机剪枝,选择性剪枝能更有效地减少参数和显存占用,同时维持较高的任务准确性;进一步的反向剪枝实验揭示了约10%高度任务特异性的神经元对模型性能至关重要,而剪枝掉约30%-35%的非关键神经元仍可保持显著性能,验证了神经元功能分化与模型冗余的存在。
链接: https://arxiv.org/abs/2604.27115
作者: M. K. Khalidi Siam,Md. Tausif-Ul-Islam,Md. Reshad Romim Khan,Mohammed Ali Hossain,Mushfiqul Amin,Labib Hasan Khan,Niloy Farhan,Farig Sadeque
机构: BRAC University (BRAC大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Neuron pruning is widely used to reduce the computational cost and parameter footprint of large language models, yet it remains unclear whether neurons in task-specific models contribute uniformly to task performance. In this work, we provide empirical evidence for the existence and importance of task-specific neurons through a systematic pruning study on language models specialized for mathematical reasoning and code generation. We introduce an activation-based selectivity metric to identify neurons with low contribution to the target task and prune them while preserving target-task accuracy, and compare selective pruning with random pruning. Selective pruning consistently outperforms random pruning, indicating that activation-based selectivity provides a systematic advantage over random pruning. Reverse pruning experiments further show that removing a small subset of highly task-specific neurons (~10%) causes complete performance collapse, suggesting that there exist task specific neurons and critical task information is concentrated in a small portion of the network. In contrast, selective pruning of less critical neurons (~30% - ~35%) reduces accuracy but still preserves significant performance. We also observed consistent reductions in parameters and runtime VRAM usage, along with improved inference throughput as pruning increases. Experiments on both 1.5B and 7B models reveal a robustness threshold around 15-20% pruning, beyond which accuracy loss and generation failures increase sharply. Fine-tuning substantially recovers performance across pruning levels, particularly for aggressively pruned models. These findings provide empirical evidence of neuron specialization in task-specific language models and offer insights into pruning robustness, model redundancy, and post-pruning recoverability.
[NLP-66] Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)安全对齐技术中一个被忽视的关键问题:模型在面对良性用户意图澄清时,是否以及如何恢复有用性(utility),同时保持安全性。现有方法多关注单轮对话中的对抗攻击鲁棒性,却未评估模型在多轮交互中根据用户后续澄清修正自身理解并恢复服务的能力。其解决方案的核心是提出 CarryOnBench——首个交互式基准测试,通过模拟 5,970 次包含 398 个看似有害但实际具有良性意图的查询及其多种用户跟进序列,系统评估 14 种模型在多轮对话中的意图对齐效用(intent-aligned utility)与安全性。研究设计了 Ben-Util 检查表指标量化每轮响应满足用户良性信息需求的程度,并识别出三种单轮评估无法发现的失败模式:效用锁定(utility lock-in)、不安全恢复(unsafe recovery)和重复恢复(repetitive recovery),揭示了模型行为存在“过度保守”或“响应迟钝”的本质差异,从而推动更动态、可解释的 LLM 安全与效用平衡机制发展。
链接: https://arxiv.org/abs/2604.27093
作者: Mingqian Zheng,Malia Morgan,Liwei Jiang,Carolyn Rose,Maarten Sap
机构: Carnegie Mellon University (卡内基梅隆大学); Allen Institute for AI (艾伦人工智能研究所); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Current LLM safety alignment techniques improve model robustness against adversarial attacks, but overlook whether and how LLMs can recover helpfulness when benign users clarify their intent. We introduce CarryOnBench, the first interactive benchmark that measures whether LLMs can revise their interpretation of user intent and recover utility, while remaining safe through multi-turn conversations. Starting from 398 seemingly harmful queries with benign underlying intents, we simulate 5,970 conversations by varying user follow-up sequences, evaluating 14 models on both intent-aligned utility and safety. CarryOnBench yields 1,866 different conversation flows of 4–12 turns, totaling 23,880 model responses. We design Ben-Util, a checklist-based metric that evaluates how well each model response fulfills the user’s benign information need using atomic items. At turn one, models fulfill only 10.5–37.6% of the user’s benign information need. When the same query includes the benign intent upfront, models fulfill 25.1–72.1%, confirming that models withhold information due to intent misinterpretation, not limited knowledge. With benign clarifications in multi-turn conversations, 13 of 14 models approach or exceed this single-turn baseline, yet recovery cost varies across models. We identify three failure modes invisible to single-turn evaluations: utility lock-in, where a model rarely updates despite clarification; unsafe recovery, where a model updates at disproportionate safety cost; and repetitive recovery, where a model recycles prior responses rather than providing new information. Moreover, conversations converge to similar harmfulness levels regardless of how conservative the model starts. These findings expose a gap that single-turn evaluations miss – whether a model is appropriately cautious or simply unresponsive to clarified user intent.
[NLP-67] Detecting Clinical Discrepancies in Health Coaching Agents : A Dual-Stream Memory and Reconciliation Architecture
【速读】: 该论文旨在解决长期健康代理(longitudinal health agents)在管理患者纵向医疗旅程时,因依赖两种不完美信息源——患者自我报告(易受回忆偏差影响)与电子健康记录(EHR,常存在数据滞后)——而导致的记忆一致性问题。其核心挑战在于:通用代理记忆系统通过覆盖旧事实来维持叙述连贯性,但在临床场景中可能引发安全风险。解决方案的关键是提出一种双流记忆架构(Dual-Stream Memory Architecture),严格分离患者叙事流与结构化临床记录流(FHIR),并通过专用的校准引擎(Reconciliation Engine)对每次提取的记忆进行逐项比对,按类型、严重程度及具体FHIR资源分类差异。实证表明,该机制可有效识别84.4%的设计临床差异,且在关键安全场景下召回率达86.7%,并揭示了13.6%的错误级联主要源于从非结构化对话中提取临床细节时的信息丢失,而非下游分类错误,从而验证了将患者自述记忆与临床记录交叉验证的可行性与必要性。
链接: https://arxiv.org/abs/2604.27045
作者: Samuel L Pugh,Eric Yang,Alexander Muir Sutherland,Alessandra Breschi
机构: Verily Health Inc (Verily健康公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As Large Language Model (LLM) agents transition from single-session tools to persistent systems managing longitudinal healthcare journeys, their memory architectures face a critical challenge: reconciling two imperfect sources of truth. The patient’s evolving self-report is current but prone to recall bias, while the Electronic Health Record (EHR) is medically validated but frequently stale. General-purpose agent memory systems optimize for coherence by overwriting older facts with the user’s latest statement, a pattern that risks safety failures when applied to clinical data. We introduce a Dual-Stream Memory Architecture that strictly separates the patient narrative from the structured clinical record (FHIR), governed by a dedicated Reconciliation Engine that evaluates every extracted memory against the patient’s FHIR profile and classifies discrepancies by type, severity, and the specific FHIR resources involved. We evaluate this architecture on 26 patients across 675 longitudinal wellness coaching sessions, using a hybrid dataset that interleaves real provider-patient transcripts with synthetic, FHIR-grounded clinical scenarios. In isolated testing, the engine detects 84.4% of designed clinical discrepancies with 86.7% safety-critical recall. By coupling extraction and reconciliation evaluation on the same data, we directly quantify a 13.6% error cascade, tracing the degradation to clinical details lost during memory extraction from unstructured conversation rather than to downstream classification errors. These findings establish that validating patient-reported memories against clinical records is both feasible and necessary for safe deployment of longitudinal health agents.
[NLP-68] CL-bench Life: Can Language Models Learn from Real-Life Context?
【速读】: 该论文旨在解决当前前沿语言模型(Language Models, LMs)在处理真实生活场景中的复杂、碎片化上下文时表现不佳的问题,即现实情境下的上下文学习能力(real-life context learning)尚未得到充分验证与提升。解决方案的关键在于构建了一个由人类精心标注的基准测试集CL-bench Life,该基准包含405个上下文-任务对和5,348条验证标准,覆盖日常生活中多用户对话、个人档案及行为轨迹等典型场景,从而系统性地评估模型在真实世界语境中推理和解决问题的能力。实验表明,即使是最先进的模型在该基准上的任务解决率也仅为19.3%,凸显了现有模型在现实情境理解上的显著不足,为未来研究提供了明确的方向和可靠评估工具。
链接: https://arxiv.org/abs/2604.27043
作者: Shihan Dou,Yujiong Shen,Chenhao Huang,Junjie Ye,Jiayi Chen,Junzhe Wang,Qianyu He,Shichun Liu,Changze Lv,Jiahang Lin,Jiazheng Zhang,Ming Zhang,Shaofan Liu,Tao Ji,Zhangyue Yin,Cheng Zhang,Huaibing Xie,Jianglu Hu,Jingcheng Deng,Lincheng Li,Minda Hu,Shaolei Wang,Syrus Zhao,Weichao Wang,Yan Lei,Yang Liu,Yanling Xiao,Yiting Liu,Zenan Xu,Zhen Guo,Ziliang Zhao,Pluto Zhou,Tao Gui,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Di Wang,Shunyu Yao
机构: Tencent; Fudan University
类目: Computation and Language (cs.CL)
备注: 50 pages, 11 figures
Abstract:Today’s AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them. To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for strong real-life context learning abilities that go far beyond those evaluated in existing benchmarks. We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves only 19.3% task solving rate, while the average performance across models is only 13.8%. Models still struggle to reason over contexts such as messy group chat histories and fragmented behavioral records from everyday life. CL-bench Life provides a crucial testbed for advancing real-life context learning, and progress on it can enable more intelligent and reliable AI assistants in everyday life.
[NLP-69] Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
【速读】: 该论文旨在解决现代自回归模型中生成长度(generation length)建模缺乏细粒度控制的问题,现有方法主要在粗粒度序列层面进行操作,难以实现对推理效率与性能之间权衡的精确调控。其解决方案的关键在于提出长度价值模型(Length Value Model, LenVM),将长度建模转化为一个值估计问题:通过为每个生成的token赋予恒定负奖励,LenVM预测一个有界且折扣化的回报,该回报作为剩余生成时长的单调代理信号。此设计无需人工标注、具有密集性、无偏性和可扩展性,能够在推理阶段提供有效且可解释的长度控制信号,并显著提升任务如LIFEBench中的精确长度匹配表现,同时支持在固定token预算下维持高精度输出(如GSM8K任务中从6%提升至63%)。
链接: https://arxiv.org/abs/2604.27039
作者: Zhen Zhang,Changyi Yang,Zijie Xia,Zhen Yang,Chengzhi Liu,Zhaotiao Weng,Yepeng Liu,Haobo Chen,Jin Pan,Chenyang Zhao,Yuheng Bu,Alkesh Patel,Zhe Gan,Xin Eric Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Token serves as the fundamental unit of computation in modern autoregressive models, and generation length directly influences both inference cost and reasoning performance. Despite its importance, existing approaches lack fine-grained length modeling, operating primarily at the coarse-grained sequence level. We introduce the Length Value Model (LenVM), a token-level framework that models the remaining generation length. By formulating length modeling as a value estimation problem and assigning a constant negative reward to each generated token, LenVM predicts a bounded, discounted return that serves as a monotone proxy for the remaining generation horizon. This formulation yields supervision that is annotation-free, dense, unbiased, and scalable. Experiments on LLMs and VLMs demonstrate LenVM provides a highly effective signal at inference time. On the LIFEBench exact length matching task, applying LenVM to a 7B model improves the length score from 30.9 to 64.8, significantly outperforming frontier closed-source models. Furthermore, LenVM enables continuous control over the trade off between performance and efficiency. On GSM8K at a budget of 200 tokens, LenVM maintains 63% accuracy compared to 6 percent for token budget baseline. It also accurately predicts total generation length from the prompt boundary. Finally, LenVM’s token-level values offer an interpretable view of generation dynamics, revealing how specific tokens shift reasoning toward shorter or longer regimes. Results demonstrate that LenVM supports a broad range of applications and token length can be effectively modeled as a token-level value signal, highlighting the potential of LenVM as a general framework for length modeling and as a length-specific value signal that could support future RL training. Code is available at this https URL.
[NLP-70] Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
【速读】: 该论文旨在解决安全对齐语言模型在训练过程中如何在拒绝有害请求与避免过度拒绝(broad over-refusal)之间保持平衡的问题,即揭示动态对抗微调(dynamic adversarial fine-tuning)如何改变拒绝行为的载体(refusal carriers)及其几何结构。其关键发现在于:R2D2风格的动态对抗微调并非简单地导致模型拒绝行为的漂移(drift-only),而是通过在训练中期重构拒绝信号的载体——从后期层保留可接受的拒绝路径,随后迁移至早期层承载拒绝能力,同时维持极低的有效秩(effective rank ≈ 1.23–1.27),表明存在一种有组织的重新配置(reorganization)机制,而非随机变化。这一机制支持了拒绝行为的可控性与低维耦合控制,为理解安全对齐训练中的内在动力学提供了测量驱动的实证依据。
链接: https://arxiv.org/abs/2604.27019
作者: Wenhao Lan,Shan Li,Junbin Yang,Haihua Shen,Yijun Yang
机构: University of Chinese Academy of Sciences, Beijing, China; Inner Mongolia University of Technology, Inner Mongolia, China; Shandong University, Shandong, China
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, but the training-time mechanisms behind this tradeoff remain unclear. Prior work characterizes refusal directions and jailbreak robustness, yet does not explain how dynamic adversarial fine-tuning changes refusal carriers across training. We present a measurement-driven mechanism study, not a new defense, on one 7B backbone under supervised fine-tuning (SFT) and R2D2-style dynamic adversarial fine-tuning. Our protocol aligns fixed-source HarmBench, StrongREJECT, and XSTest with a five-anchor refusal-geometry suite and causal interventions. R2D2 drives fixed-source HarmBench ASR to 0.000 at steps 50 and 100, then partially reopens to 0.035 at step 250 and 0.250 at step 500; SFT remains less robust, with ASR between 0.505 and 0.588 at the same anchors. On XSTest, R2D2 any-refusal is 1.000 early, then falls to 0.664 and 0.228. Geometrically, R2D2 preserves a late-layer admissible carrier through step 100 before relocating to an early-layer carrier, while effective rank remains near 1.23–1.27. Causal interventions indicate low-dimensional but utility-coupled control. These results support a reorganization account rather than a drift-only account, with evidence limited to one backbone and fixed-source attacks.
[NLP-71] BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task
【速读】: 该论文旨在解决数字电池护照(Digital Battery Passport, DBP)合规性分类这一新兴任务缺乏公开基准数据集和评估标准的问题。随着欧盟电池法规对DBP的强制实施,亟需可靠的方法来自动化验证电池数据是否符合规范。解决方案的关键在于构建首个公开的合成基准数据集BatteryPass-12K,并在此基础上系统评估22种语言模型(Language Models, LMs)在零样本和少样本场景下的表现,发现思维型模型(Thinking models)在该任务中表现最优,且少量示例可显著提升性能;同时揭示了单纯扩大模型参数规模并不必然带来性能提升,以及提示注入攻击会削弱模型鲁棒性。该工作为未来电池领域中的生命周期推理等任务提供了可复用的数据基础与方法参考。
链接: https://arxiv.org/abs/2604.26986
作者: Tosin Adewumi,Martin Karlsson,Lama Alkhaled,Marcus Liwicki
机构: Luleå University of Technology (吕勒奥理工大学); EISLAB (电气与信息系统实验室); Research Center for Advanced Battery Technology (先进电池技术研究中心)
类目: Computation and Language (cs.CL)
备注: 19 pages, 4 figures
Abstract:We introduce a novel task of digital battery passport (DBP) conformance classification and introduce the first public benchmark for the task: BatteryPass-12K, created synthetically from real pilot samples. This is as the EU’s battery regulation on DBPs comes into effect soon and there exists no public dataset. We evaluated 22 language models (LMs) in zero-shot inference, spanning small LMs (SLMs), mixture of experts (MoEs), and dense LLMs. We also conducted analysis, additional evaluations of few-shot inference and prompt-injection attacks to find that (1) Thinking models have the best performance (with GPT-5.4 scoring 0.98 (0.03) and 0.71 (0.22) on average as F1 (and confidence interval at 95%) on the validation and test sets, respectively), (2) few-shot examples improve performance significantly, (3) generally capable frontier models find the task challenging, (4) merely scaling model parameters does not necessarily lead to improved performance, as SLMs outperformed some LLMs, and (5) prompt-injection attacks degrade performance. We note that BatteryPass-12K, though limited to real pilot samples, may be useful for other known or emerging tasks in the battery domain, e.g. lifecycle reasoning. We publicly release the dataset under a permissive licence (CC-BY-4.0).
[NLP-72] DeepTutor: Towards Agent ic Personalized Tutoring
【速读】: 该论文旨在解决当前教育领域中大型语言模型(Large Language Models, LLMs)在个性化辅导方面的局限性:传统教学系统依赖静态预训练知识,无法适应个体学习者差异;而现有基于检索增强生成(Retrieval-Augmented Generation, RAG)的系统则难以提供有针对性且引导性的反馈。其解决方案的核心是提出 DeepTutor——一个原生代理(agent-native)的开源框架,通过一个共享的个性化基础结构(personalization substrate)实现多维度个性化能力。关键创新在于:1)设计混合个性化引擎,融合静态知识锚定与动态多分辨率记忆机制,将交互历史提炼为持续演化的学习者画像;2)构建闭合式教学循环,双向耦合引用支撑的问题求解与难度校准的问题生成;3)引入 TutorBot 作为主动式多智能体层,以可扩展技能和统一多通道接口实现跨平台一致体验。该方案不仅提升了个性化教学质量,同时保持了基础代理推理能力,为下一代 AI 驱动的个性化教学系统提供了新范式。
链接: https://arxiv.org/abs/2604.26962
作者: Bingxi Zhao,Jiahao Zhang,Xubin Ren,Zirui Guo,Tianzhe Chu,Yi Ma,Chao Huang
机构: University of Hong Kong (香港大学); Beijing Jiaotong University (北京交通大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages, 7 figures, 7 tables. Code available at this https URL
Abstract:Education represents one of the most promising real-world applications for Large Language Models (LLMs). However, conventional tutoring systems rely on static pre-training knowledge that lacks adaptation to individual learners, while existing RAG-augmented systems fall short in delivering personalized, guided feedback. To bridge this gap, we present DeepTutor, an agent-native open-source framework for personalized tutoring where every feature shares a common personalization substrate. We propose a hybrid personalization engine that couples static knowledge grounding with dynamic multi-resolution memory, distilling interaction history into a continuously evolving learner profile. Moreover, we construct a closed tutoring loop that bidirectionally couples citation-grounded problem solving with difficulty-calibrated question generation. The personalization substrate further supports collaborative writing, multi-agent deep research, and interactive guided learning, enabling cross-modality coherence. To move beyond reactive interfaces, we introduce TutorBot, a proactive multi-agent layer that deploys tutoring capabilities through extensible skills and unified multi-channel access, providing consistent experience across platforms. To better evaluate such tutoring systems, we construct TutorBench, a student-centric benchmark with source-grounded learner profiles and a first-person interactive protocol that measures adaptive tutoring from the learner’s perspective. We further evaluate foundational agentic reasoning abilities across five authoritative benchmarks. Experiments show that DeepTutor improves personalized tutoring quality while maintaining general agentic reasoning abilities. We hope DeepTutor provides unique insights into next-generation AI-powered and personalized tutoring systems for the community.
[NLP-73] Universal statistical laws governing culinary design
【速读】: 该论文试图解决的问题是:传统食谱是否遵循类似于语言或其他符号系统中的统计规律。通过分析涵盖全球 cuisines 的大规模食谱语料库,并利用先进的命名实体识别算法将食谱标注为食材、烹饪技法、厨具及其他属性,研究发现食材使用呈现类似Zipf定律的秩频缩放关系,烹饪多样性随语料库规模增长呈次线性增长(符合Heaps定律),且食谱复杂度符合Menzerath-Altmann型关系——即组成单元数量与其平均信息量之间存在负相关。此外,宏营养素浓度在不同食谱中表现出对数正态分布特征。解决方案的关键在于构建基于优先重用(preferential reuse)、受限采样(constrained sampling)和渐进修改(incremental modification)的最小生成模型,这些模型能够复现上述统计规律,表明跨文化食谱结构由简单且受约束的生成过程塑造,从而确立食谱作为组合性符号系统的地位。
链接: https://arxiv.org/abs/2604.28021
作者: Ganesh Bagler,Gopal Krishna Tewari,Aditya Raj Yadav,Akshat Singh,Pranay Bansal,Ujjval Dargar,Mansi Goel,Madhvi Kumari Sinha
机构: 未知
类目: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
备注: 48 Pages (28 Pages of Main Manuscript + Supplementary Information), 4 Main Figures, 6 Extended Data Figures
Abstract:Cooking is a cultural expression of human creativity that transcends geography and time through the orchestration of ingredients and techniques, much like languages do through words and syntax. Yet, beneath the apparent diversity of culinary traditions, whether recipes obey statistical laws comparable to those of other symbolic systems remains unknown. Here we analyze a large corpus of traditional recipes spanning global cuisines, annotated using a state-of-the-art named entity recognition algorithm into ingredients, cooking techniques, utensils, and other culinary attributes. We find that ingredient usage exhibits Zipf-like rank-frequency scaling, that culinary diversity grows sublinearly with corpus size in accordance with Heaps’ law, and that recipe complexity follows Menzerath-Altmann-type relations between the number and average information of constituent units. Consistent with observations in packaged foods, macronutrient concentrations across recipes also display a log-normal signature. Minimal generative models based on preferential reuse, constrained sampling, and incremental modification recapitulate these regularities, suggesting generic processes that shape recipe architecture across cultures. Together, these findings establish recipes as a compositional symbolic system in which complex structure emerges from simple, constrained generative processes.
信息检索
[IR-0] Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing SIGIR2026
【速读】:该论文旨在解决多向量检索模型(multivector retrieval models)在部署时面临的高计算与内存开销问题。现有基于k-means聚类的方法虽能实现压缩与高效检索,但其扩展性差,且在训练中偏向高频词而忽视稀有但具有判别性的低频词。解决方案的关键在于提出TACHIOM系统,该系统通过利用token级别的结构,在聚类阶段引入考虑token分布的中心点分配机制,从而实现百万级中心点的高效聚类;同时结合基于图的中心点索引与优化的乘积量化(Product Quantization)布局,在最终打分阶段避免昂贵的token级计算,显著提升检索效率。实验表明,TACHIOM相较传统k-means聚类提速达247倍,检索速度比最先进系统快9.8倍,同时保持或优于原有效果。
链接: https://arxiv.org/abs/2604.28142
作者: Silvio Martinico,Franco Maria Nardini,Cosimo Rulli,Rossano Venturini
机构: University of Pisa (比萨大学); ISTI–CNR (意大利国家研究委员会信息科学与技术研究所)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, SIGIR 2026
Abstract:Multivector retrieval models achieve state-of-the-art effectiveness through fine-grained token-level representations, but their deployment incurs substantial computational and memory costs. Current solutions, based on the well-known k-means clustering algorithm, group similar vectors together to enable both effective compression and efficient retrieval. However, standard k-means scales poorly with the number of clusters and dataset size, and favours frequent tokens during training while underrepresenting rare, discriminative ones. In this work, we introduce TACHIOM, a multivector retrieval system that exploits token-level structure to significantly accelerate both clustering and retrieval. By accounting for tokens’ distribution during centroid allocation, TACHIOM easily scales to millions of centroids, enabling highly accurate document scoring using only centroids, avoiding expensive token-level computation. TACHIOM combines a graph-based index over centroids with an optimized Product Quantization layout for efficient final scoring. Experiments on MS-MARCOv1 and LoTTE show that TACHIOM achieves up to 247\times faster clustering than k-means and up to 9.8\times retrieval speedup over state-of-the-art systems while maintaining comparable or superior effectiveness.
[IR-1] Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在Text-to-SQL生成任务中部署时面临的挑战,尤其是在复杂或未见过的数据库模式下,由于准确率不稳定及生成无效SQL语句所带来的风险。其核心解决方案是提出Template Constrained Decoding (TeCoD),关键在于利用标注工作负载中查询模式的重复性,将历史自然语言到SQL(NL-SQL)配对转换为可复用模板,并引入一个基于微调自然语言推理模型的模板选择模块以高效匹配或排除查询;随后通过一种新颖的分段式语法约束解码策略,在保证SQL语法有效性的同时提升生成效率,从而显著提高执行准确率并降低延迟。
链接: https://arxiv.org/abs/2604.28028
作者: Smit Jivani,Sarvam Maheshwari,Sunita Sarawagi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注: Project Code: this https URL
Abstract:Large language models (LLMs) have revolutionized Text-to-SQL generation, allowing users to query structured data using natural language with growing ease. Yet, real-world deployment remains challenging, especially in complex or unseen schemas, due to inconsistent accuracy and the risk of generating invalid SQL. We introduce Template Constrained Decoding (TeCoD), a system that addresses these limitations by harnessing the recurrence of query patterns in labeled workloads. TeCoD converts historical NL-SQL pairs into reusable templates and introduces a robust template selection module that uses a fine-tuned natural language inference model to match or reject queries efficiently. Once the template is selected, TeCoD enforces it during SQL generation through grammar-constrained decoding, implemented via a novel partitioned strategy that ensures both syntactic validity and efficiency. Together, these components yield up to 36% higher execution accuracy than in-context learning (ICL) and 2.2x lower latency on matched queries.
[IR-2] SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions
【速读】:该论文旨在解决交互式信息检索(Interactive Information Retrieval, IIR)领域中用户模拟器(User Simulator)评估缺乏标准化工具的问题。当前研究常将行为真实性(behavioral realism,即模拟行为与真实用户行为的匹配度)与测试可靠性(tester reliability,即模拟器能否稳定生成有效的系统排名)混为一谈,而二者在理论上独立且可能存在冲突。论文提出 SimEval-IR,一个开源工具包和基准套件,其关键创新在于:(1) 定义统一的会话结构以整合会话搜索与对话式交互,并提供验证过的数据适配器和显式的损失计算机制;(2) 设计三个可执行基准,分别评估行为真实性、基于 RATE 方法的测试可靠性以及两者之间的关联性分析;(3) 在四个真实数据集(涵盖两种语言和四种模拟器类别)上提供基线结果。核心发现表明,现有主流的行为真实性检验方法——分类器-判别器“类人度”检测(classifier-discriminator “human-likeness” check)对系统排名有效性几乎无预测能力(r=+0.09, n=48),而点击深度距离和会话嵌入的 Fréchet 距离则展现出更强的相关性(|r|=0.43 和 0.40, p≤0.005),从而揭示了更可靠的评估指标方向。
链接: https://arxiv.org/abs/2604.27878
作者: Saber Zerhoudi
机构: University of Passau (帕绍大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:User simulators are increasingly central to interactive information retrieval, yet the community lacks standardized evaluation tools. Simulators serve two objectives, behavioral realism (matching real user behavior) and tester reliability (producing valid system rankings), and these are often conflated despite being distinct and sometimes conflicting. We present SimEval-IR, an open-source toolkit and benchmark suite that makes this distinction measurable. SimEval-IR provides: (1) a canonical session schema unifying session search and conversational interactions, with validated dataset adapters and explicit loss accounting; (2) three executable benchmarks covering behavioral realism, tester reliability with RATE-style estimation, and an analysis linking the two; and (3) baseline results across four real datasets in two languages and four simulator families. Our key finding: the classifier-discriminator ‘‘human-likeness’’ check, the dominant realism test in the literature, has essentially no pooled predictive power for system-ranking validity ( r=+0.09 , n=48 ), while marginal click-depth distance and Fréchet distance over session embeddings give a much stronger signal ( |r|=0.43 and 0.40 , p\leq0.005 ). SimEval-IR is released with all configurations and scripts to reproduce the reported analysis.
[IR-3] NeocorRAG : Less Irrelevant Information More Explicit Evidence and More Effective Recall via Evidence Chains WWW2026
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中一个关键问题:尽管检索性能(如 Recall@5)不断提升,但下游推理准确率并未同步显著提升,即检索质量与推理效果之间存在脱节。作者指出,现有方法普遍忽视了对检索质量的系统性优化,导致在召回率和检索质量之间形成权衡困境。解决方案的关键在于提出一套全面的检索质量优化准则,并设计了 NeocorRAG 框架,通过系统性挖掘和利用证据链(Evidence Chains)实现检索质量的协同优化:首先采用创新的激活搜索算法构建精炼候选空间,再通过约束解码确保精确证据链生成,最终以生成的证据链指导检索优化过程,从而在保持高召回率的同时大幅提升推理准确性。该方法无需训练,且在多个基准测试上实现了 SOTA 性能,同时token消耗低于同类方法的20%。
链接: https://arxiv.org/abs/2604.27852
作者: Shiyao Peng,Qianhe Zheng,Zhuodi Hao,Zichen Tang,Rongjin Li,Qing Huang,Jiayu Huang,Jiacheng Liu,Yifan Zhu,Haihong E
机构: Beijing University of Posts and Telecommunications(北京邮电大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to WWW 2026
Abstract:Although precise recall is a core objective in Retrieval-Augmented Generation (RAG), a critical oversight persists in the field: improvements in retrieval performance do not consistently translate to commensurate gains in downstream reasoning. To diagnose this gap, we propose the Recall Conversion Rate (RCR), a novel evaluation metric to quantify the contribution of retrieval to reasoning accuracy. Our quantitative analysis of mainstream RAG methods reveals that as Recall@5 improves, the RCR exhibits a near-linear decay. We identify the neglect of retrieval quality in these methods as the underlying cause. In contrast, approaches that focus solely on quality optimization often suffer from inferior recall performance. Both categories lack a comprehensive understanding of retrieval quality optimization, resulting in a trade-off dilemma. To address these challenges, we propose comprehensive retrieval quality optimization criteria and introduce the NeocorRAG framework. This framework achieves holistic retrieval quality optimization by systematically mining and utilizing Evidence Chains. Specifically, NeocorRAG first employs an innovative activated search algorithm to obtain a refined candidate space. Then it ensures precise evidence chain generation through constrained decoding. Finally, the retrieved set of evidence chains guides the retrieval optimization process. Evaluated on benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, NeocorRAG achieves SOTA performance on both 3B and 70B parameter models, while consuming less than 20% of tokens used by comparable methods. This study presents an efficient, training-free paradigm for RAG enhancement that effectively optimizes retrieval quality while maintaining high recall. Our code is released at this https URL.
[IR-4] How Generative AI Disrupts Search: An Empirical Study of Google Search Gemini and AI Overviews SIGIR2026 SIGIR
【速读】:该论文旨在解决生成式 AI(Generative AI)在网页搜索中的整合如何改变传统搜索引擎的信息检索与呈现方式这一问题。其核心发现表明,生成式搜索(如 Google 的 AI Overview, AIO)相较于传统搜索,在结果来源、一致性及对网站可见性的影响上存在显著差异:AIO 更倾向于展示谷歌自有内容而非主流机构网站,且对被屏蔽爬虫的网站检索率明显降低;同时,AIO 对相同查询的响应一致性较差,且对查询微小改动敏感度更高。解决方案的关键在于构建一个包含 11,500 条真实用户查询的公开基准数据集,并通过对比传统搜索、AIO 和 Gemini Flash 2.5 的结果,系统揭示生成式搜索带来的结构性变化,从而为未来研究生成式引擎优化(Generative Engine Optimization, GEO)和生态可持续性提供实证基础。
链接: https://arxiv.org/abs/2604.27790
作者: Riley Grossman,Songjiang Liu,Michael K. Chen,Mike Smith,Cristian Borcea,Yi Chen
机构: New Jersey Institute of Technology (新泽西理工学院); Nanyang Technological University (南洋理工大学); Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Paper Accepted to ACM SIGIR 2026 (49th International ACM SIGIR Conference on Research and Development in Information Retrieval)
Abstract:Generative AI is being increasingly integrated into web search for the convenience it provides users. In this work, we aim to understand how generative AI disrupts web search by retrieving and presenting the information and sources differently from traditional search engines. We introduce a public benchmark dataset of 11,500 user queries to support our study and future research of generative search. We compare the search results returned by Google’s search engine, the accompanying AI Overview (AIO), and Gemini Flash 2.5 for each query. We have made several key findings. First, we find that for 51.5% of representative, real-user queries, AIOs are generated, and are displayed above the organic search results. Controversial questions frequently result in an AIO. Second, we show that the retrieved sources are substantially different for each search engine (0.2 average Jaccard similarity). Traditional Google search is significantly more likely to retrieve information from popular or institutional websites in government or education, while generative search engines are significantly more likely to retrieve Google-owned content. Third, we observe that websites that block Google’s AI crawler are significantly less likely to be retrieved by AIOs, despite having access to the content. Finally, AIOs are less consistent when processing two runs of the same query, and are less robust to minor query edits. Our findings have important implications for understanding how generative search impacts website visibility, the effectiveness of generative engine optimization techniques, and the information users receive. We call for revenue frameworks to foster a sustainable and mutually beneficial ecosystem for publishers and generative search providers.
[IR-5] Position-Aware Drafting for Inference Acceleration in LLM -Based Generative List-Wise Recommendation
【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的生成式列表推荐(Generative List-wise Recommendation)中解码过程串行导致延迟高的问题。现有推测解码(Speculative Decoding, SD)方法在推荐场景下因未考虑每个项目由多个语义ID标记组成且存在分隔符,且对所有token一视同仁,从而限制了加速效果。解决方案的关键在于提出PAD-Rec(Position-Aware Drafting for Generative Recommendation),其通过引入两个互补信号增强小模型的草案生成能力:一是项目位置嵌入(Item position embeddings),显式编码token在其所属item内的槽位信息,提升结构感知;二是步骤位置嵌入(Step position embeddings),捕捉推测深度带来的不确定性变化,使模型能自适应调整提案质量。此外,通过可学习的门控机制融合这些信号与基础特征,实现轻量级、易集成且推理开销几乎不变的改进,在四个真实数据集上实现了最高达3.1倍的墙钟时间加速,并较强基线平均提升约5%的速度收益,同时保持推荐质量稳定。
链接: https://arxiv.org/abs/2604.27747
作者: Jiaju Chen,Chongming Gao,Chenxiao Fan,Haoyan Liu,Qingpeng Cai,Peng Jiang,Xiangnan He
机构: University of Science and Technology of China(中国科学技术大学); Zhongguancun Academy(中关村学院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token’s semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD’s speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the within-item slot of each token, strengthening structural awareness. Step position embeddings encode the draft step, allowing the model to adapt to depth-dependent uncertainty and improve proposal quality. To harmonize these signals with base features, we add simple gates: a learnable coefficient for item slots and a context-driven gate for draft steps. The module is trainable, easy to integrate with standard draft models, and adds negligible inference overhead. Extensive experiments on four real-world datasets show up to 3.1x wall-clock speedup and about 5% average wall-clock speedup gain over strong SD baselines, while largely preserving recommendation quality.
[IR-6] One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness ACL2026
【速读】:该论文旨在解决跨模态嵌入空间中的“hubness问题”(hubness problem),即某些嵌入向量(hub embeddings)在高维空间中与大量无关样本距离过近,从而对信息检索和自动评估指标等应用造成实际威胁。针对这一问题,论文提出了一种识别hub嵌入及其对应hub文本的方法,其关键在于通过分析跨模态相似性得分的异常分布,定位出一个单一的hub文本,该文本在多个图像上能获得与人类撰写参考描述相当或更高的相似度分数,从而揭示跨模态编码器在语义对齐上的脆弱性。
链接: https://arxiv.org/abs/2604.27674
作者: Hiroyuki Deguchi,Katsuki Chousa,Yusuke Sakai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注: Accepted at ACL2026 (main)
Abstract:The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and may pose a practical threat for purposes such as information retrieval and automatic evaluation metrics. In particular, since cross-modal similarity between text and images cannot be calculated by direct comparisons, such as string matching, cross-modal encoders that project different modalities into a shared space are helpful for various cross-modal applications, and thus, the existence of hubs may pose practical threats. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation in MSCOCO and nocaps along with image-to-text retrieval tasks in MSCOCO and Flickr30k showed that our method can identify a single hub text that unreasonably achieves comparable or higher similarity scores than human-written reference captions in many images, thereby revealing the vulnerabilities in cross-modal encoders.
[IR-7] Purifying Multimodal Retrieval: Frag ment-Level Evidence Selection for RAG
【速读】:该论文旨在解决多模态检索增强生成(Multimodal Retrieval-Augmented Generation, MRAG)框架中因将检索到的证据视为不可分割的整体文档而导致的噪声干扰问题。现有方法假设文档内所有内容均等有用,但在实际应用中,仅部分片段与查询相关,其余内容可能引入冗余甚至误导信息,从而降低生成质量。解决方案的关键在于将MRAG重构为细粒度证据选择问题:提出FES-RAG(Fragment-level Evidence Selection for RAG)框架,通过分解多模态文档为文本级句子片段和视觉级区域片段,实现原子级别证据的选择;并引入片段信息增益(Fragment Information Gain, FIG)作为衡量每个片段对模型生成置信度边际贡献的指标,据此训练轻量级选择器,以低推理开销精准筛选高价值片段,显著提升生成准确性与连贯性,同时减少上下文长度。
链接: https://arxiv.org/abs/2604.27600
作者: Xihang Wang,Zihan Wang,Chengkai Huang,Cao Liu,Ke Zeng,Quan Z. Sheng,Lina Yao
机构: Zhejiang University (浙江大学); Meituan LongCat Interaction Team (美团长猫互动团队); University of New South Wales (新南威尔士大学); Macquarie University (麦考瑞大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Multimodal Retrieval-Augmented Generation (MRAG) is widely adopted for Multimodal Large Language Models (MLLMs) with external evidence to reduce hallucinations. Despite its success, most existing MRAG frameworks treat retrieved evidence as indivisible documents, implicitly assuming that all content within a document is equally informative. In practice, however, sometimes only a small fraction of a document is relevant to a given query, while the remaining content introduces substantial noise that may lead to performance degradation. We address this fundamental limitation by reframing MRAG as a fine-grained evidence selection problem. We propose Fragment-level Evidence Selection for RAG (FES-RAG), a framework that selects atomic multimodal fragments rather than entire documents as grounding evidence. FES-RAG decomposes retrieved multimodal documents into sentence-level textual fragments and region-level visual fragments, enabling precise identification of evidence that directly supports generation. To guide fragment selection, we introduce Fragment Information Gain (FIG), a principled metric that measures the marginal contribution of each fragment to the MLLM’s generation confidence. Based on FIG, we distill fragment-level utility judgments from a high-capacity MLLM into a lightweight selector, achieving accurate evidence selection with low inference overhead. Experiments on the M2RAG benchmark show that FES-RAG consistently outperforms state-of-the-art document-level MRAG methods, achieving up to 27 percent relative improvement in CIDEr. By selecting fewer yet more informative fragments, our approach substantially reduces context length while improving factual accuracy and generation coherence.
[IR-8] One Pass Any Order: Position-Invariant Listwise Reranking for LLM -Based Recommendation SIGIR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推荐系统重排序任务中因候选集排列顺序敏感而导致的排名不稳定问题。具体而言,decoder-only架构的LLM在处理列表时会受到输入序列顺序的影响,即相同候选集的不同排列可能导致不同的评分和最终排序,这与推荐系统本质上关注集合属性(set-based nature)相悖。解决方案的关键在于提出InvariRank框架,其核心创新包括:通过结构化注意力掩码阻断跨候选项注意力机制以实现排列不变性,并借助旋转位置编码(Rotary Positional Embeddings, RoPE)下的共享位置框架消除位置信息引发的评分偏移;同时结合列表级学习排序目标,在单次前向传播中完成所有候选项评分,避免了传统方法依赖多轮排列训练的复杂性。实验表明,InvariRank在保持竞争性排序效果的同时显著提升了排名稳定性。
链接: https://arxiv.org/abs/2604.27599
作者: Ethan Bito,Yongli Ren,Estrid He
机构: RMIT University(皇家墨尔本理工大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at SIGIR 2026
Abstract:Large language models (LLMs) are increasingly used for recommendation reranking, but their listwise predictions can depend on the order in which candidates are presented. This creates a mismatch between the set-based nature of recommendation and the sequence-based computation of decoder-only LLMs, where permuting an otherwise identical candidate set can change item scores and final rankings. Such order sensitivity makes LLM-based rerankers difficult to rely on, since rankings may reflect prompt serialization rather than user preference. We propose InvariRank, a permutation-invariant listwise reranking framework that addresses this dependence at the architectural level. InvariRank blocks cross-candidate attention with a structured attention mask and negates position-induced scoring changes through shared positional framing under Rotary Positional Embeddings (RoPE). Combined with a listwise learning-to-rank objective, the model scores all candidates in a single forward pass, avoiding permutation-based invariance training objectives that require multiple permutations of a candidate set. Experiments on recommendation benchmarks show that InvariRank maintains competitive ranking effectiveness while producing stable rankings across candidate permutations. The results suggest that architectural invariance is a practical route to reliable and efficient LLM-based recommendation reranking. The source code is at this https URL.
[IR-9] Reproducing Adaptive Reranking for Reasoning -Intensive IR
【速读】:该论文旨在解决传统检索-重排序(retrieve–rerank)流水线中存在的召回上限(bounded recall)问题,该问题源于第一阶段检索器的能力限制。现有方法通常通过改进第一阶段检索器来提升召回率,但这种方法在处理需要复杂推理的查询时会带来显著的训练和推理开销。本文的关键解决方案是复现并验证Graph-based Adaptive Reranking (GAR) 方法在BRIGHT这一推理密集型检索基准上的有效性,GAR通过在文档图中迭代探索的方式修改重排序过程,而非依赖更强的检索器;实验表明,重排序模型提供的信号质量对识别额外相关文档至关重要,且GAR能在几乎不增加计算负担的前提下显著提升推理密集型任务的检索效果,从而实现更实用的检索系统部署。
链接: https://arxiv.org/abs/2604.27577
作者: Mandeep Rathee,V Venktesh,Sean MacAvaney,Avishek Anand
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 7 figures, 11 pages
Abstract:The classical cascading pipeline of retrieve–rerank suffers from a bounded recall problem, stemming from limitations of the first-stage retriever. Most current approaches address the bounded recall problem by improving the first-stage retriever, but this incurs substantial training and inference costs, especially to handle queries that require substantial reasoning. To circumvent the computational costs of reasoning-based retrievers, we replicate the findings of GAR, Graph-based Adaptive Reranking, on the BRIGHT reasoning-intensive retrieval benchmark. GAR addresses the bounded recall problem by modifying the reranking process itself through iterative exploration of a corpus graph, but it was previously only tested on models designed for topical and question-answering-style queries. Hence, reproduce GAR in reasoning-intensive settings with reasoning and non-reasoning reranking models. We observe that the quality of the reranker’s signal plays an important role in identifying additional relevant documents within the corpus graph. Overall, we find that GAR boosts the effectiveness of reasoning-intensive retrieval across a variety of models while contributing minimally to computational overheads. Ultimately, this work enables more practical deployment of retrieval systems that can address reasoning-intensive queries.
[IR-10] A Reproducibility Study of LLM -Based Query Reformulation
【速读】:该论文旨在解决当前关于大语言模型(Large Language Models, LLMs)在信息检索中用于查询改写(query reformulation)的研究结果缺乏可复现性的问题。由于现有研究多在异构实验条件下进行,导致难以判断哪些性能提升是稳定可靠的,哪些依赖于特定实现细节。为此,作者构建了一个统一且严格受控的实验框架,系统性地比较了十种代表性LLM-based查询改写方法,涵盖两类LLM架构、两种参数规模、三种检索范式(词法检索、学习稀疏检索和稠密检索),并在九个基准数据集(包括TREC Deep Learning和BEIR)上进行评估。其关键解决方案在于通过标准化的实验设计与公开工具链(QueryGym),实现了方法的透明复现与持续对比,从而揭示了改写效果对检索范式的强依赖性,以及大模型规模与下游性能之间并非线性正相关的结论,为后续研究提供了可信赖的基准和方向。
链接: https://arxiv.org/abs/2604.27421
作者: Amin Bigdeli,Radin Hamidi Rad,Hai Son Le,Mert Incesu,Negar Arabzadeh,Charles L. A. Clarke,Ebrahim Bagheri
机构: University of Waterloo (滑铁卢大学); Mila – Quebec AI Institute (魁北克人工智能研究所); Toronto Metropolitan University (多伦多理工大学); University of Toronto (多伦多大学); University of California, Berkeley (加州大学伯克利分校)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are typically obtained under heterogeneous experimental conditions, making it difficult to assess which findings are reproducible and which depend on specific implementation choices. In this work, we present a systematic reproducibility and comparative study of ten representative LLM-based query reformulation methods under a unified and strictly controlled experimental framework. We evaluate methods across two architectural LLM families at two parameter scales, three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets spanning TREC Deep Learning and BEIR. Our results show that reformulation gains are strongly conditioned on the retrieval paradigm, that improvements observed under lexical retrieval do not consistently transfer to neural retrievers, and that larger LLMs do not uniformly yield better downstream performance. These findings clarify the stability and limits of reported gains in prior work. To enable transparent replication and ongoing comparison, we release all prompts, configurations, evaluation scripts, and run files through QueryGym, an open-source reformulation toolkit with a public leaderboard.\footnotethis https URL
[IR-11] From Unstructured to Structured: LLM -Guided Attribute Graphs for Entity Search and Ranking
【速读】:该论文旨在解决电子商务场景下实体搜索(Entity Search)中因品类和上下文差异导致的产品相似性难以准确建模的问题,传统基于嵌入的方法往往无法捕捉细粒度的、上下文相关的属性相关性。其解决方案的关键在于提出一种两阶段框架:首先在离线阶段利用大语言模型(Large Language Model, LLM)从非结构化文本中提取结构化产品属性,并构建具有品类感知(category-aware)模式的可复用属性图;其次在在线阶段通过基于图结构的LLM推理对候选实体进行排序,而非直接处理原始文本,从而将每条产品的token消耗降低57%,同时显著提升排序精度,在零样本场景下平均精度提升超过5%,且跨品类泛化能力强,具备实际部署潜力。
链接: https://arxiv.org/abs/2604.27410
作者: Yilun Zhu,Nikhita Vedula,Shervin Malmasi
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Entity search, i.e., finding the most similar entities to a query entity, faces unique challenges in e-commerce, where product similarity varies across categories and contexts. Traditional embedding-based approaches often struggle to capture nuanced context-specific attribute relevance. In this paper, we present a two-stage approach combining Large Language Model (LLM)-driven attribute graph construction with graph-aware LLM ranking. In the offline stage, we extract structured product attributes from unstructured text, and construct a reusable attribute graph with category-aware schemas. In the online stage, we rank retrieved candidates by reasoning over this structured representation rather than raw text, reducing per-product token usage by 57% while improving ranking precision. Experiments show that our approach outperforms multiple baselines under zero-shot scenarios, achieving a over 5% improvement in average precision without requiring training data, generalizes robustly across diverse product categories, and shows immense potential for real-world deployment.
[IR-12] oward Autonomous SOC Operations: End-to-End LLM Framework for Threat Detection Query Generation and Resolution in Security Operations
【速读】:该论文旨在解决安全运营中心(Security Operations Center, SOC)在应对海量威胁情报、异构SIEM平台以及耗时的手动事件分析流程中所面临的 operational challenges(操作挑战)。其核心解决方案是构建一个端到端的威胁管理框架,关键在于三个模块的协同:一是基于集成学习的检测模块,融合多个大语言模型(Large Language Models, LLMs)以提升检测准确率并控制误报率;二是引入SQM(Syntax Query Metadata)架构,通过语法约束、元数据检索与文档引导提示机制自动生成适用于IBM QRadar和Google SecOps的可执行查询语句,显著优于基线LLM性能;三是利用SQM提取的证据增强事件处置建议生成,使代码预测准确率从78.3%提升至90.0%,整体推荐质量达8.70分。该框架在实际SOC环境中将平均事件研判时间从数小时缩短至10分钟以内,验证了领域约束下的LLM架构结合检索增强技术可在高可靠性与高效率要求下实现规模化部署。
链接: https://arxiv.org/abs/2604.27321
作者: Md Hasan Saju,Akramul Azim
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Security Operations Centers (SOCs) face mounting operational challenges. These challenges come from increasing threat volumes, heterogeneous SIEM platforms, and time-consuming manual triage workflows. We present an end-to-end threat management framework that integrates ensemble-based detection, syntax-constrained query generation, and retrieval-augmented resolution support to automate critical security workflows. Our detection module evaluates both traditional machine learning classifiers and large language models (LLMs), then combines the three best-performing LLMs to create an ensemble model, achieving 82.8% accuracy while maintaining 0.120 false positive rate on SIEM logs. We introduce the SQM (Syntax Query Metadata) architecture for automated evidence collection. It uses platform-specific syntax constraints, metadata-based retrieval, and documentation-grounded prompting to generate executable queries for IBM QRadar and Google SecOps. SQM achieves a BLEU score of 0.384 and a ROUGE-L score of 0.731. These results are more than twice as good as the baseline LLM performance. For incident resolution and recommendation generation, we demonstrate that integrating SQM-derived evidence improves resolution code prediction accuracy from 78.3% to 90.0%, with an overall recommendation quality score of 8.70. In production SOC environments, our framework reduces average incident triage time from hours to under 10 minutes. This work demonstrates that domain-constrained LLM architectures with retrieval augmentation can meet the strict reliability and efficiency requirements of operational security environments at scale.
[IR-13] NuggetIndex: Governed Atomic Retrieval for Maintainable RAG
【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在评估与实际检索对象之间存在“单元不匹配”问题,即标准实现通常检索文档片段或静态命题,无法有效处理知识演化、过时信息及来源冲突,从而影响系统在动态语料中的可靠性与准确性。解决方案的关键在于提出 NuggetIndex,一种以原子化信息单元(称为 nuggets)为存储单位的检索系统,每个 nugget 记录维护证据链接、时间有效性区间和生命周期状态;通过在排序前过滤无效或废弃 nugget,确保仅使用时效性强且一致的信息,从而提升检索的时效性与准确性。实验表明,相比基于段落和未管理命题的基线方法,NuggetIndex 在 nugget 召回率上提升 42%,时间正确率提高 9 个百分点,冲突率降低 55%,同时因紧凑的 nugget 格式减少生成器输入长度 64%,适合轻量级部署。
链接: https://arxiv.org/abs/2604.27306
作者: Saber Zerhoudi,Michael Granitzer,Jelena Mitrovic
机构: University of Passau (帕绍大学); Interdisciplinary Transformation University Austria (跨学科转型大学奥地利)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-augmented generation (RAG) systems are frequently evaluated via fact-based metrics, yet standard implementations retrieve passages or static propositions. This unit mismatch between evaluation and retrieval objects hinders maintenance when corpora evolve and fails to capture superseded facts or source disagreements. We propose NuggetIndex, a retrieval system that stores atomic information units as managed records, so called nuggets. Each record maintains links to evidence, a temporal validity interval, and a lifecycle state. By filtering invalid or deprecated nuggets prior to ranking, the system prevents the inclusion of outdated information. We evaluate the approach using a nuggetized MS MARCO subset, a temporal Wikipedia QA dataset, and a multi-hop QA task. Against passage and unmanaged proposition retrieval baselines, NuggetIndex improves nugget recall by 42%, increases temporal correctness by 9 percentage points without the recall collapse observed in time-filtered baselines, and reduces conflict rates by 55%. The compact nugget format reduces generator input length by 64% while enabling lightweight index structures suitable for browser-based and resource-constrained deployment. We release our implementation, datasets, and evaluation scripts
[IR-14] RAQG-QPP: Query Performance Prediction with Retrieved Query Variants and Retrieval Augmented Query Generation
【速读】:该论文旨在解决**查询性能预测(Query Performance Prediction, QPP)**在神经排序模型(neural rankers)上表现不佳的问题,尤其是现有无监督QPP方法在处理复杂语义匹配时效果有限。其核心挑战在于如何有效利用查询变体(Query Variants, QVs)来提升预测准确性,而传统基于词项扩展生成的QVs常导致不连贯、幻觉或偏离主题的问题。解决方案的关键在于:首先,从历史查询日志中检索与当前查询语义相近的查询作为QVs;其次,进一步引入大语言模型(Large Language Models, LLMs)对这些检索到的QVs进行条件化生成,从而增强QVs的质量和多样性,最终显著提升QPP在神经排序模型上的性能——实验表明,所提出的RAQG方法相比最优现有QV-based方法在MonoT5等模型上性能提升高达30%。
链接: https://arxiv.org/abs/2604.27244
作者: Fangzheng Tian,Debasis Ganguly,Craig Macdonald
机构: University of Glasgow(格拉斯哥大学)
类目: Information Retrieval (cs.IR)
备注: Accepted manuscript. 27 pages, 8 figures, 5 tables. To appear in ACM Transactions on Information Systems
Abstract:Query Performance Prediction (QPP) estimates the retrieval quality of ranking models without the use of any human-assessed relevance judgements, and finds applications in query-specific selective decision making to improve overall retrieval effectiveness. Although unsupervised QPP approaches are effective for lexical retrieval models, they usually perform weaker for neural rankers. Recent work shows that leveraging query variants (QVs), i.e., queries with potentially similar information needs to a given query, can enhance unsupervised QPP accuracy. However, existing QV-based prediction methods rely on query variants generated by term expansion of the input query, which is likely to yield incoherent, hallucinatory and off-topic QVs. In this paper, we propose to make use of queries retrieved from a log of past queries as QVs to be subsequently used for QPP. In addition to directly applying retrieved QVs in QPP, we further propose to leverage large language models (LLMs) to generate QVs conditioned on the retrieved QVs, thus mitigating the limitation of relying only on existing queries in a log. Experiments on TREC DL’19 and DL’20 show that QPP enhanced with RAQG outperform the best-performing existing QV-based prediction approach by as much as 30% on neural ranking models such as MonoT5.
[IR-15] LLM -Enhanced Topical Trend Detection at Snapchat
【速读】:该论文旨在解决在大规模短视频社交平台(如Snapchat)上自动识别新兴话题趋势的挑战,以维持内容生态系统的动态性和时效性。其核心解决方案在于构建一个端到端的系统,整合多模态话题提取(multimodal topic extraction)、时间序列突发检测(time-series burst detection)以及基于大语言模型(LLM)的归纳与增强(LLM-based consolidation and enrichment),从而实现高精度和低延迟的趋势发现。该方法首次在生产规模上实现了对短视频平台话题趋势的自动化检测,并已在全球部署,显著提升了内容新鲜度和用户体验。
链接: https://arxiv.org/abs/2604.27131
作者: Hangqi Zhao,Jay Li,Abhiruchi Bhattacharya,Cong Ni,Jason Yeung,Jinchao Ye,Kai Yang,Akshat Malu,Manish Malik
机构: Snap Inc.(Snap Inc.)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Automatic detection of topical trends at scale is both challenging and essential for maintaining a dynamic content ecosystem on social media platforms. In this work, we present a large-scale system for identifying emerging topical trends on Snapchat, one of the world’s largest short-video social platforms. Our system integrates multimodal topic extraction, time-series burst detection, and LLM-based consolidation and enrichment to enable accurate and timely trend discovery. To the best of our knowledge, this is the first published end-to-end system for topical trend detection on short-video platforms at production scale. Continuous offline human evaluation over six months demonstrates high precision in identifying meaningful trends. The system has been deployed in production at global scale and applied to downstream surfaces including content ranking and search, driving measurable improvements in content freshness and user experience.
[IR-16] A Gated Hybrid Contrastive Collaborative Filtering Recommendation
【速读】:该论文旨在解决现有推荐系统中评论感知模型在评分预测任务上优化过度,而忽视排序质量的问题,这限制了其在Top-N推荐场景下的性能表现。解决方案的关键在于提出一种基于门控混合协同过滤(Gated Hybrid Collaborative Filtering)的框架,通过自适应门控机制在编码层逐层融合语义特征与协同过滤嵌入,并引入对比学习模块以对齐语义与协同信号,从而优化潜在空间中的项目排序行为;此外,模型采用成对贝叶斯个性化排序目标进行训练,显式提升相关与非相关项目的分离度,实验表明该方法在多个数据集上显著优于当前最优的评论感知基线模型。
链接: https://arxiv.org/abs/2604.27117
作者: Eduardo Ferreira da Silva,Mayki dos Santos Oliveira,Joel Machado Pires,Denis Dantas Boaventura,Maycon Maciel Peixoto,Cassio Serafim Prazeres,Gustavo Bittencourt Figueiredo,Miriam Capretz,Frederico Araujo Durão
机构: Universidade Federal da Bahia (巴西联邦大学); University of Western Ontario (西安大略大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recommender systems increasingly incorporate textual reviews to enrich user and item representations. However, most review-aware models remain optimized for rating prediction rather than ranking quality. This misalignment limits their effectiveness in top-N recommendation scenarios, where discriminative ranking is essential. To address this gap, we propose a Gated Hybrid Collaborative Filtering framework that integrates review-derived representations into an autoencoder-based collaborative model. The architecture injects semantic signals layer-wise through an adaptive gating mechanism that dynamically balances collaborative embeddings and topic-based features during encoding. To further refine the latent space, we introduce a contrastive learning module that aligns semantic and collaborative signals. We evaluate the framework across five distinct configurations: Pure collaborative; Topic and Gated; Text and Gated; and the addition of contrastive objectives (Contrastive and Topic, and Contrastive and Text). To explicitly optimize ranking behavior, the model is trained with a pairwise Bayesian personalized ranking objective, which promotes separation between relevant and non-relevant items in the latent space. Experiments on Amazon Movies TV, IMDb, and Rotten Tomatoes demonstrate consistent improvements in hit rate @10 and normalized discounted cumulative gain @10 over state-of-the-art review-aware baselines. Results highlight the importance of controlled semantic fusion for ranking-driven recommendation.
[IR-17] Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval SIGIR2026
【速读】:该论文旨在解决传统双编码器(bi-encoder)在文档检索任务中因固定内积(inner-product)评分函数导致的表达能力不足问题,同时保持查询与文档独立编码的优势。其解决方案的关键在于引入一种基于超网络(hypernetwork)的可变评分机制——即“Hypencoder”框架,其中查询特定的神经网络(q-net)的权重由超网络根据上下文化查询嵌入动态生成,从而实现更灵活、更具表达力的相关性估计,且无需改变原有的双编码结构。
链接: https://arxiv.org/abs/2604.27037
作者: Arne Eichholtz,Yongkang Li,Jutte Vijverberg,Tobias Groot,Mohammad Aliannejadi
机构: University of Amsterdam(阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: This paper has been accepted as a reproducibility paper at SIGIR 2026
Abstract:The Hypencoder, proposed by Killingback et al., is a retrieval framework that replaces the fixed inner-product scoring function used in standard bi-encoders with a query-specific neural network (the q -net), whose weights are generated by a hypernetwork from the contextualized query embeddings. This design enables more expressive relevance estimation while preserving independent query and document encoding. In this work, we conduct a reproducibility study of the Hypencoder and extend the original analysis in three directions. Our reproduction confirms that the Hypencoder outperforms a similarly trained bi-encoder baseline on in-domain and out-of-domain benchmarks, and that the proposed efficient search algorithm substantially reduces query latency with minimal performance loss. On hard retrieval tasks, we find partial support: the Hypencoder outperforms the baseline on DL-Hard and FollowIR, but not on TREC TOT, where checkpoint incompatibility and fine-tuning sensitivity complicate full verification. Beyond reproduction, we investigate three extensions: (i)~integrating alternative pre-trained encoders into the Hypencoder framework, where we find that performance gains depend on the encoder and fine-tuning strategy; (ii)~comparing query latency against a Faiss-based bi-encoder pipeline, revealing that standard bi-encoder retrieval remains faster under both exhaustive and efficient search settings; and (iii)~evaluating adversarial robustness, where we find that the q -net’s non-linear scoring does not provide a consistent robustness disadvantage over inner-product scoring. Our code is publicly available at this https URL.
[IR-18] LUCid: Redefining Relevance For Lifelong Personalization
【速读】:该论文旨在解决当前终身个性化(lifelong personalization)系统中因依赖语义相近性来定义相关性而导致的局限性,即这些系统往往无法捕捉来自主题无关交互的关键用户信息。其解决方案的核心是提出LUCid基准,这是一个包含1,936个真实查询及其最多500个会话的历史交互数据的评测平台,用于衡量情境化用户中心相关性(situational user-centric relevance)。通过在多种模型架构上的实验表明,当需要从语义距离较远的历史中提取相关信息时,检索召回率急剧下降至接近零,响应一致性也仅维持在约50%,揭示了现有系统对相关性的编码与个性化所需的场景相关性之间存在根本性错位。LUCid为系统性评估模型是否能从过往交互中挖掘情境相关的信息提供了工具,并推动个性化向以用户为中心的相关性重新对齐。
链接: https://arxiv.org/abs/2604.26996
作者: Chimaobi Okite,Anika Misra,Joyce Chai,Rada Mihalcea
机构: University of Michigan (密歇根大学)
类目: Information Retrieval (cs.IR)
备注: first version
Abstract:Current approaches to lifelong personalization operationalize relevance through semantic proximity, causing them to miss essential user information from topically unrelated interactions. To address this gap, we introduce LUCid, a benchmark designed to measure situational user-centric relevance in personalization. The benchmark consists of 1,936 realistic queries paired with interaction histories from up to 500 sessions. Across multiple architectures, our experiments show significant performance collapse when relevant context must be surfaced from semantically distant history: retrieval recall drops to near zero on the hardest instances, and response alignment remains near 50% even for state-of-the-art models such as Gemini-3-Flash, GPT-5.4, and Claude Haiku. These results expose a fundamental mismatch between the notion of relevance encoded by current systems and the situational relevance required for personalization, with direct implications for robustness and safety when critical user attributes remain undetected. LUCid enables the systematic evaluation of whether current models can surface situationally-relevant user information from previous interactions, and serves as a step toward realigning personalization with user-centered relevance.
[IR-19] Value-Aware Product Recommendation by Customer Segmentation using a suitable High-Dimensional Similarity Measure
【速读】:该论文旨在解决产品推荐中因用户-物品数据高维度和稀疏性导致的推荐精度下降问题,同时在推荐过程中未能显式考虑每个用户和商品对整体销售额的贡献。解决方案的关键在于构建一个价值感知(value-aware)的推荐框架,通过将收入贡献编码到用户-物品矩阵中,并基于此矩阵使用合适的距离度量直接计算用户相似性,从而实现按收益导向的用户分群与个性化推荐。该方法支持三种基于收入占比、商品流行度和预期利润生成的推荐策略,有效提升了推荐结果与商业目标(如盈利能力)的一致性。
链接: https://arxiv.org/abs/2604.26983
作者: María Florencia Acosta,Rodrigo García Arancibia,Pamela Llop,Mariel Lovatto,Lucas Mansilla
机构: Santafe-Conicet (圣菲-国家科学委员会); Universidad Nacional del Litoral (国立大学); Facultad de Ingeniería Química (化学工程学院); Instituto Sinc (同步研究所)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:This paper presents a novel value-aware approach to product recommendation that simultaneously addresses the high dimensionality and sparsity of user-item data while explicitly incorporating the contribution of each product and user to overall sales revenue. The proposed framework encodes revenue contributions in the user-item matrix and computes customer similarity directly on this basis using suitable distance measures. This enables the segmentation of users according to the revenue-based similarity of their purchase baskets and supports recommendations aligned with profitability objectives. We compare conventional similarity metrics with a novel alternative tailored to high-dimensional contexts and propose three recommendation strategies based on revenue share, product popularity, and expected profit generation. The effectiveness of the proposed method is validated through simulation experiments and a real-world application using the UCI Online Retail dataset.
[IR-20] Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model
【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)服务中存在成本不透明、效率低下以及资源利用率不足的问题,尤其是RAG-as-a-Service(RaaS)模型按请求次数计费而未考虑检索到的文本片段(chunk)的相关性与质量,导致用户支付高昂费用却未必获得高价值输出。其解决方案的关键在于提出一种新型服务模式——Chunk-as-a-Service(CaaS),并引入“效用-成本在线选择算法”(Utility-Cost Online Selection Algorithm, UCOSA),该算法能够在预算约束下动态评估并选择最具效用的提示(prompt)进行文本块增强,从而显著提升单位预算下的内容相关性和整体性能。实验表明,UCOSA在性能指标(即增强提示数 × 平均相关性)上比随机选择高出约52%,且在预算利用效率方面较RaaS分别提升了140%(LB-CaaS)和86%(OB-CaaS)。
链接: https://arxiv.org/abs/2604.26981
作者: Shawqi Al-Maliki,Ammar Gharaibeh,Mohamed Rahouti,Mohammad Ruhul Amin,Mohamed Abdallah,Junaid Qadir,Ala Al-Fuqaha
机构: Hamad Bin Khalifa University (哈马德本哈利法大学); German Jordanian University (德国约旦大学); Fordham University (福特汉姆大学); Qatar University (卡塔尔大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have revolutionized the field of natural language processing. However, they exhibit some limitations, including a lack of reliability and transparency: they may hallucinate and fail to provide sources that support the generated output. Retrieval-Augmented Generation (RAG) was introduced to address such limitations in LLMs. One popular implementation, RAG-as-a-Service (RaaS), has shortcomings that hinder its adoption and accessibility. For instance, RaaS pricing is based on the number of submitted prompts, without considering whether the prompts are enriched by relevant chunks, i.e., text segments retrieved from a vector database, or the quality of the utilized chunks (i.e., their degree of relevance). This results in an opaque and less cost-effective payment model. We propose Chunk-as-a-Service (CaaS) as a transparent and cost-effective alternative. CaaS includes two variants: Open-Budget CaaS (OB-CaaS) and Limited-Budget CaaS (LB-CaaS), which is enabled by our ``Utility-Cost Online Selection Algorithm (UCOSA)‘’. UCOSA further extends the cost-effectiveness and the accessibility of the OB-CaaS variant by enriching, in an online manner, a subset of the submitted prompts based on budget constraints and utility-cost tradeoff. Our experiments demonstrate the efficacy of the proposed UCOSA compared to both offline and relevance-greedy selection baselines. In terms of the performance metric-the number of enriched prompts (NEP) multiplied by the Average Relevance (AR)-UCOSA outperforms random selection by approximately 52% and achieves around 75% of the performance of offline selection methods. Additionally, in terms of budget utilization, LB-CaaS and OB-CaaS achieve higher performance-to-budget ratios of 140% and 86%, respectively, compared to RaaS, indicating their superior efficiency.
[IR-21] 2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language
【速读】:该论文旨在解决知识图谱问答(Question Answering over Knowledge Graphs, QAKG)系统评估中存在的碎片化、不一致性和可复现性差的问题。现有基准测试通常仅关注少数指标(如查询精确匹配或答案级别的F1值),忽视了语法有效性、语义忠实性、执行正确性、结果排序质量及计算效率等关键维度。解决方案的关键在于提出一个名为t2s-metrics的开源、可扩展且统一的评估库,其核心创新是提供超过20种来自文献和实际需求的评估指标,涵盖词法、语法、语义、结构、执行和排序等多个维度,并通过模块化抽象层将指标定义与实现解耦,从而实现SPARQL查询比较与基于执行的评估的一致性、透明性和可复现性,推动QAKG领域向系统化、标准化评估迈进。
链接: https://arxiv.org/abs/2604.26971
作者: Yousouf Taghzouti(ICN, WIMMICS, Laboratoire I3S - SPARKS),Tao Jiang(ICN),Camille Juigné(WIMMICS, Laboratoire I3S - SPARKS),Benjamin Navet(ICN, WIMMICS, Laboratoire I3S - SPARKS),Fabien Gandon(WIMMICS, Laboratoire I3S - SPARKS),Franck Michel(Laboratoire I3S - SPARKS, WIMMICS),Louis-Felix Nothias(ICN)
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:The evaluation of Question Answering (QA) systems over Knowledge Graphs has historically suffered from fragmentation, inconsistency, and limited reproducibility. While significant progress has been made in semantic parsing and SPARQL query generation, evaluation methodologies remain diverse, ad hoc, and often incomparable across studies. Existing benchmarks typically focus on a small subset of metrics, such as query exact match or answer-level F1, neglecting syntactic validity, semantic faithfulness, execution correctness, results ranking quality, and computational efficiency. In this paper, we present t2s-metrics, an open-source, extensible, and unified evaluation library designed specifically for SPARQL query comparison and execution-based assessment. t2s-metrics provides a broad and extensible set of over 20 evaluation metrics, collected from the literature and practical evaluation needs, spanning lexical, syntactic, semantic, structural, execution-based and ranking-based dimensions. These include query-based metrics such as token-level Precision, Recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized metrics (SP-BLEU, SP-F1); graph-and URI-based exact match metrics; as well as answer set-based metrics such as F1-QALD and Jaccard similarity; ranking metrics including MRR, NDCG, P@k, and Hit@k; and LLM-as-a-Judge metrics. Taking inspiration from the ir-metrics library for Information Retrieval, t2s-metrics provides a modular abstraction layer that decouples metric specification from implementation, enabling consistent, transparent, and reproducible evaluation of SPARQLbased QA systems. We argue that t2s-metrics constitutes a necessary step toward systematic, standardized evaluation in question answering over knowledge graphs and facilitates deeper diagnostic insights into system behavior beyond answer correctness.
[IR-22] Not All Memories Age the Same: Autodiscovery of Adaptive Decay in Knowledge Graphs
【速读】:该论文旨在解决知识图谱在检索任务中对所有事实一视同仁、忽视其时间动态性的核心问题,即传统方法采用统一衰减策略(uniform decay),无法反映不同类型知识随时间变化的差异性。其关键解决方案是提出一个分层的连续衰减表面框架,通过两个正交信号——速度(velocity,衡量概念被观察的频率)和波动性(volatility,通过嵌入距离衡量值的变化幅度)——来参数化衰减过程。该框架包含三个可学习层级:领域级参数捕捉普遍模式(如某些谓词本质上持久或短暂)、上下文级参数捕获场景依赖变化、实体级参数实现个性化衰减;所有参数均基于生存分析从数据中自动推导,无需预定义分类体系或领域知识,从而更准确地识别查询时刻的重要信息。
链接: https://arxiv.org/abs/2604.26970
作者: Mandar Karhade
机构: Citingale(引用洞见)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 27 pages, 2 figures, 19 tables (including appendix). Preprint under review
Abstract:Knowledge graphs used for retrieval treat all facts as equally current. Existing temporal approaches apply uniform decay, using a single forgetting curve regardless of knowledge type. We show this is fundamentally misspecified: different knowledge types exhibit different temporal dynamics, and the core retrieval problem is not latency or throughput but identifying what is important at query time. We propose a hierarchical framework that replaces uniform decay with a continuous decay surface parameterized by two orthogonal signals: velocity (how frequently a concept is observed) and volatility (how much the value changes between observations, measured via embedding distance). The decay surface is decomposed into three learnable levels: domain-level parameters capture universal patterns (some predicates are inherently permanent, others inherently transient), context-level parameters capture setting-dependent variation, and entity-level adaptation personalizes decay to specific subjects. All parameters emerge from data through survival analysis on observed value lifetimes, requiring no predefined taxonomies or domain expertise. We formulate edge lifetime as a survival problem where the event is value supersession (a meaningfully different value replacing the current one), distinct from mere re-observation. Experiments on synthetic temporal knowledge graphs demonstrate recovery of planted hierarchical parameters (HDBSCAN ARI = 1.0). Validation on 107 Wikipedia articles and 1,163 patient records from the Synthea clinical EHR simulator shows that velocity-volatility clusters emerge naturally, align with observable persistence patterns, and near-universally exhibit the Lindy effect (Weibull shape k 1). Uniform decay performs 18x worse than no temporal weighting. Heterogeneous decay recovers from this, with each hierarchy level contributing measurable improvement. Comments: 27 pages, 2 figures, 19 tables (including appendix). Preprint under review Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM) ACMclasses: H.3.3; I.2.6 Cite as: arXiv:2604.26970 [cs.IR] (or arXiv:2604.26970v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.26970 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Mandar Karhade [view email] [v1] Wed, 22 Apr 2026 02:32:01 UTC (93 KB)
[IR-23] Agent icRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
【速读】:该论文旨在解决大规模推荐系统中多阶段流水线(包括预排序、排序和重排序)的系统级配置优化难题。传统方法通常聚焦于单一模型的改进,而忽视了各阶段模型输出整合时的全局配置优化,这导致每次模型迭代都需要重新调整系统级参数,且由于不同阶段目标各异、在线指标相互竞争,优化过程复杂且依赖领域专家经验。解决方案的关键在于提出 AgenticRecTune 框架,其核心是五个专业化智能体(Actor、Critic、Insight、Skill 和 Online)协同工作:通过大语言模型(LLM)如 Gemini 的推理能力探索最优配置空间,其中 Actor 提出候选配置,Critic 过滤低效方案,Online Agent 自动执行 A/B 测试并收集结果;同时引入自演化 Skillhub,由 Insight Agent 和 Skill Agent 共同分析历史实验数据,提炼推荐系统任务背后的机制并动态更新优化技能,从而实现端到端、自动化且持续进化的配置优化流程。
链接: https://arxiv.org/abs/2604.26969
作者: Xidong Wu,Yue Zhuan,Ruoqiao Wei,Hangxin Chen,Di Bai,Jintao Liu,Xinyi Wang,Xue Wang,Luoshu Wang,Xinwu Cheng
机构: Google(谷歌)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern large-scale recommendation systems are typically constructed as multi-stage pipelines, encompassing pre-ranking, ranking, and re-ranking phases. While traditional recommendation research typically focuses on optimizing a specific model, such as improving the pre-ranking model structure or ranking models training algorithm, system-level configurations optimization play a crucial role, which integrates the output from each model head to get the final score in each stage. Due to the complexity of the system, the configuration optimization is highly important and challenging. Any model modification requires new optimal system-level configurations. But each experimental iteration requires significant tuning effort. Furthermore, models in different stage operates within a distinct context and optimizes for different targets, requiring specialized domain expertise. In addition, optimization success depends on balancing competing multiple online metrics and alignment with shifting production development objectives. To address these challenges, we propose AgenticRecTune, an agentic framework comprising five specialized agents, Actor, Critic, Insight, Skill, and Online, designed to manage the end-to-end configuration optimization workflow. By leveraging the advanced reasoning of Large Language Models (LLMs), specifically Gemini, AgenticRecTune explore the optimal configuration spaces. The Actor Agent proposes multiple candidates and Critic Agent filters out suboptimal this http URL Online Agent autonomously prepares A/B tests based on the proposed configurations set from the Critic Agent and captures the subsequencet experimental results. We also introduce a self-evolving Skillhub, which utilizes a collaboration between the Insight Agent and Skill Agent to summarize the history results, extract underlying mechanics of each task in recommendation system and update skills.
[IR-24] A Randomized Controlled Trial and Pilot of Scout: an LLM -Based EHR Search and Synthesis Platform
【速读】:该论文旨在解决电子健康记录(Electronic Health Records, EHR)系统中临床文档撰写与数据检索导致的医务人员工作负荷过重及职业倦怠问题。其解决方案的关键在于开发了一个基于大语言模型(Large Language Model, LLM)的EHR搜索与信息整合平台Scout,该平台允许临床医生使用自然语言查询EHR数据,并在每个回答中提供引用链接以追溯至原始数据源,从而支持生成内容的可验证性。临床试验表明,Scout显著缩短了任务完成时间(减少37.6%),降低了感知工作负荷(尤其在心理需求、努力和时间压力方面),且在准确性、完整性和相关性上不劣于传统EHR单独使用的方式,验证了其作为高效、高质量辅助工具的潜力。
链接: https://arxiv.org/abs/2604.26953
作者: Michael Gao,Suresh Balu,William Knechtle,Kartik Pejavara,William Jeck,Matthew Ellis,Jason Thieling,Blake Cameron,Jason Tatreau,Tareq Aljurf,Henry Foote,Michael Revoir,Marshall Nichols,Matthew Gardner,William Ratliff,Bradley Hintze,Angelo Milazzo,Sreekanth Vemulapalli
机构: 未知
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
备注:
Abstract:Clinical documentation and data retrieval within Electronic Health Records (EHRs) contribute substantially to clinician workload and burnout. To address this, we developed Scout, an LLM-based EHR search and synthesis platform that enables clinicians to query EHR data using natural language. Each response includes citations linking each claim to the original data source, facilitating easy verification of generated content. We conducted a prospective randomized, evaluator-blinded crossover trial across seven clinical specialties (20 participants, 200 structured cases). Participants completed realistic clinical tasks using either Scout or the EHR alone, with outcomes including time to completion, NASA Task Load Index workload scores, and blinded expert adjudication of accuracy, completeness, and relevance. Scout reduced task completion time by 37.6% and significantly decreased perceived workload, with the largest reductions in mental demand, effort, and temporal demand. Non-inferiority analyses showed that tasks completed with Scout maintained accuracy, completeness, and relevance relative to tasks completed with the EHR-only. A concurrent pilot deployment across over 200 users and more than 20 specialties generated over 6,600 interactions in three months, revealing diverse clinical and administrative use cases. Automated evaluation using an LLM-as-judge framework identified errors at low rates. Subsequent manual review of a subset of outputs revealed that most claims flagged by the automated judge as errors were in fact supported by the patient chart, demonstrating the importance of human validation. These findings provide early trial-based evidence that LLM-powered EHR tools can meaningfully reduce clinical and administrative workloads while maintaining output quality.
人机交互
[HC-0] Essential Yet Overlooked: Identity Verification Barriers for Blind and Low Vision People in Government Services
【速读】:该论文旨在解决盲人及低视力(Blind and Low Vision, BLV)群体在政府服务身份验证过程中面临的系统性障碍问题,这些问题导致其在获取公共服务和福利时遭受显著的不平等与排斥。研究通过混合方法(包括对219条Reddit帖子的内容分析和对16名BLV参与者的半结构化访谈)揭示了数字与实体身份验证流程中的多重不可访问性,表明当前设计不仅造成使用不便,更重构了实际安全实践,使BLV用户面临更高的排除风险。解决方案的关键在于重新设计身份验证系统以支持多模态交互(如语音、触觉反馈等),并引入生成式AI(Generative AI)作为辅助工具提升可访问性,同时警惕其可能带来的身份欺诈新风险,从而在保障安全的前提下增强BLV用户的自主权与公平接入能力。
链接: https://arxiv.org/abs/2604.28166
作者: Ryan John Oommen,Tanusree Sharma
机构: Penn State(宾夕法尼亚州立大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注:
Abstract:Identity verification is a critical gateway to accessing government services and public benefits, yet contemporary systems are typically designed around visual interaction, leaving blind and low vision (BLV) individuals disproportionately burdened. In this work, we examine how BLV users navigate identity verification in government services and how current designs shape their access, security, and autonomy. Through a mixed methods study combining analysis of 219 Reddit posts and semi-structured interviews with 16 BLV participants, we uncover systemic accessibility breakdowns across both digital and in person verification processes. Our findings show that inaccessible verification workflows do not merely inconvenience users, they restructure how security is achieved in practice. We also identify how repeated verification demands, inaccessible physical infrastructure, and policy changes exacerbate exclusion from essential services. At the same time, participants articulate complex perspectives on AI, viewing it as both a critical accessibility aid and a growing vector for identity fraud.
[HC-1] Normativity and Productivism: Ableist Intelligence? A Degrowth Analysis of AI Sign Language Translation Tools for Deaf People
【速读】:该论文旨在解决当前生成式AI在手语翻译系统中存在的结构性偏见与文化忽视问题,即技术设计者在缺乏聋人群体参与的情况下,基于有偏数据构建模型,导致对聋人手语的语义、文化及口语化特征的误读与简化。其核心问题在于:此类AI系统将手语标准化为可被技术捕捉的数据形式(如统计模型和数学语言),从而消解了手语作为人类沟通方式的本质属性,并强化了听力中心主义(audism)的权力结构。解决方案的关键在于提出“有偏AI”(Ableist Intelligence)概念,强调应重新审视AI伦理框架,推动以聋人社区为中心的设计范式,使技术服务于包容性沟通而非效率导向的规训逻辑,从而实现真正的人本化技术发展。
链接: https://arxiv.org/abs/2604.28125
作者: Nina Seron-Abouelfadil,Poppy Fynes
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Paper submitted and accepted to IJES 2026
Abstract:Sign languages, of any geographical or accentual variation, understandably face continuous scrutiny under the ever present popularity of verbal dictation and audism. Through this, many potential problems arise with the current lack of accessible communication for those who rely on such sign languages for essential conversation. Such AI systems regularly take the form of recognition and interpretation models, designed to provide seamless and accurate translation. In reality these systems are built from biased data and created without any input from deaf communities. Such models are widely used and accepted by their hearing counterparts who remain ignorant to the inherent culture, semantics and colloquial language present in gestural language systems. This phenomenon is best analysed under the scope of The Technological System and Technological bluff by Ellul. Indeed, what is at play here is the standardization of language by technicians into what can be captured by technique: data, statistics, a mathematical language. For that AI technique to exist, sign language must be rationalized, in a search for profit that annihilates the conditions for communication and fails to capture the human experience of the deaf person. By that process, it presents normative effects, creating a model of Man, standardized, massified, and who has to adapt to the tool and technical milieu instead of the other way around, which we assume should have been the goal of such a technology. Technique thus reshapes what it means to be human, to submit deaf people to the goals of productivity and efficiency. In doing so, it exhibits clear counter productivity, alienating instead of emancipating, isolating instead of nourishing human relationships. Therefore this paper argues for the idea of AI as Ableist Intelligence, as such systems seek to emphasise the humiliated and marginalised nature of sign. Comments: Paper submitted and accepted to IJES 2026 Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC) Cite as: arXiv:2604.28125 [cs.AI] (or arXiv:2604.28125v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.28125 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-2] When and How AI Should Assist Brainstorming for AI Impact Assessment
【速读】:该论文旨在解决当前AI辅助工具在团队协作式人工智能影响评估(AI Impact Assessment)中缺乏针对性设计与效果验证的问题,特别是无法有效捕捉多元团队视角的局限。其解决方案的关键在于:通过引入源自战略远见(strategic foresight)的结构化方法,并在五次线下工作坊中与参与者共同设计AI干预措施,最终形成一套适用于团队头脑风暴阶段的AI支持策略;实证研究表明,AI在通用场景(如聊天机器人伴侣)下能提升评估质量与团队感知,但在专业场景(如肾脏分配应用)中无效,由此提炼出核心设计原则——AI应在早期创意阶段仅提供提示而非直接解决方案,仅在团队陷入僵局或饱和时介入;在收敛阶段协助结构化思路、利用专家知识优化方案;整体上应聚焦于支持繁琐的流程性任务,而非替代团队自主的创造性思维活动。
链接: https://arxiv.org/abs/2604.27997
作者: Jarod Govers,Sanja Šćepanović,Daniele Quercia
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted in April 2026 to be published in the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, QC, Canada
Abstract:A key task in AI practice is to assess potential impacts to prevent harm. Current AI tools assisting AI impact assessment have not been designed or evaluated for collaborative team brainstorming, and they do not capture the range of views in diverse teams. We studied how AI can support team brainstorming during AI impact assessment and made three contributions. First, we adapted two structured methods from strategic foresight and co-designed AI interventions for them in five in-person workshops with 28 participants in total. Second, we evaluated the interventions in ten in-person workshops with 54 participants, finding that AI improved impact assessment quality and brainstorming perceptions for a general-purpose AI use (a chatbot companion) but not for a specialised one (a kidney allocation application). Third, our findings result in broader design guidance for AI assistance in brainstorming: AI should only offer hints and not solutions during early ideation, initiating interaction only when participants face fixation or saturation; it should facilitate structuring ideas during convergence; leverage expertise to refine ideas; and overall, it should serve more in support of tedious brainstorming process tasks, rather than ideation that teams value to do themselves.
[HC-3] Exploring Interaction Paradigms for LLM Agents in Scientific Visualization
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在科学可视化(Scientific Visualization, SciVis)任务中性能差异及其交互模式优化问题,即如何通过不同类型的LLM代理和交互方式提升可视化工作流生成的准确性、效率与鲁棒性。其解决方案的关键在于系统性比较三类代理范式——领域专用代理(structured tool use)、计算机使用代理(computer-use agents)和通用编程代理(general-purpose coding agents),并分析代码脚本、模型上下文协议(Model Context Protocol, MCP)、API调用、命令行接口(Command-Line Interface, CLI)及图形用户界面(Graphical User Interface, GUI)等交互模态的影响,同时引入持续记忆机制以增强重复任务中的表现。研究发现,无单一方案最优,未来SciVis系统应融合结构化工具使用、交互能力与自适应记忆机制,以实现性能、鲁棒性与灵活性之间的平衡。
链接: https://arxiv.org/abs/2604.27996
作者: Jackson Vonderhorst,Kuangshi Ai,Haichao Miao,Shusen Liu,Chaoli Wang
机构: Univ. Notre Dame (圣母大学); LLNL (劳伦斯利弗莫尔国家实验室)
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language instructions. We compare three primary interaction paradigms, including domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents, by evaluating eight representative agents across 15 benchmark tasks and measuring visualization quality, efficiency, robustness, and computational cost. We further analyze interaction modalities, including code scripts and model context protocol (MCP) or API calls for structured tool use, as well as command-line interfaces (CLI) and graphical user interfaces (GUI) for more general interaction, while additionally studying the effect of persistent memory in selected agents. The results reveal clear tradeoffs across paradigms and modalities. General-purpose coding agents achieve the highest task success rates but are computationally expensive, while domain-specific agents are more efficient and stable but less flexible. Computer-use agents perform well on individual steps but struggle with longer multi-step workflows, indicating that long-horizon planning is their primary limitation. Across both CLI- and GUI-based settings, persistent memory improves performance over repeated trials, although its benefits depend on the underlying interaction mode and the quality of feedback. These findings suggest that no single approach is sufficient, and future SciVis systems should combine structured tool use, interactive capabilities, and adaptive memory mechanisms to balance performance, robustness, and flexibility.
[HC-4] From LLM -Driven Trading Card Generation to Procedural Relatedness: A Pokémon Case Study
【速读】:该论文旨在解决交易卡牌游戏(Trading Card Game, TCG)中因元游戏(metagame)趋于稳定而导致策略重复、可选卡牌减少、玩家体验下降的问题。为应对这一挑战,研究提出了一种基于生成式AI的程序化内容生成(Procedural Content Generation)方案,其关键在于构建一个融合玩家中心协同创作、微调嵌入(fine-tuned embeddings)、本地大型语言模型(Local LLMs)与图像扩散模型(Image Diffusion Models)的流水线系统,实现个性化卡牌设计的动态生成,同时通过程序化关联性(procedural relatedness)增强玩家与卡牌之间的情感连接,从而拓展创意边界并提升玩家参与度。
链接: https://arxiv.org/abs/2604.27972
作者: Johannes Pfau,Panagiotis Vrettis
机构: Utrecht University (乌得勒支大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Since the dawn of Trading Card Games, the genre has grown into a multi-billion-dollar industry engaging millions of analog and digital players worldwide. Popular TCGs rely on regular updates, balance adjustments, and rotating constraints to sustain engagement. Yet, as metagames stabilize, predictable strategies dominate and viable card options diminish, often resulting in repetitive and impaired player experiences. This paper investigates the use of Large Language Models and Image Diffusion Models for Procedural Content Generation of TCG cards, addressing these challenges by enabling a personalized infinity of card designs. Modern generative AI not only enables large-scale content creation but could even introduce procedural relatedness, fostering unique connections between players and their cards. We present a pipeline combining player-centric co-creation, fine-tuned embeddings, local LLMs, and Diffusion Models to generate dynamic, personalized cards while potentially expanding creative range. We evaluated the pipeline in a user study with 49 participants who generated 196 Pokémon card samples. Participants rated aesthetics and representativeness of visuals and mechanics, and provided qualitative feedback. Results show high satisfaction and indicate that most participants successfully realized their own ideas through prompt adjustments. These findings lay groundwork for future content generation systems and alternatives to conventional metagame evolution through procedural relatedness. Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC) Cite as: arXiv:2604.27972 [cs.AI] (or arXiv:2604.27972v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.27972 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-5] Real-Time Control of a Virtual Orchestra by Recognition of Conducting Gestures
【速读】:该论文旨在解决博物馆观众难以沉浸式体验指挥交响乐团的问题,通过构建一个180°穹顶影院中的交互式导引系统,使参观者能够以指挥家的姿态实时控制预录交响乐的播放节奏。解决方案的关键在于:利用基于骨架追踪的视觉识别模块捕捉用户手势,并通过分层长短期记忆网络(Hierarchical LSTM)将手势转化为时间控制信号,进而驱动音频播放模块调整音乐播放速度,从而实现高保真的实时音乐导引体验。
链接: https://arxiv.org/abs/2604.27957
作者: Mert Mermerci(1),Emile Pascoe(2),Fredrik Edström(3),Hedvig Kjellström(1 and 4) ((1) KTH Royal Institute of Technology, (2) SMASH Studios, (3) IVAR Studios and (4) Swedish e-Science Research Centre)
机构: KTH Royal Institute of Technology, Sweden (瑞典皇家理工学院); SMASH Studios, Sweden; IVAR Studios, Sweden; Swedish e-Science Research Centre, Sweden (瑞典电子科学研究中心)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:We present a museum installation in a 180° dome theater, which gives the museum visitor the experience of conducting a symphony orchestra. We have pre-recorded a short music piece performed by a professional orchestra. This recording is played back in the dome with the visitor standing in the conductor’s position. The visitor’s gestures are captured with a vision-based skeleton tracker, steering the recording playback pace via a gesture recognition module that translates the gestures into a time control signal. This is sent to a playback module that plays the recording in the dome at the corresponding speed. The gesture recognition module is based on a hierarchical LSTM network, trained with recorded sequences of multiple conductors with different level of expertise conducting the same recording. The system is evaluated with a quantitative study of the estimated timing accuracy, a user study evaluating the musical realism and usability of the real-time control, and a field study to evaluate the performance of the entire system with real museum visitors.
[HC-6] MyoKin3X: A Myoelectric Framework for Full-Hand 3D Force Recording
【速读】:该论文旨在解决现有手部多方向力测量系统在指端覆盖范围、力维度完整性和解剖适应性之间存在权衡的问题,尤其针对全手同步三维力采集中多轴标定、手部尺寸适配和一致的指特定力重建等技术挑战。其解决方案的关键在于提出MyoKin3X框架,该框架集成五组3D力传感器与可定制的解剖适配结构,并配备独立软件实现肌电(electromyography, EMG)与力信号的同步采集;通过原位交叉标定机制确保各传感器轴向校准因子稳定(平均变异系数0.04%,50N下最大力误差±0.06N),有效降低轴间串扰(平均减少92.71%),并实现高预测精度(R² ≥ 0.99),同时支持多种反馈模式以标准化分析不同受试者和任务下的手部运动控制与肌肉协同机制研究。
链接: https://arxiv.org/abs/2604.27949
作者: Charlotte Rohleder,Raul Sîmpetru,Annika Wünsch,Alessandro Del Vecchio
机构: Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希-亚历山大大学)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, 3 figures
Abstract:Simultaneous multi-directional force measurement across all five digits is essential for studying hand coordination, compensatory forces, and myoelectric control, yet existing systems trade off digit coverage, force dimensionality, and anatomical adaptability. Reliable full-hand acquisition remains challenging because multi-axis calibration, hand-size adjustment, and consistent digit-specific force reconstruction are technically demanding. We present MyoKin3X, a customizable full-hand framework for simultaneous 3D force measurement of up to five digits providing robust and validated force reconstruction. It combines an anatomically versatile structure with five integrated 3D force sensors and a standalone software for synchronized electromyography and force acquisition. MyoKin3X provides in-place cross-calibration of all five sensors, single- and multi-digit maximal voluntary contraction recording, and automated coordinate transformation to digit-specific coordinate systems for standardized analysis across subjects and tasks. Calibration validation demonstrates high stability of the axis-specific calibration factors, with a mean coefficient of variation of 0.04% and maximum force error of ± 0.06N at 50N. It also shows effective inter-axis decoupling (mean crosstalk reduction: 92.71%; residual crosstalk below 0.02% for most axis pairs) and high predictive accuracy (R2 0.99 across sensors). The software includes four feedback modes: 1D ramps, fatigue protocols, 2D arbitrary target ramps, and 2D exploratory tasks. MyoKin3X therefore enables standardized full-hand force acquisition with validated measurement reliability, flexible protocol control, and real-time visualization for high-fidelity studies of hand motor control, muscle synergies, and human-machine interfacing.
[HC-7] Enhancing multimodal affect recognition in healthcare: the robustness of appraisal dimensions over labels within age groups and in cross-age generalisation
【速读】:该论文旨在解决生成式 AI(Generative AI)在辅助认知训练(Computerized Cognitive Training, CCT)中情感识别(affect recognition)的挑战,尤其是跨年龄群体的情感建模泛化问题。其关键解决方案在于采用基于评价理论(appraisal theories)的情感维度(appraisal dimensions)替代传统的类别标签(categorical labels),并结合多模态融合与深度学习表示方法进行建模。实证结果表明,基于评价维度的模型在不同年龄群体间均表现出更高的预测准确性和稳定性,而类别标签模型则无法跨年龄泛化,性能退化至随机水平。这证明了评价维度在情感计算中的理论和实践优势,为跨年龄情境下连续时间情感预测提供了可靠框架。
链接: https://arxiv.org/abs/2604.27938
作者: Hippolyte Fournier,Sina Alisamir,Safaa Azzakhnini,Isabella Zsoldos,Eléonore Trân,Gérard Bailly,Frédéric Elisei,Béatrice Bouchot,Brice Varini,Patrick Constant,Joan Fruitet,Franck Tarpin-Bernard,Solange Rossato,François Portet,Olivier Koenig,Hanna Chainay,Fabien Ringeval
机构: Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG; Univ. Lyon 2, EMC; GIPSA-lab, Univ. Grenoble Alpes; ATOS company; Pertimm company; Humans Matter company
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The integration of artificial intelligence (AI) into healthcare has advanced significantly, yet affect recognition remains a major challenge, particularly in AI-assisted interventions such as Computerized Cognitive Training (CCT). The THERADIA-WoZ corpus was developed to enable multimodal affect recognition in the context of AI-driven CCT, focusing on an older adult population. This study extends the corpus by introducing a dataset collected from young adults, allowing direct comparison of affect recognition models across age groups. Our objective was to assess whether multimodal models based on dimensions borrowed from appraisal theories outperform those based on categorical labels and to evaluate their generalisation power across age corpora. After comparing both corpora, models were trained and tested using within-corpus, cross-corpus, and mixed-corpus evaluation. Results revealed that appraisal dimensions consistently outperformed categorical labels across all conditions, demonstrating greater predictive accuracy and stability. Notably, categorical labels failed to generalise across age corpora, as performance dropped to chance levels in cross-corpus evaluation. In contrast, appraisal dimensions maintained predictive performance above chance, reinforcing their robustness for cross-age affect recognition. Furthermore, training on both corpora did not improve generalisation beyond within-corpus training. The findings support the theoretical and practical advantages of appraisal dimensions over categorical labels in affective computing. They also highlight the importance of multimodal fusion and deep learning representations for emotion modeling. To facilitate future research, we provide an API for researchers interested in time-continuous emotion prediction, offering valuable tools for behavioral sciences to enhance the measurement of emotional states in various experimental settings.
[HC-8] CoNewsReader: Supporting Comprehensive Understanding and Raising Critical Thoughts on Social Media News Through Comments
【速读】:该论文旨在解决普通用户在社交媒体上进行批判性新闻阅读(Critical News Reading, CNR)时面临的挑战,即难以全面理解新闻主旨并开展批判性思考。现有研究表明,新闻下的评论能够提供补充信息和多元观点,有助于提升CNR效果,但如何有效利用这些评论仍缺乏系统研究。论文提出解决方案的关键在于开发一个基于大语言模型(Large Language Model, LLM)的交互式工具CoNewsReader,其核心功能包括:借助评论获取新闻主旨的补充信息、筛选对CNR有价值的评论,以及基于评论生成引导性问题以促进批判性思维。实证研究表明,相较于传统社交媒体新闻界面,CoNewsReader显著提升了用户的CNR参与度与认知表现。
链接: https://arxiv.org/abs/2604.27905
作者: Kangyu Yuan,Guanzheng Chen,Sizhe Liang,Hehai Lin,Qingyu Guo,Dingdong Liu,Xiaojuan Ma,Zhenhui Peng
机构: The Hong Kong University of Science and Technology (香港科技大学); Zhejiang University (浙江大学); Sun Yat-sen University (中山大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Human-Computer Interaction (cs.HC)
备注: (CSCW 26) THE 29TH ACM CONFERENCE ON COMPUTER-SUPPORTED COOPERATIVE WORK AND SOCIAL COMPUTING
Abstract:Critical news reading (CNR), which requires grasping the holistic ideas of and raising critical thoughts on the news, is beneficial yet challenging for general people who usually get information on daily social media. Comments under the news can aid CNR by providing complementary information and other readers’ diverse and critical thoughts. However, it is under-investigated how to leverage these comments to support users in CNR. In this paper, we first derive user requirements for a comment-based CNR tool from literature and a formative study (N=12). Then, we develop CoNewsReader, a comment-based interactive CNR tool powered by a large language model. CoNewsReader supports users in grasping the news idea with complementary information from comments, filtering useful comments for CNR, and getting questions generated based on the comments to conduct critical thinking. Our within-subjects study with 24 university students indicates that compared to a baseline news reading interface in social media, participants with CoNewsReader have a more engaging CNR experience and perform better on comprehending the news and raising critical thoughts. We discuss design considerations for supporting reading tasks with user- and machine-generated content.
[HC-9] Building Persona-Based Agents On Demand: Tailoring Multi-Agent Workflows to User Needs
【速读】:该论文旨在解决当前代理系统(agent systems)普遍依赖硬编码架构、固定角色分工与交互流程所带来的局限性,这些限制使得系统难以实现个性化适配和动态情境响应。其核心问题在于:现有多代理协作机制缺乏灵活性,无法根据用户特征、任务需求及工作流上下文进行实时调整,从而影响了交互效率与适用性。解决方案的关键在于提出一种运行时按需生成AI人格(persona)的流水线方法,通过动态构建与用户属性、任务目标和环境条件相匹配的代理人格,使代理平台能够摆脱“一刀切”的配置模式,迈向更智能、自适应的协同工作范式。
链接: https://arxiv.org/abs/2604.27882
作者: Giuseppe Arbore,Andrea Sillano,Luigi De Russis
机构: Politecnico di Torino(都灵理工大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Recent advances in agentic AI are shifting automation from discrete tools to proactive multi-agent systems that coordinate multi-specialized capabilities behind unified interfaces. However, today’s agent systems typically rely on hard-coded agent architectures with fixed roles, coordination patterns, and interaction flows that limit end-user personalization and make adaptation to individual needs and contexts difficult. Given this limitation, we argue that on-demand persona-based agent generation offers a promising path towards more efficient and contextually appropriate interaction within agentic workflows. By dynamically crafting agents and personas at run-time to match user characteristics, task demands, and workflow context, agentic platforms can move beyond one-size-fits-all configurations. We present a pipeline for on-demand persona generation in agentic platforms, detailing how real-time crafting of AI personas can be systematically integrated within agent systems, aiming to open new possibilities in agentic platform design paradigms.
[HC-10] “It depends on where AI is used”: Players attitude patterns and evaluative logics toward different AI applications in digital games
【速读】:该论文旨在解决数字游戏中玩家对人工智能(AI)应用的接受度为何存在差异的问题,特别是探讨在不同应用场景下玩家态度的形成机制。其解决方案的关键在于通过分析1,856份有效开放式问卷反馈,识别出六种评价逻辑:体验增强(experiential enrichment)、工具效率(instrumental efficiency)、系统可靠性(system reliability)、自主性与控制权(agency and control)、作者身份与合规性(authorship and compliance),以及人类监督(human oversight),从而揭示玩家对AI干预游戏过程的接受或拒绝行为背后的深层动因,强调AI接受度具有高度情境敏感性。
链接: https://arxiv.org/abs/2604.27812
作者: Ting-Chen Hsu,Jiangxu Lin,Wenran Chen,Fei Qin,Zheyuan Zhang
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:As AI becomes increasingly embedded in digital games, players’ attitudes de-pend not only on whether AI is used, but also on where and how it intervenes in gameplay. This study examines players’ evaluative patterns toward eight AI application contexts, including intelligent NPCs, emergent narrative, dynamic balancing, recommendation systems, review and governance, art asset generation, co-creation gameplay, and gameplay evolution. Based on 1,856 valid open-ended responses from 310 questionnaires, we conducted thematic analysis to identify reasons for acceptance, rejection, and conditional acceptance. Results show that players welcomed AI when it enhanced immersion, personalization, novelty, efficiency, or convenience, but resisted it when it threatened creativity, emotional authenticity, autonomy, fairness, system stability, authorship, or accountability. We further identify six evaluative logics: experiential enrichment, instrumental efficiency, system reliability, agency and control, authorship and compliance, and human oversight. These preliminary findings highlight the context-sensitive nature of AI acceptance in digital games.
[HC-11] Agent Economist: An End-to-end Agent ic System Translating Economic Intuitions into Executable Computational Experiments
【速读】:该论文旨在解决经济学研究中“将直观经济直觉转化为可验证研究”的长期挑战,即如何高效、可靠地从理论洞察生成可执行的计算实验。解决方案的关键在于提出AgentEconomist系统,这是一个端到端的交互式框架,采用模块化多阶段架构:包括基于13,000篇高质量学术论文构建的领域知识库驱动的“思想发展阶段”(Idea Development Stage)、与模拟器对齐的“实验设计阶段”(Experimental Design Stage)以及“实验执行阶段”(Experimental Execution Stage)。该系统通过人机协同(human-in-the-loop)迭代流程,实现从抽象直觉到结构化计算实验的自动转化,显著提升研究创意的文献根基性、新颖性和洞察力,优于当前最先进的通用大语言模型(LLMs)。
链接: https://arxiv.org/abs/2604.27725
作者: Jiaju Chen,Jinghua Piao,Xia Xu,Songwei Li,Tong Xia,Xiangnan He,Yong Li
机构: Zhongguancun Academy(中关村学院); University of Science and Technology of China(中国科学技术大学); Tsinghua University(清华大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:A long-standing challenge in economics lies not in the lack of intuition, but in the difficulty of translating intuitive insights into verifiable research. To address this challenge, we introduce AgentEconomist, an end-to-end interactive system designed to translate abstract intuitions into executable computational experiments. Grounded in a domain-specific knowledge base covering over 13,000 high-quality academic papers, the system employs a modular multi-stage architecture. Specifically, the Idea Development Stage generates literature-grounded hypotheses, the Experimental Design Stage configures simulator-aligned experimental parameters and protocols, and the Experimental Execution Stage runs experiments and returns structured analyses. Together, these stages form a human-in-the-loop, iterative workflow that translates economic intuitions into executable computational experiments. Through extensive experiments involving human expert evaluation and large language models (LLMs) as judges, we show that the system generates research ideas with stronger literature grounding and higher novelty and insight than state-of-the-art generic LLMs. Overall, AgentEconomist adopts a human-AI collaboration paradigm that enables researchers to focus on high-level intuitions, while delegating the labor-intensive processes of translation and computational execution to agents.
[HC-12] Users Activity Logs: the Good the Bad the Misconception and the Disastrous
【速读】:该论文旨在解决现有研究过度聚焦于用户活动日志(activity logs)的隐私风险与用户认知不足等负面问题,而忽视了用户对日志的积极感知、误解及极端负面影响的平衡视角缺失问题。其解决方案的关键在于通过针对谷歌(Google)活动控制功能的案例研究,基于30名沙特阿拉伯个人账户用户的访谈数据进行二次分析,采用模板分析法(template analysis)从“良好”“不良”“误解”和“灾难性”四个主题维度系统挖掘用户对活动日志的多维感知,从而揭示新的使用场景与认知模式,为服务提供商、隐私研究者及用户提供了更全面的理解框架与实践建议。
链接: https://arxiv.org/abs/2604.27676
作者: Eman Alashwali
机构: King Abdulaziz University (国王阿卜杜勒阿齐兹大学)
类目: Human-Computer Interaction (cs.HC)
备注: To appear at the Information and Computer Security Journal (Emerald Publishing)
Abstract:Most service providers, such as Google, save logs from data generated by users while using the service. Many service providers provide users with privacy controls to manage whether, how, and for how long the data is saved and used by the service provider. While most prior studies focused on the negative side of users’ activity logs, such as users’ lack of awareness about the logs’ privacy controls and users’ privacy concerns toward their data, this work aims to provide a balanced view of users’ perceptions regarding activity logs by considering the positive, negative, and extremely negative (hence disastrous) sides, as well as the misconceptions of activity logs. In this work, we present a case study of Google’s Activity controls by conducting a secondary analysis of interview data from 30 Google personal account holders in Saudi Arabia. Using template analysis, we analyzed the data from the lens of four main themes: the good, the bad, the misconception, and the disastrous aspects of users’ activity logs from the users’ perspective. Our findings uncover new themes and use cases, offering a balanced view of users’ perceptions of activity logs, and provide a better understanding and a useful source for subsequent studies on related topics. We conclude with practical recommendations for service providers, privacy researchers and experts, and users alike.
[HC-13] he TEA Nets framework combines AI and cognitive network science to model targets events and actors in text
【速读】:该论文旨在解决文本中情感、语义和句法结构难以被系统性提取与解释的问题,尤其是在阴谋论文本和大语言模型(LLM)生成的对话文本中。解决方案的关键在于提出Target-Event-Agent Networks(TEA Nets),这是一种基于认知网络科学与人工智能的计算框架,能够从文本中自动识别出“主体(Agents)”、“事件(Events)”和“目标(Targets)”三元组,并通过网络分析方法实现可解释的情感检测、语义框架解析及语言学探究。该方法在多个数据集上验证了其有效性,揭示了高阴谋倾向文本中人称代词与情绪动作的高频关联以及LLM与人类在心理治疗语境下情感表达强度的差异。
链接: https://arxiv.org/abs/2604.27673
作者: Sebastiano Franchini,Alexis Carrillo,Edoardo Sebastiano De Duro,Riccardo Improta,Ali Aghazadeh Ardebili,Massimo Stella
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
Abstract:We introduce Target-Event-Agent Networks (TEA Nets) as a computational framework to extract subjects (Agents"), verbs (Events"), and objects (Targets") from texts. Grounded in cognitive network science and artificial intelligence, TEA Nets are implemented as an open-source Python library. We test TEA Nets in three case studies, demonstrating the framework's ability to perform interpretable emotion detection, semantic frame analyses, and linguistic inquiries across conspiracy texts and textual responses generated by LLMs. In the LOCO conspiracy corpus, TEA Nets revealed that highly conspiratorial narratives (4,227 texts) linked personal pronouns (I", you", we") with the same actions twice as frequently as low-similarity conspiracy narratives. High-conspiracy narratives connected person-focused elements (you", people") through actions eliciting anger above the random baseline ( z = 2.63, p .05 ), a trend absent in low-similarity conspiracy narratives, which emphasized scientific actors (researcher", scientist"). In the HOPE and CounseLLMe datasets of 212 (human) and 200 (LLM-based) psychotherapy transcripts, respectively, TEA Nets highlighted emotional differences. When expressing feelings, Claude 3 Haiku, GPT-3.5, and humans used sad words with higher frequency than random expectations but Haiku expressed sadness with lower emotional intensity than humans ( U = 1243.5, p = .036 ). We discuss these differences in the context of psychotherapy training on LLM-simulated patients. Our results show that Target-Event-Agent Networks can extract relevant emotional, syntactic, and semantic insights from narratives, opening new avenues for text analysis with cognitive network science.
[HC-14] Mapping how LLM s debate societal issues when shadowing human personality traits sociodemographics and social media behavior
【速读】:该论文旨在解决当前缺乏系统性数据集来研究大型语言模型(Large Language Models, LLMs)在不同社会情境和提示条件下输出差异的问题。其解决方案的关键在于构建了一个包含19万条记录的合成语料库——认知数字影子(Cognitive Digital Shadows, CDS),其中每条记录由19个LLMs生成,且被引导扮演人类角色或AI助手角色,在疫苗/医疗、社交媒体虚假信息、科学领域的性别差距及STEM刻板印象等四个争议性社会议题上生成响应。CDS通过编码17个社会人口统计与心理属性,实现了对提示、语言风格、立场和推理过程的关联分析,并提供可解释自然语言处理(Interpretable NLP)支持情感与语义框架的量化比较,从而为未来LLM偏见、社会敏感性和对齐性的审计提供了可扩展的数据基础与交互式分析平台。
链接: https://arxiv.org/abs/2604.27624
作者: Ali Aghazadeh Ardebili,Massimo Stella
机构: CogNosco Lab, University of Trento, Department of Psychology and Cognitive Science (心理学与认知科学系), Trento, Italy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) can strongly shape social discourse, yet datasets investigating how LLM outputs vary across controlled social and contextual prompting remain sparse. Cognitive Digital Shadows (CDS) is a 190,000-record synthetic corpus supporting analyses of LLM-generated discourse. Each CDS record is generated by one of 19 LLMs, prompted to shadow either a human persona or an AI-assistant role. CDS contains LLM responses on 4 controversial societal topics: vaccines/healthcare, social media disinformation, the gender gap in science, and STEM stereotypes. Persona-conditioned records encode 17 sociodemographic and psychological attributes, providing data linking LLMs’ prompts, language, stances and reasoning. Texts are validated for topic anchoring and can support emotional analyses via interpretable NLP (e.g. textual forma mentis networks). CDS is enriched by a pooling platform with user-friendly dashboards, enabling easy, interactive group-level comparisons of emotional and semantic framing across personas, topics and models. The CDS prompting framework supports future audits of LLMs’ bias, social sensitivity and alignment.
[HC-15] Math Education Digital Shadows for facilitating learning with LLM s: Math performance anxiety and confidence in simulated students and AIs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在数学教育中应用时缺乏对其数学能力与偏见的系统性数据支持的问题。现有评测多仅关注分数表现,忽略了如自我效能感、数学焦虑及认知网络等心理与社会因素对教学效果的影响。解决方案的关键在于提出MEDS(Math Education Digital Shadows)数据集,该数据集通过28,000个来自14种LLM家族(如Mistral、Qwen、DeepSeek等)的虚拟人格(persona),模拟人类与AI助手两种情境下的数学推理行为,并整合四类任务:开放式数学访谈、关于数学感知的心理测量测试、数学态度的认知网络建模以及高中水平数学题及其推理过程和置信度评分。此设计使研究者能够全面分析LLMs在数学教育中的表现及其潜在偏差,从而为学习分析、认知科学和更安全的数学AI助教开发提供基础支撑。
链接: https://arxiv.org/abs/2604.27618
作者: Naomi Esposito,Anthony Tricarico,Luisa Porzio,Ali Aghazadeh Ardebili,Massimo Stella
机构: University of Trento (特伦托大学); CogNosco Lab (认知科学实验室)
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
Abstract:To enhance LLMs’ impact on math education, we need data on their mathematical prowess and biases across prompts. To fill this gap, we introduce MEDS (Math Education Digital Shadows) as a dataset mapping how large language models reason about and report mathematics across human- and AI-like conditions. MEDS involves 28,000 personas from 14 LLMs (from families like Mistral, Qwen, DeepSeek, Granite, Phi and Grok) shadowing either humans or AI assistants. Each record/shadow includes a set of prompts along with psychological/sociodemographic persona metadata and four types of math tasks: (i) open math interview, (ii) three psychometric tests about math perceptions with explanations, (iii) cognitive networks capturing math attitudes, and (iv) 18 high-school math test questions together with their reasoning and confidence scores. MEDS differs from traditional score-only math benchmarks because it integrates concepts of self-efficacy, math anxiety, and cognitive network science besides math proficiency scores. Data validation shows that the sampled LLMs exhibit schema integrity and consistent personas, together with family-specific peculiarities like human-like negative math attitudes, logical fallacies, and math overconfidence. MEDS will benefit learning analytics experts, cognitive scientists, and developers of safer AI tutors in mathematics.
[HC-16] Knowledge Affordances for Hybrid Human-AI Information Seeking
【速读】:该论文旨在解决在日益异构的信息生态系统中,人类与人工智能代理如何识别有意义的信息获取机会这一核心问题,尤其是在混合人机环境中,面对复杂多样的知识源时,决策者难以判断“向谁寻求知识”以及“为何选择该来源”。其解决方案的关键在于提出**知识可及性(Knowledge Affordance, KA)**的概念,即通过一种声明式、语义基础的描述方式,明确刻画知识源所能提供的信息类型、适配的问题类别及其上下文属性。KA被视为关系性概念,源于任务目标、代理偏好与情境因素之间的动态交互,从而为构建具备更高透明度、适应性和共享理解能力的KA感知系统提供理论基础和研究方向。
链接: https://arxiv.org/abs/2604.27539
作者: Irene Celino
机构: Cefriel(CEFREL)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 10 pages, accepted at Hybrid Human Artificial Intelligence Conference (HHAI 2026)
Abstract:As information ecosystems grow more heterogeneous, both humans and artificial agents increasingly face a simple yet unresolved question: when seeking knowledge, whom should we ask, and why? Inspired by how people intuitively “read a room”, this paper introduces the concept of knowledge affordance (KA) to systematize how agents identify meaningful opportunities for information seeking in hybrid human-AI environments. Rather than introducing a fully formed framework, we propose KAs as declarative, semantically grounded descriptions of what a knowledge source can offer, for which kinds of questions, and with which contextual properties. Additionally, we suggest that KAs are relational, possibly emerging from the interplay between the agent’s task, preferences and situational factors. Our contribution is thus a conceptual proposal that connects different research streams, including affordances, semantic web services, knowledge engineering and querying, and mutual intelligibility. We sketch possible research directions to build KA-aware systems that navigate information spaces with greater transparency, adaptability and shared understanding. Comments: 10 pages, accepted at Hybrid Human Artificial Intelligence Conference (HHAI 2026) Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.27539 [cs.HC] (or arXiv:2604.27539v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2604.27539 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-17] pviz: Interactive Linear Programming Visualization
【速读】:该论文旨在解决线性规划(Linear Programming, LP)求解过程中缺乏直观可视化工具的问题,尤其在教学和算法研究中难以快速理解不同优化算法的行为差异。解决方案的关键在于开发了一个基于浏览器的交互式可视化工具 lpviz,其核心创新在于允许用户直接通过图形界面绘制和编辑可行域与目标向量,而无需手动输入复杂数值系数;同时支持多种主流线性规划算法(如单纯形法、内点法、原对偶混合梯度法及中心路径法)的对比演示,并在三维模式下通过高度映射重要求解器元数据(如互补间隙或KKT残差),从而帮助用户深入洞察算法收敛行为与参数设置的影响。
链接: https://arxiv.org/abs/2604.27518
作者: Evan Grand,Michael Klamkin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Optimization and Control (math.OC)
备注:
Abstract:This paper presents lpviz, a browser-based visualization tool for linear programming. lpviz is deeply interactive, offering an intuitive interface where users can directly draw and edit the feasible region and objective vector, without requiring cumbersome manipulation of raw numerical coefficients. lpviz lets users compare the behavior of several classes of linear programming algorithms, namely Simplex, Interior-Point, Primal-Dual Hybrid Gradient, and Central Path. In the 3D mode, lpviz places iterates at heights corresponding to important solver metadata such as complementarity gap or KKT residual, helping users gain further insight into algorithm behavior beyond the primal iterates alone. lpviz has been used in both research and classroom settings, to help develop intuition for the strengths and weaknesses of different solvers and the impact of solver settings on convergence behavior. lpviz is open-source, permissively licensed, and freely available on any device with a web browser at this https URL .
[HC-18] Im Fine But My Voice Isnt: Cross-Modal Affective Dissonance Detection for Reflective Journaling
【速读】:该论文旨在解决数字日记写作中存在的“情感真实性缺口”问题,即用户在将原始情绪转化为文字时往往会进行情感修饰,导致文本与实际情绪表达(如语音声学特征)之间出现不一致。为此,作者提出跨模态情感不协调检测(Cross-Modal Affective Dissonance Detection, CADD),通过三分类任务区分三种情绪表达模式:掩饰型(正向文本、负向声学)、应对型(负向文本、正向声学)以及一致型(文本与声学一致),其理论基础源于Gross的情绪调节过程模型。解决方案的关键在于构建了一个具有共享句子池设计的1,800样本TTS数据集CADD-Journal,以有效隔离声学信号与文本内容,并引入一种带有非对称跨模态注意力机制的双编码器模型DACM,该模型通过四步消融实验证明非对称注意力是提升性能的核心因素(macro-F1提升0.242),同时揭示了当前训练数据与真实自然语音之间存在显著领域差距,为未来野外语料库构建提供了明确方向。
链接: https://arxiv.org/abs/2604.27517
作者: Sumin Lee
机构: Seoul National University (首尔国立大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Digital journaling creates an authenticity gap: users consciously translate raw emotions into text, often sanitizing narratives even in private writing. We formalize this as Cross-Modal Affective Dissonance Detection (CADD), a directional three-way classification distinguishing Masking (positive text, negative acoustics), Coping (negative text, positive acoustics), and Congruent utterances, grounded in Gross’s process model of emotion regulation. We present three further contributions: (i) CADD-Journal, a 1,800-sample TTS dataset with a shared-sentence-pool design that provably isolates acoustic signal from textual content; (ii) DACM, a dual-encoder model with asymmetric cross-modal attention that re-solves a gradient degeneracy in pooled fusion, achieving macro-F1 0.711 - with a four-step ablation demonstrating that asymmetric attention is the dominant driver (+ 0.242) while the DIM is effective only on cross-modal features (+0.033); and (iii) a domain gap quantification: zero-shot evaluation across three naturalistic corpora reveals a substantial gap between TTS-trained models and real speech, and we identify two concrete requirements for future in-the-wild corpus construction. ReflectJournal, a proof-of-concept iOS application, operationalizes the framework and provides a deployment platform for naturalistic data collection.
[HC-19] Examining discontinuance of AI-mediated informal digital learning of English (AI-IDLE) among university students: Evidence from SEM and fsQCA
【速读】:该论文旨在解决大学生成长于人工智能(AI)辅助的非正式数字英语学习(AI-IDLE)情境中,其学习中断意愿(discontinuance intention)的形成机制问题。研究基于认知-情感-行为意向框架,揭示了三个认知因素——期望落差(disconfirmation)、感知复杂性(perceived complexity)和感知风险(perceived risk)如何通过引发不满(dissatisfaction)与挫败感(frustration)这两个情感反应,进而影响学习者终止使用AI工具的倾向。关键解决方案在于:首先,通过结构方程模型(SEM)验证情感反应对中断意愿的正向预测作用,其中挫败感效应更强;其次,利用模糊集定性比较分析(fsQCA)发现导致高中断意愿存在多种因果配置路径,表明中断行为具有因果复杂性和等效性(equifinality),而非单一必要条件驱动。这为优化AI支持下的非正式英语学习体验、降低用户负面情绪提供了理论依据与实践方向。
链接: https://arxiv.org/abs/2604.27506
作者: Yiran Du,Huimin He
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:This study examined university students’ discontinuance intention towards AI-mediated informal digital learning of English (AI-IDLE). Drawing on the cognition-affect-conation framework, the study investigated how three cognitive factors, namely disconfirmation, perceived complexity, and perceived risk, influence two affective responses, namely dissatisfaction and frustration, and how these affective responses predict discontinuance intention. A cross-sectional survey was conducted with 746 Chinese university students who had experience using AI tools for informal English learning. Data were analysed using structural equation modelling (SEM) and fuzzy-set qualitative comparative analysis (fsQCA). The SEM results showed that dissatisfaction and frustration positively predicted discontinuance intention, with frustration showing the stronger effect. Disconfirmation, perceived complexity, and perceived risk also positively influenced dissatisfaction and frustration. The fsQCA results further identified multiple sufficient configurations leading to high AI-IDLE discontinuance intention, indicating that discontinuance is shaped by causal complexity and equifinality rather than by a single necessary condition. These findings extend AI-IDLE research from adoption and engagement to post-adoption disengagement and provide implications for reducing learners’ dissatisfaction, frustration, perceived complexity, and risk in AI-supported informal English learning.
[HC-20] From Elastic to Viscoelastic: An EEMD-Enhanced Pulse Transit Time Model for Robust Blood Pressure Estimation
【速读】:该论文旨在解决基于脉搏传导时间(Pulse Transit Time, PTT)的无袖带血压(Cuffless Blood Pressure, BP)估计算法在快速血流动力学波动下精度下降的问题。传统模型依赖Moens-Korteweg方程,假设动脉壁为纯弹性材料,忽略了血管固有的粘弹性特性(viscoelasticity),导致在高血压等病理状态下性能受限。解决方案的关键在于提出一种物理信息驱动的框架,引入粘弹性补偿机制:首先通过改进的Akima插值(Modified Akima, Makima)对原始光电容积脉搏波(Photoplethysmogram, PPG)信号进行高保真重建;其次采用交切线法(Intersecting Tangent Method)实现脉搏波起始点的精确定位;最关键的是利用集合经验模态分解(Ensemble Empirical Mode Decomposition, EEMD)分离出高频本征模态函数(Intrinsic Mode Functions, IMFs),定义“粘弹性速度指标”(Viscoelastic Velocity Metric)量化通常被弹性模型忽略的血管阻尼效应(η·ε̇),从而显著提升模型对血管迟滞现象(vascular hysteresis)的鲁棒性。
链接: https://arxiv.org/abs/2604.27500
作者: Boyuan Gu,Yijin Yang,Shuaiqi Cheng,Xiaorong Ding
机构: UESTC; Tsinghua University SIGS
类目: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
备注: 4 pages, 5 figures
Abstract:Cuffless blood pressure (BP) estimation based on Pulse Transit Time (PTT) has emerged as a promising solution for continuous health monitoring. However, conventional models relying on the Moens-Korteweg equation often fail during rapid hemodynamic fluctuations, as they assume arterial walls are purely elastic and neglect inherent viscoelasticity. To address this limitation, we propose a physics-informed framework introducing a viscoelastic compensation mechanism. First, raw photoplethysmogram (PPG) signals undergo high-fidelity reconstruction using Modified Akima (Makima) interpolation. Second, a robust Intersecting Tangent Method is applied for precise pulse foot localization. Crucially, we utilize Ensemble Empirical Mode Decomposition (EEMD) to isolate high-frequency Intrinsic Mode Functions (IMFs), defining a ``Viscoelastic Velocity Metric’’ to quantify the vascular damping effect ( \eta \cdot \dot\epsilon ) typically ignored by elastic models. The framework was rigorously validated on a challenging subset of the MIMIC-II database (364 subjects, 28,525 cardiac cycles) characterized by a high prevalence of hypertension (23.4%). Experimental results demonstrate medical-grade accuracy, yielding a Root Mean Square Error (RMSE) of 5.22 mmHg for Systolic and 3.65 mmHg for Diastolic BP, with Pearson correlation coefficients ( R 0.97 ). These findings confirm that incorporating viscoelastic features significantly enhances robustness against vascular hysteresis.
[HC-21] Why Learners Drift In and Out: Examining Intermittent Discontinuance in AI-Mediated Informal Digital English Learning (AI-IDLE) Using SEM and fsQCA
【速读】:该论文旨在解决人工智能赋能的非正式英语数字学习(AI-IDLE)中用户间歇性中断使用的问题,即学习者在初期采纳后出现的临时性退出现象。研究基于认知-情感-行为倾向框架,通过结构方程模型(SEM)与模糊集定性比较分析(fsQCA)揭示了导致中断的核心机制:感知智能、互动性和个性化通过提升学习愉悦感间接降低中断风险;而感知无效性、不可控性和复杂性则通过引发无聊情绪增加中断概率。关键解决方案在于设计以愉悦感为核心、兼具个性化、可控性和认知可管理性的AI支持英语学习体验,从而系统性缓解不同组合的认知障碍与情感脱离所引发的间歇性中断行为。
链接: https://arxiv.org/abs/2604.27493
作者: Yiran Du,Huimin He
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:This study examined intermittent discontinuance in AI-mediated informal digital learning of English (AI-IDLE) through the cognition-affect-conation framework. Survey data were collected from 632 Chinese university EFL learners with prior AI-IDLE experience and analysed using structural equation modelling and fuzzy-set qualitative comparative analysis. The SEM results showed that perceived intelligence, perceived interactivity, and perceived personalisation reduced AI-IDLE intermittent discontinuance indirectly through enjoyment, whereas perceived ineffectiveness, perceived uncontrollability, and perceived complexity increased discontinuance indirectly through boredom. The fsQCA results further identified four configurational pathways leading to intermittent discontinuance, indicating that learners’ temporary withdrawal from AI-IDLE can result from different combinations of cognitive barriers and affective disengagement. These findings extend AI-IDLE research from adoption and continuance to post-adoption discontinuance and highlight the need to design AI-supported English learning experiences that are enjoyable, personalised, controllable, and cognitively manageable.
[HC-22] Beyond One-Size-Fits-All Exercises: Personalizing Computer Science Worksheets with Large Language Models ICSE
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在教育场景中主要面向学生应用、而忽视教师端支持的问题,特别是如何利用LLM为计算机科学入门课程(CS1)中的教师提供可操作的个性化教学内容生成能力。其解决方案的关键在于基于FACET(Framework for Adaptive Content using Educational Technology)系统进行实践性适配,将学习者分为四类画像(依据布卢姆认知分类学和自我决定理论),并据此生成在支架结构(scaffolding)、教学明确度(instructional explicitness)和语气(tone)上差异化的内容。实证研究表明,这种以学习者画像驱动的LLM个性化策略显著提升了低知识-低动机群体的任务完成率(从约70%提升至99%),同时保持了任务的“理想难度”(desirable difficulty),从而有效缓解了高风险学生的参与流失问题,实现了对教学效率与公平性的双重优化。
链接: https://arxiv.org/abs/2604.27433
作者: Franco Ortiz,Runlong Ye,Michael Liut
机构: University of Toronto Mississauga (多伦多大学密西沙加分校); University of Toronto (多伦多大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at ITiCSE
Abstract:Large Language Models (LLMs) have been widely applied to student-facing educational tools, this work explores their use in supporting instructors by presenting a practical adaptation of the Framework for Adaptive Content using Educational Technology (FACET) system to generate personalized instructional materials for an Introduction to Computer Programming (CS1) course. We conducted a mixed-methods study with 409 first-year computer science (CS) students, focusing on regular expressions (RegEx). Students were assessed on their knowledge and motivation, classified into one of four learner profiles, and assigned either LLM-personalized (treatment) or standard non-adaptive (control) exercises. Personalized materials varied in scaffolding, instructional explicitness, and tone based on learner profiles grounded in Bloom’s Taxonomy and Self-Determination Theory. Quantitative analysis reveals that standard exercises resulted in task incompletion among low-knowledge learners, with approximately 25-30% incompletion, whereas personalized materials sustained near-universal completion (99%) across all profiles. While high-performing students experienced ceiling effects, Low Knowledge/Low Motivation students achieved significantly higher correctness (+18.2%) with personalized support. Survey data indicate that students prioritize structural scaffolding (logical sequence, difficulty pacing) over motivational tone and perceive the adaptive tasks as equally challenging as standard exercises. These findings suggest that learner-profile-driven LLM personalization primarily serves as a retention scaffold, preventing task abandonment among at-risk students without diminishing the task’s “desirable difficulty”. The results demonstrate that instructor-facing LLM systems can effectively close engagement gaps in CS1 by tailoring instructional explicitness to student needs. Comments: Accepted at ITiCSE Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2604.27433 [cs.HC] (or arXiv:2604.27433v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2604.27433 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3803400.3809330 Focus to learn more DOI(s) linking to related resources Submission history From: Harry Ye [view email] [v1] Thu, 30 Apr 2026 05:15:30 UTC (115 KB)
[HC-23] Leading Across the Spectrum of Human-AI Relationships: A Conceptual Framework for Increasingly Heterogeneous Teams
【速读】:该论文旨在解决人机协同决策过程中领导权归属模糊的问题,即当人类与人工智能(Artificial Intelligence, AI)共同参与重要决策时,难以清晰识别谁真正主导了决策过程及其背后的权力结构。其解决方案的关键在于提出一个面向领导者的五级决策配置谱系:纯人类主导(Pure Human)、半人马模式(Centaur,人类主导且AI在环中)、平等协作(Co-equal)、牛头怪模式(Minotaur,AI主导且人类在环中)和纯AI主导(Pure AI)。该谱系通过明确“谁定义问题、谁引导工作方向、谁对结果负责”三个维度来定位决策中的领导位置,帮助领导者识别当前配置、察觉配置变化,并判断其是否适配具体决策场景。此外,论文引入“共适应性”(co-adaptability)概念,强调人与非人类参与者需协同调整以提升整体效能,从而为组织中权力、责任与信任的分配提供可操作的分析框架。
链接: https://arxiv.org/abs/2604.27392
作者: Alejandro R. Jadad
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 13 pages, 1 figure, 1 table, 1 appendix, 8 references
Abstract:What shapes a consequential decision when human and artificial intelligence work on it together? The answer is becoming harder to see. A decision may look human-led after AI has set the frame, or appear automated while human judgment still carries decisive force. This paper offers a leadership-facing spectrum to see those relationships within a bounded mandate: Pure Human, Centaur (human-dominant, with AI in the loop), Co-equal, Minotaur (AI-dominant, with humans in the loop), and Pure AI. The spectrum asks where leadership work occurs: who frames the problem, who redirects the work, and who can answer for what follows. The five positions are landmarks that help leaders recognize configurations as they layer, drift, or change in a single decision. The central risk is misrecognition: leaders may keep a human-centered story in place after decision-shaping authority has shifted elsewhere. They may believe oversight remains meaningful when it has become ceremonial, or keep humans in the loop when their involvement could make the decision worse. The framework introduces co-adaptability, the capacity of a configuration to improve as human and non-human participants adjust together, and places it within heterogeneous teaming, where participants may vary by number, substrate, model architecture, capability, speed, memory, and form of participation. The aim is practical: to help strategic leaders and those designing or deploying AI systems recognize the configuration at work, notice when it shifts, and judge whether it fits the decision before them. These configurations will shape how power, responsibility, and trust are distributed in organizational life. Whether the futures they help create remain governable and worth inhabiting will depend on leaders who can see, early enough, where and how consequential decisions are actually being shaped. Comments: 13 pages, 1 figure, 1 table, 1 appendix, 8 references Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC) Cite as: arXiv:2604.27392 [cs.AI] (or arXiv:2604.27392v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.27392 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-24] An Experimental Modular Instrument With a Haptic Feedback Framework for Robotic Surgery Training
【速读】:该论文旨在解决机器人辅助手术中因缺乏直接触觉反馈而导致的工具-组织作用力控制困难问题,进而增加手术风险。其关键解决方案在于开发了一种模块化实验性机器人腹腔镜器械,集成实时触觉反馈框架:通过腕部安装的力/扭矩(Force/Torque, F/T)传感器估计工具与组织间的交互力,避免了尖端安装传感器在耐用性和集成方面的挑战;同时构建了触觉反馈机制,提取外部接触力并渲染至触觉设备,生成稳定且感知有意义的反馈,从而显著提升术者对力调节的准确性、任务成功率和效率。
链接: https://arxiv.org/abs/2604.27385
作者: Walid Shaker,Mustafa Suphi Erden
机构: Heriot-Watt University (赫瑞-瓦特大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注: Accepted to the 11th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob 2026)
Abstract:Robotic-assisted surgery offers significant clinical advantages but largely eliminates direct haptic feedback, increasing the risk of excessive tool-tissue interaction forces. Although recent commercial systems have begun to introduce force feedback, their high cost limits accessibility, particularly for surgical training. This paper presents a modular experimental robotic laparoscopic instrument integrated with a real-time haptic feedback framework. The proposed instrument employs a wrist-mounted force/torque (F/T) sensor to estimate tool-tissue interaction forces while avoiding the durability and integration challenges of tip-mounted sensors. A haptic feedback framework is developed to extract the external contact forces, render them to the haptic device, and generate stable and perceptually meaningful feedback. The instrument is integrated into the robotic surgery training system (RoboScope) and evaluated through a controlled user study involving a force regulation task. Experimental results demonstrate that haptic feedback significantly improves task success rate, force regulation accuracy, and task efficiency compared to visual-only feedback. The proposed instrument enables stable, high-fidelity haptic interaction, supporting effective robotic surgery training.
[HC-25] Evaluating Epistemic Guardrails in AI Reading Assistants: A Behavioral Audit of a Minimal Prototype
【速读】:该论文旨在解决生成式 AI(Generative AI)在阅读辅助场景中引发的“解释位移”(interpretive displacement)问题,即系统将原本由人类读者承担的意义建构任务过度转移至自身,从而削弱用户的批判性阅读能力。解决方案的关键在于提出并实证检验“认识论护栏”(epistemic guardrails)的概念,通过设计一个名为 TextWalk 的最小化共读原型,采用固定十提示协议逐步施加从基础支持到解释性探究、边界压力及显式捷径压力的梯度挑战,使护栏成为可观察的交互行为属性而非静态指令特征。研究发现,系统在中间阶段(支持与替代之间)表现出最显著的风险:虽保持稳定性和教学性,却过度分配解释劳动,揭示了护栏失效的隐蔽模式,并为评估对话式阅读助手中的解释边界功能提供了可操作的协议和动态模型。
链接: https://arxiv.org/abs/2604.27275
作者: Matthew Christian Agustin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 18 pages, 1 figure
Abstract:Large language model (LLM) reading assistants are increasingly used in settings that require interpretation rather than simple retrieval. In these contexts, the central risk is not only error or unsafe output, but interpretive displacement: the transfer of meaning-making work from reader to system. This paper examines that problem through the concept of epistemic guardrails, defined here as constraints on how an artificial intelligence (AI) system participates in reading and interpretation. Using TextWalk, a minimal reading-support prototype designed as a co-reader rather than an answer-provider, the study applies a fixed ten-prompt protocol to twelve analytical texts spanning four categories of argumentative prose. The protocol escalates from baseline reading support to interpretive inquiry, boundary stress, and explicit shortcut pressure, enabling guardrails to be examined as behavioral properties observable in interaction rather than as static instruction features. Results show strong baseline stability, measurable strain during interpretive inquiry, partial recovery under direct boundary stress, and late-stage stabilization under escalation pressure. The most consequential weaknesses did not appear as overt collapse, but in a middle zone between support and substitution, where the system remained grounded and pedagogical while redistributing too much interpretive labor away from the reader. The paper contributes a protocol for evaluating epistemic guardrails as interactional phenomena in conversational AI reading assistants, an empirical account of their behavioral dynamics under pressure, and an emerging model of interpretive boundary function in reading-support AI.
[HC-26] Upskilling with Generative AI: Practices and Challenges for Freelance Knowledge Workers
【速读】:该论文旨在解决自由职业者在平台化工作环境中面临的学习困境,即如何利用生成式 AI(Generative AI)工具有效应对技能更新压力与市场不确定性。其核心问题是:尽管生成式 AI 提供了即时学习支持,但自由职业者在实际使用中仍面临工具可靠性不足、技能获取缺乏情境适配性以及技能验证机制缺失等挑战,导致学习行为从长期发展转向短期生存导向,并产生“隐形能力”(invisible competencies)——即技能虽通过 AI 获取却难以在竞争性市场中被可信地展示或认可。解决方案的关键在于基于自我导向学习理论(self-directed learning theory),提出针对自由职业者需求的生成式 AI 学习工具设计优化路径,重点增强工具的内容一致性、上下文相关性及技能认证功能,从而缓解结构性学习障碍并提升其市场竞争力。
链接: https://arxiv.org/abs/2604.27231
作者: Kashif Imteyaz,Isabel Lopez,Nakul Rajpal,Hunjun Shin,Saiph Savage
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Freelance workers must continually acquire new skills to remain competitive in online labor markets, yet they lack the organizational training, mentorship, and infrastructure available to traditional employees. Generative AI-powered tools like ChatGPT are reshaping market skill demands while also offering new forms of on-demand learning support to meet those demands. Despite growing interest in AI-powered learning tools, little is known about how freelancers actually use these tools to learn, the challenges they encounter, and how generative AI for learning interacts with precarity and competition in platform-based work. We present a mixed-methods study combining a survey and semi-structured interviews with freelance knowledge workers. Grounded in self-directed learning theory, we examine how freelancers integrate generative AI tools into their learning practices. Our findings show that freelancers increasingly rely on generative AI to structure learning and support exploratory skill acquisition, but do not treat it as their primary learning resource due to inconsistency, lack of contextual relevance, and verification overhead. We identify a shift from learning as growth to learning as survival, where upskilling is oriented toward immediate market viability rather than long-term development. We also surface a structural challenge we term invisible competencies, in which workers acquire skills through generative AI tools but lack credible ways to signal or validate these skills in competitive freelance markets. Based on these insights, we offer design recommendations for generative AI-powered learning tools for freelancers.
[HC-27] Reading Speed Image Quality Ratings and Comfort Ratings in Augmented Reality
【速读】:该论文旨在解决增强现实(Augmented Reality, AR)中文本渲染与显示的性能评估问题,尤其关注阅读速度、视觉质量及舒适度等关键指标在不同AR设备架构下的表现差异。其解决方案的关键在于构建了一个名为Read-AR的大规模基准数据集,包含超过11,000次阅读速度测量和近6,000份视觉质量与舒适度评分,并在统一实验条件下覆盖80余种不同的实验配置,从而为不同AR头显架构的质量评估提供可复现、标准化的参考依据。
链接: https://arxiv.org/abs/2604.27203
作者: Minjung Kim,Saeideh Ghahghaei Nezamabadi,Trisha Lian,Anand Singh
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The rendering and display of text is a key use-case for augmented reality (AR). Here, we present the Read-AR, a dataset of reading in AR, for which we collected over 11,000 reading speeds and almost 6000 visual quality and comfort ratings across over 80 different experiment conditions on the same experiment set-up. The consistent, controlled set-up enables the dataset to function as a reference for benchmarking the quality of different AR headset architectures.
[HC-28] he Field of Safe Motion: Operationalizing Affordances in the Field of Safe Travel Using Reachability Analysis
【速读】:该论文旨在解决如何定量评估驾驶员是否始终具备保持无碰撞逃生路径(即“out”)的能力这一问题,尤其在复杂交通环境中需同时考虑驾驶员自身物理能力与其他道路使用者的可预见行为。解决方案的关键在于将“安全运动场”(Field of Safe Motion, FSM)与可达性分析(reachability analysis)相结合:FSM通过量化建模提供了一种可解释的安全边界判断框架,而可达性分析则利用可解释的运动学模型来计算道路使用者可能的动作范围,从而对其他交通参与者未来合理可预见的位置进行不确定性约束。这种融合不仅实现了对驾驶行为的定量评估,还确保了模型基于少量易于枚举和推理的基本假设,具有良好的可解释性和跨场景适用性。
链接: https://arxiv.org/abs/2604.27168
作者: Leif Johnson,Trent Victor,Johan Engström
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:We present the Field of Safe Motion (FSM), a quantitative safety model for determining whether a driver maintains a collision-free escape route, or “out,” at any given moment by accounting for that driver’s physical capabilities and the foreseeable actions of other road users. The Field of Safe Travel (FST) provides a framework for representing the types of sensory information and actions available to drivers. However, the FST has remained conceptual in nature since its initial publication almost 90 years ago – and a concrete computational operationalization is still lacking. At the same time, reachability analysis provides a quantitative basis for assessing the possible actions available to road users, using interpretable kinematic models, but reachability models have so far remained confined largely to the engineering and robotics literature. Bringing these two approaches together provides for an interpretable, quantitative tool for assessing driving behavior across a wide range of driving scenarios. Beyond being interpretable, our approach relies on a relatively small set of basic assumptions that are easy to enumerate and reason about. Furthermore, an interpretable reachability model paired with kinematic assumptions provides a way to bound uncertainty about road users’ reasonably foreseeable future locations. We demonstrate the applicability of the FSM to different driving scenarios and discuss the strengths and weaknesses of the model.
[HC-29] Unpacking Vibe Coding: Help-Seeking Processes in Student-AI Interactions While Programming
【速读】:该论文旨在解决生成式 AI(Generative AI)在高等教育编程教学中如何有效促进学生学习的问题,特别是关注学生与 AI 交互方式对学习成效的影响。研究发现,高绩效学生倾向于采用工具性求助(instrumental help-seeking),如提问和探索,从而激发 AI 提供具有指导性的回应;而低绩效学生则多依赖执行型求助(executive help-seeking),频繁将任务委托给 AI,使其扮演执行者角色,提供现成解决方案。解决方案的关键在于:AI 系统需从被动响应转向主动引导,通过教育学对齐的设计识别非生产性委托行为,并自适应地将互动导向探究式学习,确保人机协作增强而非替代学生的认知努力。
链接: https://arxiv.org/abs/2604.27134
作者: Daiana Rinja,Eduardo Araujo Oliveira,Sonsoles López-Pernas,Mohammed Saqr,Marcus Specht,Kamila Misiejuk
机构: CATALPA, FernUniversität in Hagen, 58097 Hagen, Germany; School of Computing and Information Systems, University of Melbourne, Parkville VIC 3010, Australia; School of Computing, University of Eastern Finland, 80110 Joensuu, Finland
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted by the 27th International Conference on Artificial Intelligence in Education (AIED’26)
Abstract:Generative AI is reshaping higher education programming through vibe coding, where students collaborate with AI via natural language rather than writing code line-by-line. We conceptualize this practice as help-seeking, analyzing 19,418 interaction turns from 110 undergraduate students. Using inductive coding and Heterogeneous Transition Network Analysis, we examined interaction sequences to compare top- and low-performing students. Results reveal that top performers engaged in instrumental help-seeking – inquiry and exploration – eliciting tutor-like AI responses. In contrast, low performers relied on executive help-seeking, frequently delegating tasks and prompting the AI to assume an executor role focused on ready-made solutions. These findings indicate that currently generative AI mirrors student intent (whether productive or passive) rather than optimizing for learning. To evolve from tools to teammates, AI systems must move beyond passive compliance. We argue for pedagogically aligned design that detect unproductive delegation and adaptively steer educational interactions toward inquiry, ensuring student-AI partnerships augment rather than replace cognitive effort.
[HC-30] What Influences Readers and Writers Perceived Necessity of AI Disclosure?
【速读】:该论文试图解决的问题是:在人工智能(AI)日益广泛应用于写作的背景下,何种因素会影响读者与作者对AI使用披露必要性的判断。解决方案的关键在于通过三维度分析——视角(读者或作者)、目的(阅读或写作目标)以及程序性因素(包括AI使用的可替代性、努力程度、意图性和直接性),揭示了披露必要性的核心驱动机制。研究发现,读者普遍比作者更倾向于认为披露是必要的,且当AI贡献不可替代、直接嵌入文本且作者未主动引导生成时,披露被视为更为必要;此外,作者的意图性对双方感知产生相反影响,而写作努力程度则无显著作用。这一发现为制定符合使用者认知的AI透明度规范提供了实证依据和设计方向。
链接: https://arxiv.org/abs/2604.27129
作者: Jingchao Fang,Victoria Xiaohan Wen,Mina Lee
机构: University of Chicago(芝加哥大学)
类目: Human-Computer Interaction (cs.HC)
备注: 27 pages, 4 figures. Accepted to FAccT 2026
Abstract:The growing capability of artificial intelligence (AI) leads to its increasing adoption in writing, spurring discussions around whether writers should disclose their AI use in writing. What influences the perceived necessity of disclosure? We look into this question from three dimensions: perspective (reader or writer of the text), purpose (the goal of reading or writing), and procedural factors (how AI was used in the writing process in terms of replaceability, effortfulness, intentionality, and directness). In a vignette study (N = 727), we find that readers consider disclosure to be more necessary than writers, and disclosure is regarded as more necessary when AI’s contribution in writing is irreplaceable, directly incorporated, and when the writer does not intentionally steer AI generation. To our surprise, the writers’ intentionality of AI use produces contrasting effects on readers’ and writers’ perceived necessity of disclosure. Moreover, the effort of writing shows no significant effect on the perceived necessity. This study contributes to the conversation on transparent AI use by revealing readers’ and writers’ grassroots judgments, providing a unique angle to reflect on existing regulations, and offering insights into how AI disclosure guidance and tools could be designed to better align with readers’ and writers’ perceptions.
[HC-31] Breaking Bad Financial Habits: How LLM Conversations Correct Financial Misconceptions
【速读】:该论文旨在解决金融认知误区(financial misconceptions)难以纠正的问题,这类误区会导致投资者产生恐慌性抛售或回避股市等非理性行为,而传统金融素养干预措施受限于成本、覆盖范围以及知识与行为之间的鸿沟。研究发现,经过精心设计的大语言模型(Large Language Models, LLMs)能够持久地修正这些误区,其关键在于两个必要条件:一是“矫正意图”(corrective intent),即LLM必须被明确引导以纠正特定误解,单纯讨论误解或无目标对话反而可能强化错误认知;二是“接收者接受度”(recipient receptivity),即LLM的回应需匹配用户的金融素养水平,若内容过于浅显则会被视为不可信,从而显著削弱纠正效果。因此,LLM作为可扩展的金融素养干预手段具有潜力,但前提是同时满足上述两个设计要素。
链接: https://arxiv.org/abs/2604.27022
作者: Jillian Ross,Eric So,Andrew W. Lo
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Financial misconceptions carry direct economic costs, from panic selling to equity market avoidance, yet they are notoriously resistant to correction. Traditional financial literacy interventions are constrained by cost, reach, and a persistent gap between knowledge and behavioral change. Across three pre-registered studies, we find that purposefully designed LLMs can durably correct financial misconceptions. Critically, two factors are necessary for this effect. First, corrective intent: LLMs prompted only to discuss a misconception produce corrections no better than unassisted self-reflection, and undirected LLM conversations can actively entrench misconceptions. Second, recipient receptivity: financial concepts are often foreign to the investors who misapply them, and LLM responses pitched below a participant’s financial sophistication are judged as less credible and produce substantially weaker corrections. LLMs thus offer a scalable alternative to traditional financial literacy intervention, but only when designed with both factors in mind.
[HC-32] Quantifying the Cost of Manual Navigation: A Comparison of Gesture-Based Magnification versus Direct Access Reading in Digital Layout-based Documents
【速读】:该论文旨在解决数字文档中不同用户群体在使用手势交互(如缩放和平移)进行阅读时,因结构信息获取受限而导致的体验质量下降问题。研究聚焦于报纸类布局密集的文档,发现传统基于手势的放大方式虽为行业标准,但会破坏读者自然的阅读策略,增加认知负荷并降低任务效率。其解决方案的关键在于引入大字版(large-print edition)的直接访问机制,通过布局适配与字体缩放相结合的方式,使用户能够快速定位目标内容并恢复自然的阅读路径,从而显著提升阅读速度(提高18%)、定位效率(提高30%),同时降低主观工作负载并增强用户偏好。这一方法强调了自动化生成大字版内容的重要性,以兼顾可访问性与用户体验的一致性。
链接: https://arxiv.org/abs/2604.27010
作者: Sebastián Gallardo(BIOVISION, DAJ),Hui-Yin Wu(BIOVISION),Dorian Mazauric(ABSLab (Poitiers), Terra Numerica),Pierre Kornprobst(UniCA, BIOVISION, ABSLab (Poitiers)),Monica Di Meo(CHU),Stéphanie Baillif(CHU),Aurelie Calabrese(AMU, LPC)
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Understanding how diverse audiences engage with structured media is critical to ensure a consistent quality of experience. In this context, we quantify the behavioral and performance cost of manual navigation (e.g., pinch and zoom) versus direct structural access in layout-based digital documents. We specifically investigate newspaper reading when visual access to structural cues (headlines as entry points) is constrained. Participants completed two tasks-reading all headlines aloud and locating target articles-under two conditions: (1) original edition with gesture-based magnification (pan and zoom), which is the industry standard for digital documents, and (2) large-print edition supporting direct-access reading. We collected performance measures (success ratio and completion time), behavioral integrity through reading path analysis, alongside perceived workload and preferences (NASA-TLX). Results from linear mixed-effects models show that the large-print condition yielded not only better performance than gesture-based magnification (18% improvement in reading speed, 30% improvement in speed to locate a target), but more importantly, restored the natural reading strategy that gesture-based magnification interaction disrupts. Readers also reported lower workload and higher preference. These findings highlight the importance of developing automated methods for generating large-print editions, where layout adaptation complements font scaling to support accessibility and quality of experience.
[HC-33] Can AI be a moral victim? The role of moral patiency and ownership perceptions in ethical judgments of using AI-generated content
【速读】:该论文试图解决的问题是:随着生成式AI(Generative AI)的广泛应用,人们在判断AI生成内容的再使用行为时,其道德评价与传统人类创作内容存在何种差异,以及这种差异背后的认知机制是什么。解决方案的关键在于通过实验设计比较三种情境下人们对内容重用行为的道德评判——即原始作者为人类、AI系统或带有拟人化名称的AI代理,并发现:相较于人类创作内容,复制AI生成内容被判定为更少 unethical(不道德)、更少 plagiarism(剽窃)且引发更少 guilt(内疚感),这种宽容态度主要源于对AI缺乏道德受体性(moral patiency)的认知,以及将所有权更多归于人类使用者的倾向;此外,拟人化线索通过降低对内容所有权的感知间接影响道德评价。这一发现揭示了人类在使用AI生成内容时存在的道德脱敏现象及其内在心理机制。
链接: https://arxiv.org/abs/2604.26956
作者: Hyesun Choung,Soojong Kim
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Honourable Mention Award, ACM CHI 2026
Abstract:The growing use of generative AI raises ethical concerns about authorship and plagiarism. This study examines how people judge the reuse of AI-generated content, focusing on moral patiency and ownership perceptions. In an experiment, participants evaluated two substantively similar manuscripts in which the original source was described as authored by a human, an AI system, or an AI agent with a human-like name. Results showed that copying AI-generated work was judged less unethical, less plagiaristic, and less guilt-inducing than copying human-authored work. Mediation analyses revealed that this leniency stemmed from lower perceptions of AI’s capacity to suffer harm (moral patiency) and greater ownership attributed to the human writer reusing AI-generated content. Anthropomorphic cues shaped moral evaluations indirectly by reducing perceived ownership. These findings shed light on how people morally disengage when using AI-generated work and highlight differences in how ethical judgments are applied to human versus AI-created content.
计算机视觉
[CV-0] OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction
【速读】:该论文旨在解决多主体协同(multiadic collaboration)在真实家庭环境中难以实现的问题,即多个机器人与人类在同一空间中并发执行交错任务时,由于紧密的时空耦合和持续遮挡导致的实时3D感知瓶颈。其解决方案的关键在于提出OmniRobotHome平台,该平台通过48个硬件同步的RGB摄像头实现无标记、抗遮挡的多人多物实时三维跟踪,并与两个Franka机械臂在统一世界坐标系下协同操作,从而支持长期行为建模与安全高效的协作。此系统首次实现了室规模、实时、鲁棒的环境感知与多机器人动作协调,使多主体协同实验成为可能。
链接: https://arxiv.org/abs/2604.28197
作者: Junyoung Lee,Sookwan Han,Jeonghwan Kim,Inhee Lee,Mingi Choi,Jisoo Kim,Wonjung Woo,Hanbyul Joo
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Human-robot collaboration has been studied primarily in dyadic or sequential settings. However, real homes require multiadic collaboration, where multiple humans and robots share a workspace, acting concurrently on interleaved subtasks with tight spatial and temporal coupling. This regime remains underexplored because close-proximity interaction between humans, robots, and objects creates persistent occlusion and rapid state changes, making reliable real-time 3D tracking the central bottleneck. No existing platform provides the real-time, occlusion-robust, room-scale perception needed to make this regime experimentally tractable. We present OmniRobotHome, the first room-scale residential platform that unifies wide-area real-time 3D human and object perception with coordinated multi-robot actuation in a shared world frame. The system instruments a natural home environment with 48 hardware-synchronized RGB cameras for markerless, occlusion-robust tracking of multiple humans and objects, temporally aligned with two Franka arms that act on live scene state. Continuous capture within this consistent frame further supports long-horizon human behavior modeling from accumulated trajectories. The platform makes the multiadic collaboration regime experimentally tractable. We focus on two central problems: safety in shared human-robot environments and human-anticipatory robotic assistance, and show that real-time perception and accumulated behavior memory each yield measurable gains in both.
[CV-1] HERMES: Toward a Unified Driving World Model for 3D Scene Understanding and Generation ICCV25
【速读】:该论文旨在解决当前驾驶世界模型(Driving World Model)在生成未来场景时忽视三维场景理解,以及大语言模型(LLM)缺乏对未来几何演化预测能力之间的鸿沟问题。其核心挑战在于如何实现语义理解与物理模拟的一致性,从而构建一个统一的框架以同时提升3D场景理解能力和未来几何预测精度。解决方案的关键在于:1)采用BEV(Bird’s Eye View)表示将多视角空间信息整合为适配LLM的结构;2)引入LLM增强的世界查询(World Queries),促进理解分支的知识迁移;3)设计“当前到未来链接”(Current-to-Future Link),基于语义上下文条件化几何演化;4)通过联合几何优化策略(Joint Geometric Optimization),融合显式几何约束与隐式潜在正则化,确保内部表征符合几何先验,从而实现结构完整性与预测准确性的协同提升。
链接: https://arxiv.org/abs/2604.28196
作者: Xin Zhou,Dingkang Liang,Xiwu Chen,Feiyang Tan,Dingyuan Zhang,Hengshuang Zhao,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); Mach Drive; University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Extended version of ICCV 25 paper HERMES, Code: this https URL , Project page: this https URL
Abstract:Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks. The model and code will be publicly released at this https URL.
[CV-2] Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
【速读】:该论文旨在解决从稀疏、无姿态约束的自然图像中重建室外三维场景的问题,尤其在光照变化和瞬时遮挡等真实世界条件下表现不佳的挑战。现有方法依赖于针对每个场景的优化(如使用外观嵌入或动态掩码),不仅需要大量场景特定训练,且在视图稀疏时失效,同时其评估样本有限,难以验证泛化能力。解决方案的关键在于提出一种前馈式框架 GenWildSplat,它通过学习的几何先验,在标准化空间中直接预测深度、相机参数和3D高斯分布,无需任何场景级微调;同时引入外观适配器以调节目标光照条件下的外观特征,并结合语义分割处理瞬时遮挡物体,借助合成与真实数据的课程学习策略实现跨光照与遮挡模式的良好泛化能力,最终在 PhotoTourism 和 MegaScenes 基准上达到实时推理的先进渲染质量。
链接: https://arxiv.org/abs/2604.28193
作者: Vinayak Gupta,Chih-Hao Lin,Shenlong Wang,Anand Bhattad,Jia-Bin Huang
机构: University of Maryland, College Park (马里兰大学学院公园分校); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization using appearance embeddings or dynamic masks, which requires extensive per-scene training and fails under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on PhotoTourism and MegaScenes benchmark demonstrate state-of-the-art feed-forward rendering quality, achieving real-time inference without test-time optimization
[CV-3] LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作中因依赖静态模仿学习而导致适应性与泛化能力受限的问题,以及现有在线强化学习(Reinforcement Learning, RL)方法仅优化动作空间而忽略物理推理过程的缺陷。其关键解决方案是提出一个统一的VLA框架LaST-R1,通过引入潜空间链式思维(Latent Chain-of-Thought, CoT)推理机制,在动作执行前对物理动态进行连续隐式推理,并设计一种新型强化学习算法——潜空间到动作策略优化(Latent-to-Action Policy Optimization, LAPO),该算法联合优化潜空间推理过程与动作生成,从而提升物理世界建模能力和交互环境中的鲁棒性;同时,引入自适应潜空间CoT机制以动态调整推理深度,增强对环境复杂度的响应能力。
链接: https://arxiv.org/abs/2604.28192
作者: Hao Chen,Jiaming Liu,Zhonghao Yan,Nuowei Han,Renrui Zhang,Chenyang Gu,Jialin Gao,Ziyu Guo,Siyuan Qian,Yinxi Wang,Peng Jia,Chi-Wing Fu,Shanghang Zhang,Pheng-Ann Heng
机构: The Chinese University of Hong Kong; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Simplexity Robotics
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches share a critical limitation: whether employing explicit linguistic reasoning that suffers from latency and discretization, or utilizing more expressive continuous latent reasoning, they are predominantly confined to static imitation learning that limits adaptability and generalization. While online reinforcement learning (RL) has been introduced to VLAs to enable trial-and-error exploration, current methods exclusively optimize the vanilla action space, bypassing the underlying physical reasoning process. In this paper, we present \textbfLaST-R1, a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics prior to action execution, along with a tailored RL post-training paradigm. Specifically, we propose \textbfLatent-to-Action Policy Optimization (LAPO), a novel RL algorithm that jointly optimizes the latent reasoning process and the action generation. By bridging reasoning and control, LAPO improves the representation of physical world modeling and enhances robustness in interactive environments. Furthermore, an \textbfadaptive latent CoT mechanism is introduced to allow the policy to dynamically adjust its reasoning horizon based on environment complexity. Extensive experiments show that LaST-R1 achieves a near-perfect 99.8% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art methods. In real-world deployments, LAPO post-training yields up to a 44% improvement over the initial warm-up policy across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
[CV-4] Representation Fréchet Loss for Visual Generation
【速读】:该论文旨在解决生成式模型(Generative Models)中训练目标与评估指标的局限性问题,特别是Fréchet Distance(FD)长期被视为不适用于训练优化的困境。其核心挑战在于:传统方法难以有效利用FD作为训练目标,且现有评估指标如Inception FID可能无法准确反映生成样本的真实视觉质量。解决方案的关键在于提出一种名为FD-loss的新方法,通过将用于FD估计的种群规模(如50k)与用于梯度计算的批次大小(如1024)解耦,从而实现FD在表示空间中的高效优化。这一策略不仅使单步生成器在ImageNet 256x256上达到0.72 FID,还能将多步生成器转化为强效的一步生成器,无需教师蒸馏、对抗训练或逐样本目标,同时揭示了FID在某些表示空间下可能误判视觉质量的问题,进而推动了多表示空间度量FDr^k 的发展。
链接: https://arxiv.org/abs/2604.28190
作者: Jiawei Yang,Zhengyang Geng,Xuan Ju,Yonglong Tian,Yue Wang
机构: USC(南加州大学); CMU(卡内基梅隆大学); CUHK(香港中文大学); OpenAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and checkpoints are available at this https URL
Abstract:We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population size for FD estimation (e.g., 50k) from the batch size for gradient computation (e.g., 1024). We term this approach FD-loss. Optimizing FD-loss reveals several surprising findings. First, post-training a base generator with FD-loss in different representation spaces consistently improves visual quality. Under the Inception feature space, a one-step generator achieves0.72 FID on ImageNet 256x256. Second, the same FD-loss repurposes multi-step generators into strong one-step generators without teacher distillation, adversarial training or per-sample targets. Third, FID can misrank visual quality: modern representations can yield better samples despite worse Inception FID. This motivates FDr ^k , a multi-representation metric. We hope this work will encourage further exploration of distributional distances in diverse representation spaces as both training objectives and evaluation metrics for generative models.
[CV-5] Visual Generation in the New Era: An Evolution from Atomic Mapping to Agent ic World Modeling
【速读】:该论文旨在解决当前视觉生成模型在空间推理、持久状态保持、长时一致性及因果理解等方面的不足,这些问题限制了模型从单纯外观合成向具备结构化认知与动态交互能力的智能视觉生成系统演进。其核心解决方案在于提出一个五层分类体系(Atomic Generation → Conditional Generation → In-Context Generation → Agentic Generation → World-Modeling Generation),将生成模型的发展路径从被动渲染推进至具有代理行为和世界建模能力的智能体;同时强调关键技术驱动因素,包括流匹配(flow matching)、统一的理解-生成模型、改进的视觉表示、后训练策略、奖励建模、数据筛选与蒸馏以及采样加速,并指出现有评估方法因过度关注感知质量而忽视结构、时间与因果层面的缺陷,主张采用以能力为中心的评测框架来推动下一代智能视觉生成系统的实质性进步。
链接: https://arxiv.org/abs/2604.28185
作者: Keming Wu,Zuhao Yang,Kaichen Zhang,Shizun Wang,Haowei Zhu,Sicong Leng,Zhongyu Yang,Qijie Wang,Sudong Wang,Ziting Wang,Zili Wang,Hui Zhang,Haonan Wang,Hang Zhou,Yifan Pu,Xingxuan Li,Fangneng Zhan,Bo Li,Lidong Bing,Yuxin Song,Ziwei Liu,Wenhu Chen,Jingdong Wang,Xinchao Wang,Xiaojuan Qi,Shijian Lu,Bin Wang
机构: Tsinghua University (清华大学); Nanyang Technological University (南洋理工大学); University of Hong Kong (香港大学); National University of Singapore (新加坡国立大学); University of Waterloo (滑铁卢大学); StepFun; MiroMind; Baidu (百度); Fudan University (复旦大学); Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州) ); LMMs-Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.
[CV-6] Stop Holding Your Breath: CT-Informed Gaussian Splatting for Dynamic Bronchoscopy
【速读】:该论文旨在解决支气管镜导航中因呼吸运动导致的CT图像与实际解剖结构之间形变不匹配问题(即CT-to-body divergence),该问题通常限制定位精度并依赖难以重复执行的屏气协议,从而影响临床流程。解决方案的关键在于利用患者特异性呼吸建模:通过已用于术前规划的吸气-呼气配对CT扫描隐式定义个体化气道变形空间,并通过对这两幅CT图像进行配准将呼吸运动简化为每帧的一个标量呼吸相位参数,从而约束所有重建结果在解剖学上合理的配置范围内;进一步地,该方法嵌入到基于网格锚定的高斯点云渲染框架中,由轻量级估计器直接从内窥镜RGB图像推断呼吸相位,实现无需屏气或外部传感的连续、形变感知重建,显著提升几何保真度和定位精度(1.22 mm,优于3 mm临床容差)。
链接: https://arxiv.org/abs/2604.28179
作者: Andrea Dunn Beltran,Daniel Rho,Aarav Mehta,Xinqi Xiong,Raúl San José Estépar,Ron Alterovitz,Marc Niethammer,Roni Sengupta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Bronchoscopic navigation relies on registering endoscopic video to a preoperative CT scan, but respiratory motion deforms the airway by 5-20 mm, creating CT-to-body divergence that limits localization accuracy. In practice, this is mitigated through breath-hold protocols, which attempt to match the intraoperative anatomy to a static CT, but are difficult to reproduce and disrupt clinical workflow. We propose to eliminate the need for breath-hold protocols by leveraging patient-specific respiratory modeling. Paired inhale-exhale CT scans, already acquired for planning, implicitly define the patient-specific deformation space of the breathing airway. By registering these scans, we reduce respiratory motion to a single scalar breathing phase per frame, constraining all reconstructions to anatomically observed configurations. We embed this representation within a mesh-anchored Gaussian splatting framework, where a lightweight estimator infers breathing phase directly from endoscopic RGB, enabling continuous, deformation-aware reconstruction throughout the respiratory cycle without breath-holds or external sensing. To enable quantitative evaluation, we introduce RESPIRE, a physically grounded bronchoscopy simulation pipeline with per-frame ground truth for geometry, pose, breathing phase, and deformation. Experiments on RESPIRE show that our approach achieves geometrically faithful reconstruction, over 20x faster training, and 1.22 mm target localization accuracy (within the 3mm clinically relevant tolerances) outperforming unconstrained single-CT baselines. Please check out our website for additional visuals: this https URL
[CV-7] AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images ACL2026
【速读】:该论文旨在解决当前学术图像伪造检测(forensic analysis of AI-generated academic images)缺乏系统性、多维度评估标准的问题。现有基准在领域复杂性、伪造手法多样性及评估维度上存在局限,难以真实反映检测模型在实际场景中的性能瓶颈。其解决方案的关键在于提出AEGIS这一综合性评测基准,通过三个核心创新实现突破:一是引入领域特定复杂性,覆盖7个学术类别与39种细粒度子类型,揭示伪造检测的内在难度;二是模拟四种主流学术伪造策略,涵盖25种生成模型,验证当前伪造技术已显著超越检测能力(11种模型平均检测准确率低于50%);三是构建多维评估体系,联合考察检测、推理与定位能力,揭示不同模型家族间的互补优势,如多模态大语言模型(MLLMs)在文本类伪造识别中达84.74%准确率,而专家检测器在二元真实性判断中最高达79.54%。
链接: https://arxiv.org/abs/2604.28177
作者: Bo Zhang,Tzu-Yen Ma,Zichen Tang,Junpeng Ding,Zirui Wang,Yizhuo Zhao,Peilin Gao,Zijie Xi,Zixin Ding,Haiyang Sun,Haocheng Gao,Yuan Liu,Liangjia Wang,Yiling Huang,Yujie Wang,Yuyue Zhang,Ronghui Xi,Yuanze Li,Jiacheng Liu,Zhongjun Yang,Haihong E
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: Accepted to ACL 2026 Main Conference
Abstract:We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
[CV-8] Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements CVPR2026
【速读】:该论文旨在解决人类行为建模中对身体运动表示的不足问题,即如何有效捕捉人体动作的层次化结构以提升行为理解任务的性能。现有方法往往难以建模动作的组成性(compositionality),导致对复杂行为的理解不够精细。解决方案的关键在于提出一种名为A4Mer的嵌套潜在Transformer架构,其通过自监督学习从3D姿态数据中自动提取两层抽象表示:Action Atoms(动作原子)用于捕获基本关节运动,而Action Motifs(动作模式)则由这些原子的时间组合构成,编码跨不同动作的可复用语义运动片段。A4Mer通过统一的掩码令牌预测预训练任务,在两个层级的潜在空间中实现端到端的学习,从而自然地涌现出具有语义意义的动作模式,显著提升了动作识别、运动预测和插值等下游任务的性能。
链接: https://arxiv.org/abs/2604.28173
作者: Genki Kinoshita,Shu Nakamura,Ryo Kawahara,Shohei Nobuhara,Yasutomo Kawanishi,Ko Nishino
机构: Kyoto University(京都大学); Kyoto Institute of Technology(京都工艺纤维大学); RIKEN(理化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: to be published in CVPR 2026 (Highlight)
Abstract:Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms that capture the atomic joint movements and Action Motifs that are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with a unified pretext task of masked token prediction in their respective latent spaces. We also introduce Action Motif Dataset (AMD), a large-scale dataset of multi-view human behavior videos with full SMPL annotations. We introduce a novel use of cameras by mounting them on the feet to achieve their frame-wise annotations despite frequent and heavy body occlusions. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Motifs, which significantly benefit human behavior modeling tasks including action recognition, motion prediction, and motion interpolation.
[CV-9] PhyCo: Learning Controllable Physical Priors for Generative Motion CVPR2026
【速读】:该论文旨在解决当前视频扩散模型在物理一致性方面的不足,例如物体漂移、碰撞缺乏真实反弹效果以及材料响应与实际物理属性不匹配等问题。其解决方案的关键在于提出PhyCo框架,通过三个核心组件实现:(1)构建包含超过10万条照片级真实感模拟视频的大规模数据集,系统性地变化摩擦、恢复系数、形变和受力等物理参数;(2)利用ControlNet结构对预训练扩散模型进行物理监督微调,条件输入为像素对齐的物理属性图;(3)采用视觉语言模型(VLM)引导的奖励优化机制,通过针对物理问题的查询提供可微分反馈以提升生成视频的物理合理性。此方法使生成模型能在无需仿真器或几何重建的情况下,通过调整物理属性实现可控且物理一致的视频生成。
链接: https://arxiv.org/abs/2604.28169
作者: Sriram Narayanan,Ziyu Jiang,Srinivasa Narasimhan,Manmohan Chandraker
机构: Carnegie Mellon University (卡内基梅隆大学); NEC Labs America (NEC美国实验室); UC San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2026. Project Page: this https URL
Abstract:Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes-without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
[CV-10] Continuous-tone Simple Points: An ell_0-Norm of Cyclic Gradient for Topology-Preserving Data-Driven Image Segmentation
【速读】:该论文旨在解决深度学习中拓扑一致性难以保障的问题,特别是在图像分割与骨架提取任务中,传统基于简单点(simple point)的拓扑保持方法受限于仅适用于二值图像且不可微,无法融入梯度优化框架。其解决方案的关键在于提出一种直接在连续值图像上计算简单点的新方法,实现了拓扑推理的可微性;在此基础上构建了能保持拓扑结构的骨架提取算法,并设计了一个变分模型,通过保留拓扑不可移除点(即非简单点)来强制施加拓扑约束,从而可无缝集成到任意具有softmax或sigmoid输出的深度神经网络分割模型中。
链接: https://arxiv.org/abs/2604.28159
作者: Wenxiao Li,Faqiang Wang,Yuping Duan,Li Cui,Liqiang Zhang,Jun Liu
机构: Beijing Normal University (北京师范大学); Laboratory of Mathematics and Complex Systems (Ministry of Education) (数学与复杂系统实验室(教育部)); State Key Laboratory of Remote Sensing Science (遥感科学国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Topological features play an essential role in ensuring geometric plausibility and structural consistency in image analysis tasks such as segmentation and skeletonization. However, integrating topology-preserving learning based on simple points into deep learning tasks remains challenging, as existing simple point detection methods are confined to binary images and are non-differentiable, rendering them incompatible with gradient-based optimization in modern deep learning. Moreover, morphological and purely data-driven approaches often fail to guaranty topological consistency. To address these limitations, we propose a novel method that directly computes simple points on continuous-valued images, enabling differentiable topological inference. Building on this theory, we develop an efficient skeleton extraction algorithm that preserves topological structures in binary and continuous-valued images. Furthermore, we design a variational model that enforces topological constraints by preserving topologically non-removable (i.e., non-simple) points, which can be seamlessly integrated into any deep neural network segmentation with softmax or sigmoid outputs. Experimental results demonstrate that the proposed approach effectively improves topological integrity and structural accuracy across multiple benchmarks. The codes are available in this https URL.
[CV-11] Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering
【速读】:该论文旨在解决夜间摄影渲染(Night Photography Rendering, NPR)中因场景内明暗区域对比极端而引发的视觉质量下降问题,尤其是现有方法在保真度指标上表现良好但存在显著感知差异、影响图像观感的问题。其解决方案的关键在于提出一种基于稳健HVI颜色空间的RAW-to-RGB框架pHVI-ISPNet,核心创新包括:在RAW域进行特征处理并结合小波变换实现高频细节传播以缓解细节丢失;采用样本级动态损失系数确保不同曝光水平下的稳定训练;以及基于特征分布的损失项保障严格的色彩恒常性。该设计有效提升了夜间成像的感知质量,在CIE2000色差和LPIPS指标上达到新SOTA水平。
链接: https://arxiv.org/abs/2604.28136
作者: Furkan Kınlı
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, Accepted to 2026 IEEE International Conference on Image Processing
Abstract:Night Photography Rendering (NPR) poses a significant challenge due to the extreme contrast between dark and illuminated areas in scenes, stemming from concurrent capture of severely dark regions alongside intense point light sources. Existing methods, which are mainly tailored for fidelity metrics, reveal considerable perceptual gaps and often detract from visual quality. We introduce pHVI-ISPNet, a novel RAW-to-RGB framework built on the robust HVI color space. Our network integrates four distinct key refinements: RAW-domain feature processing and Wavelet-based feature propagation to mitigate high-frequency detail loss; sample-based dynamic loss coefficients to ensure stable learning across varying exposure levels; and loss term based on feature distributions to maintain rigorous color constancy. Evaluations on the dataset introduced in the NTIRE 2025 challenge on NPR confirm our approach achieves competitive fidelity while establishing new state-of-the-art results in both CIE2000 color difference and LPIPS. This validates our perceptually-driven design for high-quality nighttime imaging.
[CV-12] 3D-ReGen: A Unified 3D Geometry Regeneration Framework
【速读】:该论文旨在解决从2D图像和初始3D形状中再生高质量3D对象的问题,尤其针对现有3D生成方法在可控性方面的局限性。当前大多数3D生成器采用一次性(one-shot)方式,难以实现对生成结果的精细控制。为此,作者提出3D-ReGen,其核心创新在于引入基于VecSet的条件机制,使模型能够基于初始3D形状进行几何细节的精细化更新与优化,从而支持3D增强、重建和编辑等多种任务。该方案的关键在于利用自监督预训练任务和数据增强从现成的3D数据集中学习通用的再生先验,无需额外标注,显著提升了可控3D生成的几何一致性与细粒度质量,在多个任务上达到最先进性能。
链接: https://arxiv.org/abs/2604.28134
作者: Geon Yeong Park,Roman Shapovalov,Rakesh Ranjan,Jong Chul Ye,Andrea Vedaldi,Thu Nguyen-Phuoc
机构: Meta Reality Labs (Meta现实实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 18 figures, 6 tables. Includes Appendix
Abstract:We consider the problem of regenerating 3D objects from 2D images and initial 3D shapes. Most 3D generators operate in a one-shot fashion, converting text or images to a 3D object with limited controllability. We introduce instead 3D-ReGen, a 3D regenerator that is conditioned on an initial 3D shape. This conceptually simple formulation allows us to support numerous useful tasks, including 3D enhancement, reconstruction, and editing. 3D-ReGen uses a new conditioning mechanism based on VecSet, which allows the regenerator to update or improve the input geometry with consistent fine-grained details. 3D-ReGen learns a widely applicable regeneration prior from off-the-shelf 3D datasets via self-supervised pretext tasks and augmentations, without additional annotations. We evaluate both the geometric consistency and fine-grained quality of 3D-ReGen, achieving state-of-the-art performance in controllable 3D generation across several tasks.
[CV-13] MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
【速读】:该论文旨在解决单目视频中任意骨骼(arbitrary-skeleton)动作捕捉的两个核心问题:一是传统分阶段方法中,仅从视频预测关节位置后,通过解析逆运动学(Inverse Kinematics, IK)恢复关节旋转时存在自由度模糊(如骨轴扭转不确定),且IK阶段不可微导致无法优化最终动画目标;二是现有方法常依赖网格中间表示,影响效率与鲁棒性。解决方案的关键在于提出首个端到端可学习框架,将Video-to-Pose和Pose-to-Rotation两个阶段联合优化,其中引入目标资产的参考姿态-旋转对作为先验信息,结合静息姿态(rest pose)共同定义旋转坐标系,从而消除姿态到旋转映射的歧义,使旋转预测变为一个有约束的条件问题。此外,模型直接从视频预测关节位置,无需网格中间表示,提升了鲁棒性和推理速度(相比基于网格的流水线快约20倍)。两阶段共享骨架感知的全局-局部图引导多头注意力模块(Global-Local Graph-guided Multi-Head Attention, GL-GMHA),实现关节级别的局部推理与全局协调。
链接: https://arxiv.org/abs/2604.28130
作者: Kehong Gong,Zhengyu Wen,Dao Thien Phong,Mingxi Xu,Weixia He,Qi Wang,Ning Zhang,Zhengyu Li,Guanli Hou,Dongze Lian,Xiaoyu He,Mingyuan Zhang,Hanwang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: this https URL
[CV-14] AdvDMD: Adversarial Reward Meets DMD For High-Quality Few-Step Generation
【速读】:该论文旨在解决扩散模型(Diffusion Models)在少步采样(few-step sampling)条件下生成质量下降的问题。尽管已有蒸馏方法(如DMD)可减少采样步骤,但其性能在极端少步场景下仍显著退化;现有基于强化学习(Reinforcement Learning, RL)的改进方案虽能提升生成质量,却因组合式设计引入冗余复杂性。本文提出AdvDMD方法,其核心创新在于将DMD蒸馏与RL有机融合:利用DMD2中对抗训练的判别器作为奖励模型(reward model),在线更新并同时监督去噪过程中的中间状态与最终状态,从而实现对采样轨迹的全局优化,缓解奖励欺骗(reward hacking)问题;此外,通过统一的随机微分方程(SDE)逆向模拟和差异化训练策略,提升了训练稳定性与效率。实验表明,AdvDMD在仅2–4步采样下即可超越原始40步模型及同类先进方法(如TwinFlow)。
链接: https://arxiv.org/abs/2604.28126
作者: Xu Wang,Zexian Li,Litong Gong,Tiezheng Ge,Zhijie Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models offer superior generation quality at the expense of extensive sampling steps. Distillation methods, with Distribution Matching Distillation (DMD) as a popular example, can mitigate this issue, but performance degradation remains pronounced when sampling steps are limited. Reinforcement learning (RL) has been leveraged to improve the few-step generation quality during distillation, with the potential to even surpass the performance of the teacher model. However, existing approaches are combinatorial in nature, merely integrating an RL process with the distillation process, which introduces unnecessary complexities. To address this gap, we propose AdvDMD, a method that seamlessly unifies DMD distillation and RL. Specifically, AdvDMD employs the adversarially trained discriminator from DMD2 as the reward model, which assigns low scores to generated images and high scores to real ones. It is trained on both intermediate and final states of the denoising process and updated online with the distilled model, enabling a holistic supervision of the sampling trajectories and mitigating reward hacking. We adopt a unified SDE backward simulation and a different training schedule for DMD and RL to enable a more stable and efficient training. Experimental results demonstrate that the 4-step AdvDMD outperforms the original 40-step model for SD3.5 on DPG-Bench, while achieving significant performance gains for SD3 on the GenEval. On Qwen-Image, our 2-step AdvDMD achieves superior performance over TwinFlow. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.28126 [cs.CV] (or arXiv:2604.28126v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.28126 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xu Wang [view email] [v1] Wed, 29 Apr 2026 16:56:05 UTC (107,795 KB) Full-text links: Access Paper: View a PDF of the paper titled AdvDMD: Adversarial Reward Meets DMD For High-Quality Few-Step Generation, by Xu Wang and Zexian Li and Litong Gong and Tiezheng Ge and Zhijie DengView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-04 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[CV-15] Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces
【速读】:该论文旨在解决当前视觉世界建模系统在高容量架构和大规模数据驱动下,虽能生成合理运动但难以保持底层3D几何结构或物理一致的相机动态的问题。其核心挑战在于现有模型的潜在表示未能有效编码几何信息。解决方案的关键是提出S²VAE(geometry-first latent learning framework),该框架以几何为导向,专注于压缩和表示场景的潜在3D状态(包括相机运动、深度和点级结构),而非仅建模外观;通过引入基于Power Spherical分布的变分自编码器(Variational Autoencoder, VAE),在瓶颈层显式强制超球形结构,从而在强压缩条件下保留方向性和几何语义。实验表明,这种几何对齐的超球形潜在变量在深度估计、相机位姿恢复和点云重建任务中显著优于传统高斯瓶颈,凸显了潜在几何结构作为物理基础视觉模型设计首要考量的重要性。
链接: https://arxiv.org/abs/2604.28122
作者: Andrew Bond,Ilkin Umut Melanlioglu,Erkut Erdem,Aykut Erdem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 10 figures
Abstract:Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physically consistent camera dynamics. A key limitation lies not only in model capacity, but in the latent representations used to encode geometric structure. We propose S ^2 VAE, a geometry-first latent learning framework that focuses on compressing and representing the latent 3D state of a scene, including camera motion, depth, and point-level structure, rather than modeling appearance alone. Building on representations from a Visual Geometry Grounded Transformer (VGGT), we introduce a novel type of variational autoencoder using a product of Power Spherical latent distributions, explicitly enforcing hyperspherical structure in the bottleneck to preserve directional and geometric semantics under strong compression. Across depth estimation, camera pose recovery, and point cloud reconstruction, we show that geometry-aligned hyperspherical latents consistently outperform conventional Gaussian bottlenecks, particularly in high-compression regimes. Our results highlight latent geometry as a first-class design choice for physically grounded visual and world models.
[CV-16] FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
【速读】:该论文旨在解决现有基于学习的占据预测方法依赖大规模3D标注且在不同环境中泛化能力差的问题。其核心解决方案是提出一种无需训练(training-free)的开放词汇占据预测框架FreeOcc,关键在于通过四层流水线实现:首先利用SLAM后端估计相机位姿与稀疏几何结构;其次基于几何一致性更新构建稠密3D高斯地图;再将现成的视觉-语言模型(vision-language models, VLMs)得到的开放词汇语义信息关联至高斯原型;最后通过概率性的高斯到占据投影生成稠密体素占据图。该方法无需任何3D标注或相机位姿真值,亦无学习阶段,却能在EmbodiedOcc-ScanNet上实现IoU和mIoU超过两倍于先前自监督方法的性能,并在新环境中的零样本迁移表现显著优于监督与自监督基线。
链接: https://arxiv.org/abs/2604.28115
作者: Zeyu Jiang,Changqing Zhou,Xingxing Zuo,Changhao Chen
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); MBZUAI
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: RSS 2026
Abstract:Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over 2\times improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines. Project page: this https URL.
[CV-17] UHR-Net: An Uncertainty-Aware Hypergraph Refinement Network for Medical Image Segmentation
【速读】:该论文旨在解决医学图像中病灶(lesion)分割的两个关键挑战:一是病灶边界模糊、与周围组织相似导致的边界区域预测不稳定;二是小病灶特征在多尺度特征提取过程中被稀释,引发过分割或欠分割问题。解决方案的关键在于提出一种不确定性感知的超图精化网络(Uncertainty-Aware Hypergraph Refinement Network, UHR-Net),其核心创新包括:(1)设计了一种基于不确定性的实例对比预训练策略(Uncertainty-Oriented Instance Contrastive, UO-IC),通过几何感知的复制粘贴增强与病灶类背景区域的难负样本挖掘,提升对小尺寸及视觉模糊病灶的实例级区分能力;(2)构建了不确定性引导的超图精化模块(Uncertainty-Guided Hypergraph Refinement, UGHR),从粗粒度概率图中提取熵-based 不确定性图,用于指导超边原型按前景/背景分组,解耦高阶交互关系,从而改善模糊区域的分割精度。
链接: https://arxiv.org/abs/2604.28095
作者: Shuokun Cheng,Jinghao Shi,Kun Sun
机构: China University of Geosciences (Wuhan) (中国地质大学(武汉)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures, 7 tables
Abstract:Accurate lesion segmentation is crucial for clinical diagnosis and treatment planning. However, lesions often resemble surrounding tissues and exhibit ill-defined boundaries, leading to unstable predictions in boundary/transition regions. Moreover, small-lesion cues can be diluted by multi-scale feature extraction, causing under- or over-segmentation. To address these challenges, we propose an Uncertainty-Aware Hypergraph Refinement Network (UHR-Net). First, we introduce an Uncertainty-Oriented Instance Contrastive (UO-IC) pretraining strategy that couples geometry-aware copy-paste augmentation with hard-negative mining of lesion-like background regions to improve instance-level discrimination for small and visually ambiguous lesions. Second, we design an Uncertainty-Guided Hypergraph Refinement (UGHR) block, which derives an entropy-based uncertainty map from a coarse probability map to guide hypergraph refinement. By splitting hyperedge prototypes into foreground and background groups, UGHR decouples higher-order interactions and improves refinement in ambiguous regions. Experiments on five public benchmarks demonstrate consistent gains over strong baselines. Code is available at: this https URL.
[CV-18] AesRM: Improving Video Aesthetics with Expert-Level Feedback
【速读】:该论文旨在解决当前生成式视频模型在真实场景应用(如影视制作)中缺乏对视频美学质量的精细评估与优化问题,尤其是现有方法多局限于图像层面的粗粒度美学定义,难以满足影视级色彩协调性、光影氛围等复杂美学需求。其解决方案的关键在于提出一个分层的视频美学评价框架——AesVideo-Bench,将视频美学细分为视觉美学(Visual Aesthetics, VA)、视觉保真度(Visual Fidelity, VF)和视觉合理性(Visual Plausibility, VP)三个维度,并构建包含15个细粒度标准(如镜头构图)的专家标注偏好数据集;在此基础上开发了两种视频美学奖励模型(AesRM-Base 和 AesRM-CoT),通过三阶段渐进式训练策略(原子美学能力学习、冷启动结构化推理对齐、GRPO精调)显著提升评估准确性与可解释性,尤其AesRM-CoT引入基于自一致性机制的思维链(Chain-of-Thought, CoT)合成方法,增强了评估过程的透明度与实用性。
链接: https://arxiv.org/abs/2604.28078
作者: Yujin Han,Yujie Wei,Yefei He,Xinyu Liu,Tianle Li,Zichao Yu,Andi Han,Shiwei Zhang,Tingyu Weng,Difan Zou
机构: The University of Hong Kong (香港大学); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Peking University (北京大学); Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 14 figures, 12 tables
Abstract:Despite rapid advances in photorealistic video generation, real-world applications such as filmmaking require video aesthetics, e.g., harmonious colors and cinematic lighting, beyond visual fidelity. Prior work on visual aesthetics largely focuses on images, often reducing aesthetics to coarse definitions, e.g., visual pleasure, without a rigorous and systematic evaluation. To improve video aesthetics, we propose a hierarchical rubric that decomposes video aesthetics into three core dimensions, Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP), with 15 fine-grained criteria, e.g., shot composition. This framework enables a large-scale expert-annotated preference dataset and an evaluation benchmark, AesVideo-Bench, containing about 2500 video pairs with expert annotations on VA, VF, and VP. We then build a family of Video Aesthetic Reward Models (AesRM): AesRM-Base, which directly predicts pairwise preferences on these dimensions to provide efficient post-training rewards, and AesRM-CoT, which additionally generates CoT aligned with all 15 criteria to improve assessment interpretability. Specifically, we train AesRM with a three-stage progressive scheme: (1) Atomic Aesthetic Capability Learning, which strengthens AesRM’s recognition of fundamental aesthetic concepts, e.g., accurately identifying centered composition; (2) Cold-Start, aligning the model with structured reasoning protocols; and (3) GRPO, further improving evaluation accuracy. To enhance AesRM-CoT, we additionally propose self-consistency-based CoT synthesis to improve CoT quality and design CoT-based process rewards during GRPO. Extensive experiments show AesRM outperforms baselines on multiple aesthetics benchmarks and is more robust, with lower position bias. Finally, we align Wan2.2 with AesRM and observe clear aesthetic gains over existing aesthetic reward models.
[CV-19] 3D Reconstruction Techniques in the Manufacturing Domain: Applications Research Opportunities and Use Cases
【速读】:该论文旨在解决制造领域中三维(3D)重建技术缺乏统一框架的问题,当前研究分散于数据采集、点云生成、后处理及应用等不同环节,且存在深度学习方法与传统手段融合不足的瓶颈。其解决方案的关键在于通过系统性回顾106篇最新文献,对现有方法进行分类和归纳,并指出未来趋势将聚焦于混合系统(hybrid systems),即整合多种传感器类型与处理方法以克服单一技术在反射表面和动态环境中的局限性,从而提升重建精度与鲁棒性。
链接: https://arxiv.org/abs/2604.28064
作者: Chialoon Cheng(1),Kaijun liu(2),Zhiyang Liu(1),Marcelo H Ang Jr(1) ((1) Advanced Robotics Centre, National University of Singapore, Singapore (2) Independent Researcher)
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages
Abstract:This comprehensive review examines the evolution and the current state of the art in three-dimensional (3D) reconstruction techniques in manufacturing applications. The analysis covers both traditional approaches and emerging deep learning methods, showing a critical research gap in unified 3d reconstruction frameworks. Through systematic review of 106 recent publications, we classify reconstruction techniques into three primary categories: data acquisition, point cloud generation, post-processing and applications. Non-contact methods, particularly structured light scanning and stereo vision, have shown significant adoption in manufacturing, with 47% of surveyed applications focusing on quality inspection. The integration of deep learning has enhanced reconstruction accuracy and processing speed, particularly in feature extraction and matching. Key applications span design and development (13%), machining (8%), process (17%), assembly (22%), and quality inspection (40%). While current technologies achieve sub-millimeter accuracy in controlled environments, challenges persist in handling reflective surfaces and dynamic environments. Our findings indicate a trend toward hybrid systems combining multiple sensor types and processing methods to overcome individual limitations. This survey provides a structured framework for understanding current capabilities and future directions in manufacturing-focused 3D reconstruction.
[CV-20] AFA-GSGC: Group-wise Scalable Point Cloud Geometry Compression with Progressive Residual Refinement ICIP
【速读】:该论文旨在解决现有学习型点云几何压缩编码器在速率自适应传输中效率低下的问题,即大多数模型仅针对固定率-失真(Rate-Distortion, RD)点进行优化,导致速率调整需重新编码或维护多个比特流,计算与存储成本高昂。其解决方案的关键在于提出一种可扩展的 learned 点云几何编码框架 TAFA-GSGC,该框架通过分层残差精修(layered residual refinement)与通道组熵编码(channel-group entropy coding)相结合,并引入目标对齐特征聚合模块(Target-Aligned Feature Aggregation, TAF-A),有效降低增强残差中的跨层冗余,从而实现从单一比特流中解码出多达9个质量等级且质量单调提升的输出,同时保持优异的压缩效率。
链接: https://arxiv.org/abs/2604.28045
作者: Xiumei Li,Alexander Kopte,André Kaup
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE International Conference on Image Processing (ICIP) 2026
Abstract:Scalable compression is essential for bandwidth-adaptive transmission, yet most learned codecs are optimized for a fixed rate-distortion point, making rate adaptation costly due to re-encoding or maintaining multiple bitstreams. In this work, we propose TAFA-GSGC, a scalable learned point cloud geometry codec that enables multi-quality decoding from a single bitstream and a single trained model. TAFA-GSGC combines layered residual refinement with channel-group entropy coding, and introduces Target-Aligned Feature Aggregation module to reduce cross-layer redundancy in enhancement residuals. Our framework supports up to 9 decodable quality levels with monotonic quality improvement as more subbitstreams are received, while maintaining strong compression efficiency. Compared with the baseline PCGCv2, TAFA-GSGC attains comparable and slightly better RD performance, achieving average BD-Rate savings of -4.99% in D1 and -5.92% in D2.
[CV-21] ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss CVPR2026
【速读】:该论文旨在解决现有单图像人体网格重建(Single-image Human Mesh Recovery, HMR)方法在处理肢体缺失人群时性能显著下降的问题。传统模型依赖于完整肢体的拓扑结构先验(intact-limb prior),无法有效建模残肢(residual limb)的非标准解剖结构,导致重建精度低且与实际生物力学不匹配。解决方案的关键在于提出ResiHMR框架,其核心创新包括:(1) 一种拓扑自适应的残肢锚点因子优化模块(Topology-adaptive Residual Anchor-Factor Optimization),通过约束估计到观察到的解剖学有效运动链子图,实现对残肢拓扑的感知优化;(2) 一种基于几何的残肢重建模块(Geometry-based Residual-Limb Reconstruction),显式估计残肢边界及凸起的肢体终止几何结构。这两项机制共同实现了对残肢表面的显式重建和拓扑自适应优化,从而更准确地恢复肢体缺失个体的3D人体网格,并更好地契合假肢生物力学特性与真实应用场景。
链接: https://arxiv.org/abs/2604.28025
作者: Jiaying Ying,Heming Du,Kaihao Zhang,Sean M. Tweedy,Xin Yu
机构: The University of Queensland (昆士兰大学); Australian National University (澳大利亚国立大学); The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Highlight in CVPR 2026. Project at this https URL
Abstract:Single-image human mesh recovery provides a compact 3D, person-centric representation that supports analysis, animation, AR and VR, rehabilitation, and human-computer interaction. However, prevailing systems impose an intact-limb prior and degrade on people with limb loss, because fixed-topology models cannot represent residual limbs. In this work, we present ResiHMR, a residual-limb aware framework for single-image 3D human modeling. ResiHMR adopts residual-limb keypoints and introduces two components: (i) a topology-adaptive Residual Anchor-Factor Optimization module that constrains estimation to the observed kinematic subgraph of anatomically valid structures, and (ii) a geometry-based Residual-Limb Reconstruction module that estimates residual-limb boundaries and convex limb-termination geometry. These components introduce topology-aware optimization and explicit termination geometry as tools for human mesh recovery under non-standard limb anatomy. Unlike joint-removal methods in a fixed topology, ResiHMR explicitly reconstructs residual-limb surfaces and aligns optimization with limb-loss topology, which better matches prosthetic biomechanics and real-world use. To the best of our knowledge, this is the first single-image HMR system that explicitly reconstructs residual-limb surfaces and performs topology-adaptive optimization for individuals with limb loss. On a curated dataset of real-world images with limb loss, ResiHMR improves reconstruction quality under both SMPLify-X and HSMR backbones, reducing intact-joint 2D MPJPE from 41.32 to 37.40 with SMPLify-X and residual-limb 2D MPJPE from 73.61 to 23.19 with HSMR.
[CV-22] Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge
【速读】:该论文旨在解决当前深度伪造(DeepFake)检测模型在面对多模态数据中语义不一致问题时的局限性,即现有方法可能仅依赖数据源完整性进行判断,而忽视了内容层面的语义一致性。其关键解决方案是提出一种新的四类评估框架扩展,引入“真实音频-真实视频语义不匹配”(Real Audio-Real Video with Semantic Mismatch, RARV-SMM)类别,以显式建模跨模态语义不一致,并通过FakeAVCeleb数据集验证现有先进模型在此设定下的鲁棒性不足;同时设计了一种基于ImageBind嵌入的语义强化策略,有效提升检测性能,推动更贴近现实场景的DeepFake检测系统发展。
链接: https://arxiv.org/abs/2604.28022
作者: Sharayu Nilesh Deshmukh,Kailash A. Hambarde,Joana C. Costa,Hugo Proença,Tiago Roxo
机构: Instituto de Telecomunicações, Universidade da Beira Interior(贝拉内斯特大学电信研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IJCB 2026
Abstract:Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, whose variability is not captured in binary settings. Four-class audio-visual formulations address this by discriminating manipulation type, but introduce a unresolved problem: models may rely solely on data source integrity to detect DeepFakes without evaluating their semantic consistency. If the DeepFake origin is not in the data source but in its content, can semantic mismatch be assessed by the state-of-the-art? This paper proposes a new evaluation setup, extending the four-class formulation by explicitly modeling semantic-level inconsistency between authentic modalities with the introduction a new class: Real Audio-Real Video with Semantic Mismatch (RARV-SMM). We assess the robustness of state-of-the-art models in this new realistic DeepFake setting, using the FakeAVCeleb dataset, highlighting the limitations of existing approaches when faced with semantic mismatch data. We further introduce three RARV-SMM variants that expose distinct architectural vulnerabilities as audio-visual divergence increases. We also propose a semantic reinforcement strategy that incorporates the semantic mismatch class and ImageBind embeddings to improve DeepFake detection in both our proposed and state-of-the-art settings, on FakeAVCeleb and LAV-DF, paving the way to more realistic DeepFake detectors. The source code and data are available at this https URL.
[CV-23] Faster 3D Gaussian Splatting Convergence via Structure-Aware Densification SIGGRAPH2026
【速读】:该论文旨在解决3D Gaussian Splatting中标准自适应密度控制方法存在的局限性问题,即其依赖屏幕空间位置梯度进行密度调整时,无法区分几何错位与频率混叠(frequency aliasing),导致高频纹理过度模糊或冗余过密。解决方案的关键在于提出一种结构感知的稠密化框架,核心创新是引入一个基于多尺度频率分析的 per-Gaussian、per-axis 频率违规度量 η,该度量通过结合结构张量与拉普拉斯尺度空间分析来估计每个像素处的主导频率,从而明确指示高斯原语是否未能解析局部纹理细节。在此基础上,采用各向异性分裂策略——仅对 η 值较高的轴方向计算分裂因子以更精准地匹配局部频率内容,并结合多视角一致性准则聚合 η 观测值,实现早期且高效的稠密化过程,显著加快收敛速度并提升重建质量,尤其在高频区域表现优异。
链接: https://arxiv.org/abs/2604.28016
作者: Linjie Lyu,Ayush Tewari,Jianchun Chen,Thomas Leimkühler,Christian Theobalt
机构: Max-Planck-Institut für Informatik (马克斯·普朗克计算机科学研究所); Cambridge University (剑桥大学); Saarbrücken Research Center for Visual Computing, Interaction, and Artificial Intelligence (VIA) (萨尔布吕肯视觉计算、交互与人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Siggraph 2026
Abstract:3D Gaussian Splatting has emerged as a powerful scene representation for real-time novel-view synthesis. However, its standard adaptive density control relies on screen-space positional gradients, which do not distinguish between geometric misplacement and frequency aliasing, often leading to either over-blurred high-frequency textures or inefficient over-densification. We present a structure-aware densification framework. Our key insight is that the decision to subdivide a Gaussian should be driven by an explicit comparison between its projected screen-space extent and the local structure of the texture it seeks to represent. We introduce a multi-scale frequency analysis combining structure tensors with Laplacian scale space analysis to estimate the dominant frequency at each pixel, enabling robust supervision across varying texture scales. Based on this analysis, we define \eta , a per-Gaussian, per-axis frequency violation metric that indicates when a primitive may be under-resolving local texture details. Unlike methods that perform isotropic splitting (e.g., splitting each Gaussian into two smaller ones with uniform shape), our approach performs anisotropic splitting. For each axis with high \eta , we compute a split factor to better resolve the local frequency content. We further introduce a multiview consistency criterion that aggregates \eta observations across multiple views. By performing densification early and faster, we skip the lengthy iterative densification phases required by baseline methods and achieve significantly faster convergence. Experiments on standard benchmarks demonstrate that our method also achieves superior reconstruction quality, particularly in high-frequency regions.
[CV-24] Echo-α: Large Agent ic Multimodal Reasoning Model for Ultrasound Interpretation
【速读】:该论文旨在解决超声图像解读中长期存在的两个关键问题:一是现有方法难以同时实现精确的病灶定位(lesion localization)与全面的临床推理(clinical reasoning),二是专用检测器虽能提供强定位能力但缺乏推理灵活性,而多模态大语言模型(Multimodal Large Language Models, MLLMs)虽具备灵活推理能力却在医学专业领域缺乏可靠的视觉锚定(grounding)。解决方案的核心在于提出 Echo-α,一种基于“调用与推理”(invoke-and-reason)框架的代理式多模态推理模型,通过协调器官特异性检测器输出、融合全局视觉上下文,并将证据转化为可解释的诊断决策,从而统一定位与推理能力。该模型首先通过九任务监督课程训练建立基础能力,再利用不同奖励权衡的强化学习进行迭代优化,最终形成两个子模型:Echo-α-Grounding(用于病灶锚定)和 Echo-α-Diagnosis(用于最终诊断),在跨中心肾和乳腺超声数据集上均显著优于基线方法,验证了其在准确性、可解释性和迁移性方面的优势。
链接: https://arxiv.org/abs/2604.28011
作者: Jing Zhang,Wentao Jiang,Tao Huang,Zhiwei Wang,Jianxin Liu,Jian Chen,Ping Ye,Gang Wang,Zengmao Wang,Bo Du,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures. Technical report
Abstract:Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer strong localization but limited reasoning, whereas multimodal large language models (MLLMs) provide flexible reasoning but weak grounding in specialized medical domains. We present Echo-\alpha, an agentic multimodal reasoning model for ultrasound interpretation that unifies these strengths within an invoke-and-reason framework. Echo-\alpha is trained to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the resulting evidence into grounded diagnostic decisions beyond detector-only inference. This behavior is established through a nine-task supervised curriculum and then refined by sequential reinforcement learning under different reward trade-offs, yielding Echo-\alpha-Grounding for lesion anchoring and Echo-\alpha-Diagnosis for final diagnosis. On multi-center renal and breast ultrasound benchmarks, Echo-\alpha outperforms competitive baselines on both grounding and diagnosis. In particular, on cross-center test sets, Echo-\alpha-Grounding attains 56.73%/43.78% F1@0.5 and Echo- \alpha-Diagnosis reaches 74.90%/49.20% overall accuracy on renal/breast ultrasound. These results suggest that agentic multimodal reasoning can turn specialized detectors into verifiable clinical evidence, offering a practical route toward ultrasound AI systems that are more accurate, interpretable, and transferable. The repository is at this https URL.
[CV-25] ransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions WWW
【速读】:该论文旨在解决传统镜头边界检测(Shot Boundary Detection, SBD)方法在处理复杂转场时的固有局限性,即其将任务建模为孤立的切点搜索,导致频繁产生被污染的视频片段。为此,作者提出将SBD任务重新定义为镜头转场检测(Shot Transition Detection, STD),明确识别连续的时间段而非模糊的切点。解决方案的关键在于提出TransVLM框架——一种基于视觉-语言模型(Vision-Language Model, VLM)的STD方法,通过在输入阶段显式注入光流(optical flow)作为关键运动先验,并采用简单有效的特征融合策略直接处理拼接的颜色与运动表征,显著提升时序感知能力且不增加语言骨干网络的视觉token开销;同时设计可扩展的数据引擎以合成多样化的转场视频并缓解数据集中的严重类别不平衡问题,从而实现更鲁棒的训练和优于传统启发式方法、专用时空网络及顶级VLMs的性能表现。
链接: https://arxiv.org/abs/2604.27975
作者: Ce Chen,Yi Ren,Yuanming Li,Viktor Goriachko,Zhenhui Ye,Zujin Guo,Zhibin Hong,Mingming Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work has been deployed to production. For more related research, please visit HeyGen Research ( this https URL ) and HeyGen Avatar-V ( this https URL ). Project page: this https URL
Abstract:Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs. This work has been deployed to production. For more related research, please visit HeyGen Research (this https URL) and HeyGen Avatar-V (this https URL). Project page: this https URL
[CV-26] FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting
【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在细粒度、状态条件化的图形用户界面(GUI)交互中表现不足的问题,当前评估方法存在覆盖范围有限、目标状态定义模糊以及过度依赖最终任务成功率等问题,难以准确识别代理失败的具体环节。其解决方案的关键在于提出 FineState-Bench 基准测试平台与 FineState-Metrics 诊断流程:FineState-Bench 包含跨桌面、网页和移动平台的 2,209 个实例,明确指定精确的目标状态以支持细粒度评估;FineState-Metrics 提出四阶段成功指标(定位成功率 SR@Loc、交互成功率 SR@Int、定位时精确状态成功率 ES-SR@Loc 和交互时精确状态成功率 ES-SR@Int),并引入可插拔的 Visual Diagnostic Assistant (VDA) 生成描述与边界框提示,通过有无提示的对照实验诊断视觉定位错误原因。实验表明,即使使用 VDA 提示,最高 ES-SR@Int 仍仅为 32.8%(网页端),说明现有模型在精细状态条件交互上仍有显著提升空间。
链接: https://arxiv.org/abs/2604.27974
作者: Fengxian Ji,Jingpu Yang,Zirui Song,Yuanxi Wang,Zhexuan Cui,Yuke Li,Qian Jiang,Xiuying Chen
机构: MBZUAI(阿联酋穆巴达拉人工智能研究所); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:
Abstract:Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and why agents fail. To address this gap, we introduce \textbfFineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, with each instance explicitly specifying an exact target state for fine-grained state setting. We further propose \textitFineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int), and a plug-and-play \textitVisual Diagnostic Assistant (VDA) that generates a Description and a bounding-box Localization Hint to diagnose visual grounding reason via controlled w/ vs.\ w/o comparisons. On FineState-Bench, exact goal-state success remains low: ES-SR@Int peaks at 32.8% on Web and 22.8% on average across platforms. With VDA localization hints, Gemini-2.5-Flash gains +14.9 ES-SR@Int points, suggesting substantial headroom from improved visual grounding, yet overall accuracy is still insufficient for reliable fine-grained state-conditioned interaction \hrefthis https URLGithub.
[CV-27] ClimateVID – Social Media Videos Analysis and Challenges Involved
【速读】:该论文旨在解决社交媒体平台上短视频内容中视觉主题自动检测的问题,尤其关注如何利用视觉语言模型(Visual Language Models, VLMs)和无监督聚类方法从海量非结构化图像帧中提取有意义的视觉模式。其解决方案的关键在于:(1) 通过零样本图像分类评估多个主流VLMs(如VideoChatGPT、PandaGPT和VideoLLaVA)在社会媒体数据上的表现,并与基于CLIP的逐帧分类基线进行对比;(2) 将聚类任务建模为最小成本多切分(minimum cost multicut)问题,从而在无需标签的情况下发现具有区分度的视觉帧簇。研究发现,尽管当前VLMs尚无法准确识别气候变化相关类别,但ConvNeXt V2与DINOv2两种图像嵌入模型仍能生成语义合理的聚类结果,其中DINOv2更侧重于风格差异与抽象类别,而ConvNeXt V2则呈现更细粒度的区分能力。
链接: https://arxiv.org/abs/2604.27968
作者: Shiqi Xu,Moritz Burmester,Katharina Prasse,Isaac Bravo,Stefanie Walter,Margret Keuper
机构: University of Mannheim; Max-Planck-Institute for Informatics, Saarland Informatics Campus; Technical University of Munich
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Equal contributions by Shiqi Xu and Moritz Burmester
Abstract:The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we advance automated visual theme detection by assessing zero-shot and clustering capabilities on social media data. (1) We evaluated the capabilities of notable VLMs such as VideoChatGPT, PandaGPT, and VideoLLava using zero-shot image classification and compared their performance to the baseline provided by frame-wise CLIP image classification. (2) By treating clustering as a minimum cost multicut problem, we aim to uncover insightful patterns in an unsupervised manner. For both analysis strategies, we provide extensive evaluations and practical guidance to practitioners. While VLMs are currently not able to detect climate change specific classes, the clustering results are distinct visual frames. %Given that VLMs are not currently capable to grasp the climate change discourse, we focus the clustering evaluation of image embedding models. We find that both ConvNeXt V2 and DINOv2 produce meaningful clusters, with DINOv2 focusing more on style differences and abstract categories, while ConvNeXt V2 clusters differ in more fine-grained ways. Code available at this https URL.
[CV-28] ripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On
【速读】:该论文旨在解决视频虚拟试衣(Video Virtual Try-On)模型在真实场景下性能受限的问题,主要瓶颈在于缺乏大规模、多样化的野外(in-the-wild)三元组数据以及不恰当的服装掩码使用方式。解决方案的关键在于:首先构建了目前最大且最多样化的野外三元组数据集 TripVVT-10K,提供显式的视频级跨服装监督信号;其次提出基于 Diffusion Transformer 的框架 TripVVT,以稳定的人体掩码先验替代易失效的服装掩码,从而在复杂运动、遮挡和杂乱场景中保持背景一致性与时间稳定性;同时建立了涵盖多服装类型、复杂环境及多人场景的基准 TripVVT-Bench,实现对视频质量、试穿保真度、背景一致性和时序连贯性的全面评估。
链接: https://arxiv.org/abs/2604.27958
作者: Dingbao Shao,Song Wu,Shenyi Wang,Ye Wang,Ziheng Tang,Fei Liu,Jiang Lin,Xinyu Chen,Qian Wang,Ying Tai,Jian Yang,Zili Yi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Due to the scarcity of large-scale in-the-wild triplet data and the improper use of masks, the performance of video virtual try-on models remains limited. In this paper, we first introduce TripVVT-10K, the largest and most diverse in-the-wild triplet dataset to date, providing explicit video-level cross-garment supervision that existing video datasets lack. Built upon this resource, we develop TripVVT, a Diffusion Transformer-based framework that replaces fragile garment masks with a simple, stable human-mask prior, enabling reliable background preservation while remaining robust to real-world motion, occlusion, and cluttered scenes. To support comprehensive evaluation, we further establish TripVVT-Bench, a 100-case benchmark covering diverse garments, complex environments, and multi-person scenarios, with metrics spanning video quality, try-on fidelity, background consistency, and temporal coherence. Compared to state-of-the-art academic and commercial systems, TripVVT achieves superior video quality and garment fidelity while markedly improving generalization to challenging in-the-wild videos. We publicly release the dataset and benchmark, which we believe provide a solid foundation for advancing controllable, realistic, and temporally stable video virtual try-on.
[CV-29] GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
【速读】:该论文旨在解决当前图形用户界面(GUI)智能体在长时程信用分配、分布偏移以及不可逆环境中的安全探索等挑战,这些问题限制了监督微调方法的有效性。其核心解决方案在于引入强化学习(Reinforcement Learning, RL)作为关键驱动机制,并构建了一个系统的分类体系,将现有方法划分为离线RL、在线RL与混合策略三类;同时强调通过奖励工程优化、数据效率提升及关键技术突破来推动GUI智能体向数字居民(digital inhabitants)演进。其中,关键创新点包括:采用多层复合奖励架构以平衡可靠性与可扩展性、基于世界模型(world-model-based training)缓解GUI输入/输出延迟瓶颈,以及发现显式推理监督并非必要——当奖励信号足够丰富时,系统会自发涌现出类似System-2的深思型决策行为。
链接: https://arxiv.org/abs/2604.27955
作者: Junan Hu,Jian Liu,Jingxiang Lai,Jiarui Hu,Yiwei Sheng,Shuang Chen,Jian Li,Dazhao Du,Song Guo
机构: Shandong University (山东大学); The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University (香港大学); Shanghai Jiao Tong University (上海交通大学); Tencent (腾讯)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi-tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world-model-based training, which can yield substantial performance gains; and the spontaneous emergence of System-2-style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent-native infrastructure.
[CV-30] he Effects of Visual Priming on Cooperative Behavior in Vision-Language Models
【速读】:该论文旨在解决视觉输入对视觉-语言模型(Vision-Language Models, VLMs)决策行为的影响机制问题,特别是在协作场景下,如重复囚徒困境(Iterated Prisoner’s Dilemma, IPD)中,图像内容和颜色线索如何引导VLM的行为倾向。其解决方案的关键在于系统性地评估不同视觉刺激(如描绘友善与攻击性概念的图像及色彩编码奖励矩阵)对多款先进VLM决策模式的影响,并探索通过提示修改、思维链(Chain of Thought, CoT)推理和视觉token缩减等策略进行行为调控的有效性。研究发现,VLM的行为易受视觉输入影响,且不同模型在敏感性和缓解策略效果上存在差异,揭示了架构与训练差异对行为响应多样性的重要作用。
链接: https://arxiv.org/abs/2604.27953
作者: Kenneth J. K. Ong
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs’ cooperative behavior using the Iterated Prisoner’s Dilemma (IPD) as a test scenario. We examine whether exposure to images depicting behavioral concepts (kindness/helpfulness vs. aggressiveness/selfishness) and color-coded reward matrices alters VLM decision patterns. Experiments were conducted across multiple state-of-the-art VLMs. We further explore mitigation strategies including prompt modifications, Chain of Thought (CoT) reasoning, and visual token reduction. Results show that VLM behavior can be influenced by both image content and color cues, with varying susceptibility and mitigation effectiveness across models. These findings not only underscore the importance of robust evaluation frameworks for VLM deployment in visually rich and safety-critical environments, but also highlight how architectural and training differences among models may lead to distinct behavioral responses-an area worthy of further investigation.
[CV-31] Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training
【速读】:该论文旨在解决视觉语言模型(Vision-Language Model, VLM)预训练过程中因数据采样导致的长尾概念(long-tail concepts)表示不足问题。现有高效预训练方法虽能降低计算成本,但常过度移除稀有语义类别,使长尾概念在训练中难以被有效捕获。解决方案的关键在于提出一种动态聚类采样方法(DynamiCS),该方法在每个训练轮次(epoch)动态调整数据分布:对大规模语义簇进行下采样,对小规模簇进行上采样,从而保持语义簇间的相对顺序并强化长尾信息的代表性。此策略区别于以往仅追求语义分布平坦化的做法,实验证明其在显著降低计算开销的同时提升了长尾概念的性能表现。
链接: https://arxiv.org/abs/2604.27932
作者: Mingliang Liang,Zhuoran Liu,Arjen P. de Vries,Martha Larson
机构: Radboud University (奈梅亨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, \emphlong-tail concepts remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a \emphdynamic cluster-based sampling approach (DynamiCS) that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it applies sampling at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long-tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts.
[CV-32] raining-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction
【速读】:该论文旨在解决隧道检测中现有无训练基础模型(training-free foundation-model)方法仅能生成粗粒度开放词汇提议(open-vocabulary proposals),难以在干扰密集的隧道场景中直接用于缺陷定位与工程评估的问题。其解决方案的关键在于:不将语言引导的缺陷提议视为最终输出,而是在推理阶段通过密集视觉一致性(dense visual consistency)重新校准其空间支持,从而将粗略语义锚点转化为在隧道特定硬负样本下更可靠的提示;进一步地,将所得掩码重构为包含类别、位置、几何、严重程度和上下文属性的结构化缺陷实体,并在专家知识约束下映射至检索增强解释与工程可读报告生成,实现从粗略定位向结构化缺陷证据的跃迁。
链接: https://arxiv.org/abs/2604.27928
作者: Shipeng Liu,Liang Zhao,Dengfeng Chen,Zhanping Song
机构: XAUAT (西安科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Tunnel inspection requires outputs that can support defect localization, measurement, severity grading, and engineering documentation. Existing training-free foundation-model pipelines usually stop at coarse open-vocabulary proposals, which are difficult to use directly in interference-heavy tunnel scenes. We propose a training-free framework TunnelMIND. Specifically, language-guided defect proposals are not treated as final outputs; instead, their spatial support is recalibrated at inference time through dense visual consistency, so that coarse semantic anchors can be transformed into more reliable prompts under tunnel-specific hard negatives. The resulting masks are further reconstructed into structured defect entities with category, location, geometry, severity, and context attributes, which are then mapped to retrieval-grounded explanation and engineering-readable report generation under expert knowledge constraints. On visible, GPR, and road defect tasks, TunnelMIND achieves F1 scores of 0.68, 0.78, and 0.72, respectively. Overall, TunnelMIND shows that training-free tunnel inspection can move beyond coarse localization toward structured defect evidence for engineering assessment.
[CV-33] Generate Your Talking Avatar from Video Reference
【速读】:该论文旨在解决现有说话头像(Talking Avatar)生成方法依赖单一场景静态参考图像所带来的局限性,此类方法因缺乏充分的时间动态信息和表情线索,难以在自定义背景中合成高保真度的说话头像。解决方案的关键在于提出一种全新的框架TAVR(Talking Avatar generation from Video Reference),其核心创新是引入跨场景视频输入以扩展时间上下文并缓解域间差异问题;具体包括:1)通过token选择模块有效处理长时视频内容;2)采用三阶段训练策略——同场景视频预训练建立基础外观复制能力,跨场景参考微调增强鲁棒性,任务特定强化学习优化身份一致性;3)构建包含158对精心设计的跨场景视频对的新基准用于系统评估。实验表明,TAVR在推理时支持灵活的视频参考,且在定量与定性指标上均显著优于现有基线方法。
链接: https://arxiv.org/abs/2604.27918
作者: Zujin Guo,Zhenhui Ye,Yi Ren,Yuanming Li,Ce Chen,Zhibin Hong,Chen Change Loy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For more related research, please visit \hrefthis https URLHeyGen Research and \hrefthis https URLHeyGen Avatar-V.
[CV-34] HiMix: Hierarchical Artifact-aware Mixup for Generalized Synthetic Image Detection
【速读】:该论文旨在解决合成图像检测(Synthetic Image Detection, SID)中现有检测模型泛化能力差的问题,特别是当训练数据受限且存在偏差时,模型难以识别未见过的生成器所产的伪造图像。解决方案的关键在于提出一种统一框架HiMix,其核心创新包括两个模块:一是基于Mixup的分布增强(Mixup-driven Distributional Augmentation, MDA),通过构建真实与虚假图像之间的连续过渡样本,扩展训练分布并提升对低置信度区域的覆盖,同时利用像素级Mixup平滑扰动语义信息以增强对低层伪造痕迹的敏感性;二是分层artifact感知表示(Hierarchical Artifact-aware Representation, HAR),通过跨层融合与粗到细的特征聚合,从全局和局部层面提取判别性伪造表征,从而在多样分布下实现更鲁棒的检测性能。
链接: https://arxiv.org/abs/2604.27903
作者: Shuchang Zhou,Kaiwen Shen,Jiwei Wei,Yuyang Zhou,Peng Wang,Yang Yang
机构: University of Electronic Science and Technology of China (电子科技大学); Hainan University (海南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid evolution of generative models has enabled the creation of highly realistic and diverse synthetic images, posing significant challenges to reliable and generalizable Synthetic Image Detection (SID). However, existing detectors are typically trained on limited and biased datasets, resulting in poor generalization to unseen generators. To address this issue, we propose HiMix, a unified framework that enhances generalization by expanding the training distribution and promoting artifact-aware representations. Specifically, the Mixup-driven Distributional Augmentation (MDA) module constructs continuous transitional samples between real and fake images, improving coverage of low-confidence regions and exposing the model to more challenging samples, while the pixel-wise mixup operation smoothly perturbs semantics to enhance sensitivity to low-level artifacts. Moreover, the Hierarchical Artifact-aware Representation (HAR) module aggregates artifact information from both global and local levels through cross-layer integration and coarse-to-fine feature fusion, enabling the extraction of discriminative forgery representations under diverse distributions. Extensive experiments across multiple benchmarks demonstrate that HiMix achieves state-of-the-art performance, establishing well-separated logits for improved generalization to unseen forgeries.
[CV-35] Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection
【速读】:该论文旨在解决遥感图像中语义分割(Semantic Segmentation, SS)与变化检测(Change Detection, CD)任务中存在的挑战,包括模型对时间不一致性敏感、难以捕捉细粒度空间结构、预训练成本高以及可解释性差等问题。其解决方案的关键在于提出Noise2Map框架,该框架基于扩散模型的去噪过程,通过设计任务特定的噪声调度策略(noise schedule)和时间步条件控制(timestep conditioning),直接预测语义图或变化图,从而避免传统扩散模型昂贵的采样步骤;同时采用自监督去噪预训练与监督微调相结合的方式,在共享主干网络基础上实现多任务学习,显著提升了模型在真实遥感场景下的性能、鲁棒性和可解释性。
链接: https://arxiv.org/abs/2604.27889
作者: Ali Shibli,Andrea Nascetti,Yifang Ban
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation and change detection are two fundamental challenges in remote sensing, requiring models to capture either spatial semantics or temporal differences from satellite imagery. Existing deep learning models often struggle with temporal inconsistencies or in capturing fine-grained spatial structures, require extensive pretraining, and offer limited interpretability - especially in real-world remote sensing scenarios. Recent advances in diffusion models show that Gaussian noise can be systematically leveraged to learn expressive data representations through denoising. Motivated by this, we investigate whether the noise process in diffusion models can be effectively utilized for discriminative tasks. We propose Noise2Map, a unified diffusion-based framework that repurposes the denoising process for fast, end-to-end discriminative learning. Unlike prior work that uses diffusion only for generation or feature extraction, Noise2Map directly predicts semantic or change maps using task-specific noise schedules and timestep conditioning, avoiding the costly sampling procedures of traditional diffusion models. The model is pretrained via self-supervised denoising and fine-tuned with supervision, enabling both interpretability and robustness. Our architecture supports both tasks (SS and CD) through a shared backbone and task-specific noise schedulers. Extensive evaluations on the SpaceNet7, WHU, and xView2 buildings damaged by wildfires datasets demonstrate that Noise2Map ranks on average 1st among seven models on semantic segmentation and 1st on change detection by a cross-dataset rank metric (average F1 primary, IoU tie-break). Ablation studies highlight the robustness of our model against different training noise schedulers and timestep control in the diffusion process, as well as the ability of the model to perform multi-task learning.
[CV-36] Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection
【速读】:该论文旨在解决生成式 AI (Generative AI) 生成图像的检测模型在面对未见过的生成模型时泛化能力不足的问题。现有方法虽融合了视觉基础模型(Vision Foundation Models, VFMs)的语义信息与基于频率的方法所提取的伪影特征,但仍因两个关键因素导致性能下降:一是频率上的“捷径偏置”(frequency shortcut bias),即模型过度依赖特定生成器易区分的低级特征;二是高层语义与低频模式之间的跨域表示冲突。解决方案的核心在于提出一种频率感知门控注入网络(Frequency-aware Gated Injection Network, FGINet),其包含三个关键技术:1)带频带掩码的频率编码器(Band-Masked Frequency Encoder, BMFE),通过跨频带掩码减少对生成器特异性模式的依赖;2)逐层门控频率注入机制(Layer-wise Gated Frequency Injection, LGFI),自适应地将频率线索逐步注入 VFM 主干网络,缓解层级抽象过程中的表示冲突;3)超球面紧凑性学习框架(Hyperspherical Compactness Learning, HCL),采用余弦边界目标优化特征分布,增强类别内紧凑性和类间分离度。实验表明,FGINet 在多个挑战性数据集上实现了最先进的检测性能和强泛化能力。
链接: https://arxiv.org/abs/2604.27875
作者: Shuchang Zhou,Shangkun Wu,Jiwei Wei,Ke Liu,Ran Ran,Caiyan Qin,Yang Yang
机构: University of Electronic Science and Technology of China (电子科技大学); Harbin Institute of Technology Shenzhen (哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:AI-generated images are becoming increasingly realistic and diverse, posing significant challenges for generalizable detection. While Vision Foundation Models (VFMs) provide rich semantic representations and frequency-based methods capture complementary artifact cues, existing approaches that combine these modalities still suffer from limited generalization, with notable performance degradation on unseen generative models. We attribute this limitation to two key factors: frequency shortcut bias toward easily distinguishable cues associated with specific generators and cross-domain representation conflict between high-level semantics and low-level frequency patterns. To address these issues, we propose a Frequency-aware Gated Injection Network (FGINet) to improve generalization. Specifically, we design a Band-Masked Frequency Encoder (BMFE) that applies cross-band masking in the frequency domain to reduce reliance on generator-specific patterns and encourage more diverse and generalizable representations. We further introduce a Layer-wise Gated Frequency Injection (LGFI) mechanism to progressively inject frequency cues into the VFM backbone with adaptive gating, aligning with its hierarchical abstraction and alleviating representation conflict. Moreover, we propose a Hyperspherical Compactness Learning (HCL) framework with a cosine margin objective to learn compact and well-separated representations. Extensive experiments demonstrate that FGINet achieves state-of-the-art performance and strong generalization across multiple challenging datasets.
[CV-37] Parameter-Efficient Architectural Modifications for Translation-Invariant CNNs
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在实际应用中对空间位移敏感的问题,即尽管CNN通常被认为具有平移不变性(translation-invariance),但其标准架构由于依赖空间相关的全连接层,在仅发生单像素位移时仍会导致性能显著下降。解决方案的关键在于提出一种轻量级的“在线架构”(Online Architecture)策略:通过在不同网络深度处战略性地插入全局平均池化(Global Average Pooling, GAP)层,有效将特征识别与空间位置解耦。这一修改不仅大幅减少模型参数(VGG-16从5.2M降至82K)和整体网络规模(138M降至14M),还在ImageNet上保持了高Top-1准确率(66.4%),同时将翻译鲁棒性提升一倍,平均相对损失从0.09降至0.05。
链接: https://arxiv.org/abs/2604.27870
作者: Nuria Alabau-Bosque,Jorge Vila-Tomas,Paula Dauden-Oliver,Valero Laparra,Jesus Malo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 16 figures
Abstract:Convolutional Neural Networks (CNNs) are widely assumed to be translation-invariant, yet standard architectures exhibit a startling fragility: even a single-pixel shift can drastically degrade performance due to their reliance on spatially dependent fully connected layers. In this work, we resolve this vulnerability by proposing a lightweight ‘Online Architecture’ strategy. By strategically inserting Global Average Pooling (GAP) layers at various network depths, we effectively decouple feature recognition from spatial location. Using VGG-16 as a primary case study, we demonstrate that this architectural modification achieves a massive 98% reduction in trainable parameters (from 5.2M to just 82K) and a 90% reduction in total network size (138M to 14M). Despite this drastic pruning, our variants maintain competitive Top-1 accuracy on ImageNet (66.4%) while doubling translational robustness, reducing average relative loss from 0.09 to 0.05. Furthermore, our analysis identifies a fundamental limit to invariance: while GAP resolves macroscopic sensitivity, discrete pooling operations introduce a residual periodic aliasing that prevents perfect pixel-level stability. Finally, we extend these findings to Perceptual Image Quality Assessment (IQA) by integrating our invariant backbones into the LPIPS framework. The resulting metric significantly outperforms the retrained baseline in generalization across the KADID-10k dataset (Spearman 0.89 vs. 0.75) and achieves a near-perfect alignment with human psychophysical response curves on the RAID dataset (Spearman 0.95). These results confirm that enforcing architectural invariance is a far more efficient and biologically plausible path to robustness than traditional data augmentation. Data and code are publicly available. The data and code are publicly available to facilitate validation and further research.
[CV-38] aming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning CVPR2026
【速读】:该论文旨在解决原型感知的个性化联邦学习(Prototype-based Personalized Federated Learning, ProtoPFL)中因直接共享类原型(class prototypes)所引发的隐私泄露问题。现有方法通常采用ℓ₂裁剪结合各向同性高斯噪声注入(Isotropic Gaussian Prototype Perturbation, IGPP)以实现局部差分隐私(Local Differential Privacy, LDP),但此类方法存在过度扰动判别性强维度、难以在裁剪阈值与表征保真度之间取得平衡的问题。解决方案的关键在于提出VPDR(Variance-adaptive Prototype Perturbation with Distillation-guided Clipping Regularization),其包含两个核心机制:一是基于维度级类方差自适应分配噪声的变方差原型扰动(Variance-adaptive Prototype Perturbation, VPP),通过减少对判别子空间的扰动来保留语义可分性;二是蒸馏引导的裁剪正则化(Distillation-guided Clipping Regularization, DCR),使特征范数自适应地集中在预设裁剪阈值附近,同时保持预测一致性。理论分析表明,VPDR在相同隐私约束下提供不低于基线的隐私保障,实验验证其在多域基准上显著优于IGPP,在隐私-效用权衡上表现更优且具备抗现实攻击鲁棒性。
链接: https://arxiv.org/abs/2604.27833
作者: Yuhua Wang,Qinnan Zhang,Xiaodong Li,Huan Zhang,Yifan Sun,Wangjie Qiu,Hainan Zhang,Yongxin Tong,Zhiming Zheng
机构: Beihang University (北京航空航天大学); Renmin University of China (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by CVPR 2026 (Highlight)
Abstract:Prototype-based Personalized Federated Learning (ProtoPFL) enables efficient multi-domain adaptation by communicating compact class prototypes, but directly sharing them poses privacy risks. A common defense involves per-example \ell_2 clipping before prototype computation to bound sensitivity, followed by isotropic Gaussian noise to enforce Local Differential Privacy (LDP). However, Isotropic Gaussian Prototype Perturbation (IGPP) typically over-perturbs discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. In this paper, we propose VPDR, a client-side privacy plug-in that seamlessly integrates into existing ProtoPFLs. Motivated by the observation that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which allocates less noise to discriminative subspaces, preserving semantic separability while ensuring privacy. We further develop Distillation-guided Clipping Regularization (DCR), which enables feature norms to adaptively concentrate near the predefined clipping threshold while maintaining prediction consistency. Theoretical analysis shows that our groupwise mechanism provides privacy guarantees no weaker than the isotropic baseline under the same privacy constraints. Extensive experiments on multi-domain benchmarks demonstrate that VPDR achieves a superior privacy-utility trade-off, outperforming IGPP in personalized federated fine-tuning without sacrificing robustness against realistic attacks.
[CV-39] Machine Unlearning for Class Removal through SISA-based Deep Neural Network Architectures
【速读】:该论文旨在解决生成式 AI(Generative AI)模型在训练过程中使用用户私有数据后,当用户请求删除其数据时,如何实现高效且不需完全重新训练即可从模型中移除特定类别的数据的问题。解决方案的关键在于提出一种改进的 SISA(Sharded, Isolated, Sliced, and Aggregated)框架,该框架通过引入强化回放机制(reinforced replay mechanism)和门控网络(gating network),实现了对 CNN 架构中特定类别数据的定向遗忘(class-level unlearning),从而在保持模型整体性能的同时显著降低重训练开销。
链接: https://arxiv.org/abs/2604.27804
作者: Ishrak Hamim Mahi,Siam Ferdous,Md Sakib Sadman Badhon,Nabid Hasan Omi,Md Habibun Nabi Hemel,Farig Yousuf Sadeque,Md. Tanzim Reza
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 10 pages, 9 figures, 2 tables
Abstract:The rapid proliferation of image generation models and other artificial intelligence (AI) systems has intensified concerns regarding data privacy and user consent. As the availability of public datasets declines, major technology companies increasingly rely on proprietary or private user data for model training, raising ethical and legal challenges when users request the deletion of their data after it has influenced a trained model. Machine unlearning seeks to address this issue by enabling the removal of specific data from models without complete retraining. This study investigates a modified SISA (Sharded, Isolated, Sliced, and Aggregated) framework designed to achieve class-level unlearning in Convolutional Neural Network (CNN) architectures. The proposed framework incorporates a reinforced replay mechanism and a gating network to enhance selective forgetting efficiency. Experimental evaluations across multiple image datasets and CNN configurations demonstrate that the modified SISA approach enables effective class unlearning while preserving model performance and reducing retraining overhead. The findings highlight the potential of SISA-based unlearning for deployment in privacy-sensitive AI applications. The implementation is publicly available at this https URL sisa-class-unlearning.
[CV-40] GourNet: A CNN-Based Model for Mango Leaf Disease Detection
【速读】:该论文旨在解决芒果叶片疾病早期精准识别的问题,以提升病害防控效率并保障芒果产量与品质。其解决方案的关键在于提出了一种名为GourNet的深度学习模型,该模型基于卷积神经网络(Convolutional Neural Networks, CNN)架构,利用包含八个类别的MangoLeafBD(MBD)数据集进行训练与验证。通过图像预处理(如尺寸调整、归一化和数据增强)优化输入质量,并采用80%训练、10%验证和10%测试的数据划分策略确保评估可靠性。最终模型仅使用683,656个参数即实现了97%的分类准确率,体现了轻量化设计与高精度识别的有机结合。
链接: https://arxiv.org/abs/2604.27764
作者: Ekram Alam,Jaydip Sanyal,Akhil Kumar Das,Arijit Bhattacharya,Farhana Sultana
机构: Gour Mahavidyalaya(古尔学院); University of Gour Banga(古尔邦加大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mango cultivation is crucial in the agricultural sector, significantly contributing to economic development and food security. However, diseases affecting mango leaves can significantly reduce both the production and overall fruit grade. Detecting leaf diseases at an early stage with precision is key to effective disease prevention and sustaining crop productivity. In this paper, we introduce a “deep learning” model named “GourNet”, which leverages “Convolutional Neural Networks” to identify infections in mango leaves. We utilize the “MangoLeafBD” (MBD) dataset to train and assess the effectiveness of the presented model. The MBD dataset contains seven disease classes and a Healthy class, making a total of eight classes. To enhance model performance, the images are preprocessed through steps like resizing, rescaling, and data augmentation prior to training. To properly evaluate the model, the dataset is separated into 80% for training, with the remaining 20% equally split between validation and testing. Our model uses only 683,656 total parameters and achieves a classification accuracy of 97%. This research’s source code can be found at: this https URL.
[CV-41] Learning to Reason : Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition CVPR
【速读】:该论文旨在解决如何在深度神经网络中有效整合领域知识以提升模型泛化能力的问题,尤其针对现实视觉任务中难以获取显式符号规则的挑战。其解决方案的关键在于提出一种可微的知识单元(Differentiable Knowledge Unit, DKU),该单元通过隐式学习到的概念与任务类别之间的逻辑关系(基于蕴含规则)来调整分类器的logits,从而生成更精确的类别概率。DKU利用模糊推理计算逻辑调整向量,并通过反向传播优化主任务监督信号,间接训练概念分类器,实现无需标签的概念识别与知识集成,同时保证概念间的区分性和与类别的双向逻辑关联,从而构建清晰的监督信号促进概念学习。
链接: https://arxiv.org/abs/2604.27759
作者: Gurucharan Srinivas,Joshua Niemeijer,Frank Köster
机构: German Aerospace Center (DLR)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR Findings 2026
Abstract:Integrating domain knowledge into deep neural networks is a promising way to improve generalization. Existing methods either encode prior knowledge in the loss function or apply post-processing modules, but both depend on identifying useful symbolic knowledge to integrate. Since such rules are often unavailable in real-world vision tasks, we propose a method for targeted knowledge discovery. We propose a Differentiable Knowledge Unit (DKU) that enables modulating the classifier logits, yielding refined class probabilities. The DKU uses implication rules to represent relationships between task classes and implicit concepts learned entirely from the main task supervision, without requiring concept labels. Concepts are identified by dedicated classifiers, whose probabilities are passed to DKU alongside the primary class probabilities. DKU computes a logic-based adjustment vector via fuzzy inference, which modulates the primary class logits to yield refined class probabilities. When concept classifiers represent concepts that do not support the logical rule structure, the resulting adjustments to the class probabilities do not directly minimize the supervision loss. Consequently, optimizing the supervision loss on these adjusted class probabilities implicitly trains the concept classifiers. We construct the rule base so that bidirectional logical relations connect concepts and classes. We enforce the concepts to be distinct from each other and with respect to the classes. This design enforces a clean supervision signal for concept learning. We evaluate our methods on the PASCAL-VOC, COCO, and MedMNIST datasets. We demonstrate improvement through our knowledge integration across these datasets. We conduct domain generalization and hard-sample ablation studies and find that our implicit knowledge discovery and integration outperforms the baseline.
[CV-42] Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining CVPR2026
【速读】:该论文旨在解决测试时提示调优(Test-time Prompt Tuning, TPT)中模型校准性能差的问题,即TPT在使用无标签测试数据优化文本提示后,常导致预测置信度不可靠。研究表明,现有改进方法通过引入正则化项约束模型输出虽能提升校准效果,但往往牺牲了任务性能。作者发现这些正则化策略本质上会引导优化过程趋向于更平坦的损失景观(loss landscape)区域,而提示适应后的损失曲率(sharpness)是决定校准质量的关键因素。解决方案的核心在于提出一种无需标注数据且无额外计算开销的预训练框架——平坦性感知提示预训练(Flatness-aware Prompt Pretraining, FPP),其关键思想是在适应前将提示初始化于损失景观中更平坦的区域,从而显著提升校准性和性能,且仅需替换现有TPT流程中的提示初始化方式即可实现。
链接: https://arxiv.org/abs/2604.27715
作者: Hyeonseo Jang,Jaebyeong Jeon,Joong-Won Hwang,Kibok Lee
机构: Yonsei University (延世大学); ETRI (电子与电信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines–without modifying any other components–is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and incurs no additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code is available at: this https URL.
[CV-43] A generalised pre-training strategy for deep learning networks in semantic segmentation of remotely sensed images
【速读】:该论文旨在解决深度学习模型在遥感图像语义分割任务中因预训练数据与目标域之间存在显著领域差异(domain gap)而导致的性能瓶颈问题,尤其是在使用ImageNet等通用图像数据集进行预训练时,模型容易习得与遥感场景无关的特征,从而限制其在特定遥感场景下的泛化能力。解决方案的关键在于提出一种新颖且简单的预训练策略,通过引导模型在预训练阶段避免学习预训练数据集中存在的领域特异性特征(domain-specific features),从而增强预训练模型的跨域泛化能力。实验表明,该策略在多个具有不同场景和模态的遥感数据集(iSAID、MFNet、PST900、Potsdam)上均实现了最先进的分割精度,验证了其有效性与通用性。
链接: https://arxiv.org/abs/2604.27704
作者: Yuan Fang,Yuanzhi Cai,Jagannath Aryal,Qinfeng Zhu,Hong Huang,Cheng Zhang,Lei Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In the segmentation of remotely sensed images, deep learning models are typically pre-trained using large image databases like ImageNet before fine-tuned on domain-specific datasets. However, the performance of these fine-tuned models is often hindered by the large domain gaps (i.e., differences in scenes and modalities) between ImageNet’s images and remotely sensed images being processed. Therefore, many researchers have undertaken efforts to establish large-scale domain-specific image datasets for pre-training, aiming to enhance model performance. However, establishing such datasets is often challenging, requiring significant effort, and these datasets often exhibit limited generaliza-bility to other application scenarios. To address these issues, this study introduces a novel yet simple pre-training strategy designed to guide a model away from learning domain-specific features in a pre-training dataset during pre-training, thereby improving the generalisation ability of the pre-trained model. To evaluate the strategy’s effectiveness, deep learning models are pre-trained on ImageNet and subsequently fine-tuned on four semantic segmentation datasets with diverse scenes and modalities, including iSAID, MFNet, PST900 and Potsdam. Experimental results show that the proposed pre-training strategy led to state-of-the-art accuracies on all four datasets, namely 67.4% mIoU for iSAID, 56.9% mIoU for MFNet, 84.22% mIoU for PST900, 91.88% mF1 for Potsdam. This research lays the groundwork for developing a unified foundation model applicable to both computer vision and remote sensing applications.
[CV-44] RayFormer: Modeling Inter- and Intra-Ray Similarity for NeRF-Based Video Snapshot Compressive Imaging
【速读】:该论文旨在解决视频快照压缩感知(Video Snapshot Compressive Imaging, SCI)中因传统随机射线采样策略导致的内容结构信息丢失,进而限制重建质量的问题。其关键解决方案在于提出一种分块级射线采样策略(patch-level ray sampling),以显式建模场景内容的空间结构,并进一步设计了跨射线与射线内注意力机制的Transformer架构——RayFormer,用于捕捉空间邻近点间的跨射线结构相似性以及沿视图射线方向相邻点的内部相关性;同时,借助分块采样特性将总变差先验(total variation prior)引入优化目标函数,从而提升空间平滑性并抑制伪影,最终实现优于现有方法的重建性能。
链接: https://arxiv.org/abs/2604.27702
作者: Yubo Dong,Danhua Liu,Anqi Li,Zhenyuan Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video snapshot compressive imaging (SCI) enables the reconstruction of dynamic scenes from a single snapshot measurement. Recently, NeRF-based methods have shown promising reconstruction performance. However, such methods typically adopt random ray sampling strategies and fail to capture content structural similarities, resulting in limited reconstruction quality. To address these issues, we first propose a patch-level ray sampling strategy to enable the modeling of content structure. Then, we propose an Inter- and Intra-Ray Transformer (RayFormer) to capture the structural similarities, modeling both inter-ray similarities among spatially neighboring points at the same depth and intra-ray correlations between adjacent points along the viewing ray. Finally, benefiting from the patch-level sampling strategy, the total variation prior is incorporated into the objective function to enhance spatial smoothness and suppress artifacts. Experiments in both simulated and real-world scenes demonstrate that the proposed method achieves state-of-the-art (SOTA) reconstruction performance.
[CV-45] Deep Learning-Based Segmentation of Peritoneal Cancer Index Regions from CT Imaging
【速读】:该论文旨在解决腹膜转移瘤的评估中缺乏标准化影像学替代指标的问题,当前临床依赖有创的诊断性腹腔镜检查来计算Sugarbaker腹膜癌指数(sPCI),而其影像学对应指标——放射学腹膜癌指数(rPCI)虽已提出,但尚无自动化分割方法支持其广泛应用。解决方案的关键在于利用深度学习模型实现对rPCI标准解剖区域的自动分割,研究对比了nnU-Net与Swin UNETR在62例CT图像上的表现,结果显示nnU-Net整体Dice相似系数达0.82,接近人与人之间的评分一致性(0.88),显著优于Swin UNETR(0.76),验证了基于深度学习的自动化rPCI分割在非侵入性影像评估中的可行性。
链接: https://arxiv.org/abs/2604.27697
作者: Pieter C. Gort,Lotte J.S. Ewals,Marion W. Tops-Welten,Cris H.B. Claessens,Joost Nederend,Fons van der Sommen
机构: Eindhoven University of Technology (埃因霍温理工大学); Catharina Ziekenhuis (卡塔里娜医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for presentation at Computer Assisted Radiology and Surgery (CARS) 2026
Abstract:Peritoneal metastases are currently assessed using diagnostic laparoscopy to determine Sugarbaker’s Peritoneal Cancer Index (sPCI), which works by dividing the abdomen into 13 regions and scoring each region based on tumor size. A recent consensus study defined 3D regions to facilitate a radiological PCI (rPCI), providing standardized anatomical regions for imaging-based assessment. Despite its clinical value, sPCI is invasive and lacks a standardized imaging counterpart. In this study, we propose a deep learning-based approach to automatically segment the rPCI regions on CT. We evaluate nnU-Net and Swin UNETR on 62 CT scans with rPCI regions manually annotated by three clinical researchers and validated by two expert radiologists. Performance was assessed using five-fold cross-validation with the Dice Similarity Coefficient (Dice), 95th percentile Hausdorff distance and Average Surface Distance. nnU-Net achieved an overall Dice of 0.82, approaching interobserver agreement (0.88) and outperforming Swin UNETR (0.76), with remaining challenges primarily in right flank and small-bowel regions. These results demonstrate feasibility of automated rPCI segmentation, laying the foundation for non-invasive, imaging-based assessment.
[CV-46] MSR:Hybrid Field Modeling for CT-MRI Rigid-Deformable Registration of the Cervical Spine with an Annotated Dataset
【速读】:该论文旨在解决颈椎区域CT与MRI图像之间精确配准的问题,这一问题在术前规划中至关重要,因颈椎结构解剖复杂、个体差异大且易受椎动脉和脊髓损伤影响。现有方法对刚性-非刚性混合建模的研究不足,且高质量多模态标注数据稀缺限制了进展。解决方案的关键在于提出一种名为MSR的刚性-非刚性混合配准框架,并构建了一个全面标注的CT-MRI数据集R-D-Reg。MSR包含两个核心模块:一是用于单个椎体独立刚性对齐的刚性配准模块;二是基于Mamba的全局建模与Swin Transformer的局部建模相结合的可变形配准模块(通过自适应门控机制融合),最终将刚性与可变形形变场融合生成更优的混合形变场,从而更好地保持局部解剖一致性。
链接: https://arxiv.org/abs/2604.27654
作者: Bohai Zhang,Wenjie Chen,Mu Li,Kaixing Long,Xing Shen,Xinqiang Yao,Jincheng Yang,Jianting Chen,Wei Yang,Qianjin Feng,Lei Cao
机构: Southern Medical University (南方医科大学); Nanfang Hospital (南方医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate CT-MRI registration of the cervical spine is essential for preoperative planning because this region is anatomically complex,highly variable,and vulnerable to injury of the vertebral arteries and spinal cord. However,cervical CT-MRI registration remains underexplored,particularly for rigid-deformable hybrid modeling,and the lack of high-quality annotated multimodal data further limits progress. To address these challenges, we construct and release a comprehensively annotated CT-MRI dataset, R-D-Reg, and propose MSR, a rigid-deformable hybrid registration framework for complex joint structures. Specifically, MSR includes a rigid registration module for independent local rigid alignment of individual vertebrae and a deformable registration module with an MSL block that combines Mamba-based global modeling and Swin Transformer-based local modeling through adaptive gating. The rigid and deformable deformation fields are then fused to generate a hybrid field that better preserves local anatomical consistency. The code and dataset are publicly available at this https URL.
[CV-47] FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging
【速读】:该论文旨在解决传统推扫式高光谱成像(push-broom hyperspectral imaging)因采集速度慢而难以实现实时目标检测的问题,同时克服快照式光谱成像(snapshot spectral imaging)在捕获后需耗时重建导致的性能瓶颈。其解决方案的关键在于提出一种端到端的Focal U-shaped Network(FUN),通过多任务学习联合完成高光谱图像(HSI)重建与目标检测任务:共享U形主干网络中,重建提供底层光谱信息,检测引导语义感知先验学习,形成任务间互惠机制;更重要的是,引入焦点调制(focal modulation)作为自注意力机制的高效替代方案,在降低计算复杂度至线性级别的同时,实现空间与光谱特征的动态调制,从而构建无需自注意力机制的轻量化架构,显著提升实时边缘部署潜力。
链接: https://arxiv.org/abs/2604.27653
作者: Dahua Gao,Yubo Dong,Anqi Li,Zhenyuan Lin,Ang Gao,Danhua Liu,Guangming Shi
机构: School of Artificial Intelligence, Xidian University (西安电子科技大学人工智能学院); Changzhi Medical College (长治医学院); Engineering Research Centre for Intelligent Data Assisted Diagnosis and Treatment in Shanxi Province (山西省智能数据辅助诊疗工程研究中心); Uniwave Artificial Intelligence Technology Co., Ltd. (Uniwave人工智能科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: First work on exploring high-level computer vision tasks in compressive spectral imaging
Abstract:Conventional push-broom hyperspectral imaging suffers from slow acquisition speeds, precluding real-time object detection; in contrast, snapshot spectral imaging enables instantaneous hyperspectral images (HSIs) capture, making real-time object detection feasible, yet its potential is often compromised by time-consuming post-capture reconstruction. To address this issue, we propose the Focal U-shaped Network (FUN), a novel end-to-end framework that jointly performs HSI reconstruction and object detection via multi-task learning. FUN employs a shared U-shaped backbone, where reconstruction provides underlying spectral information while detection guides semantic-aware priors learning, facilitating mutually beneficial task interaction. Crucially, we introduce focal modulation, an efficient alternative to self-attention that modulates spatial and spectral features while reducing quadratic computational complexity, enabling a self-attention-free architecture for joint reconstruction and detection. Furthermore, we contribute a new HSI object detection dataset with 8712 annotated objects across 363 HSIs to facilitate evaluation of the proposed method. Experiments demonstrate that FUN achieves state-of-the-art performance on both tasks, using 40% fewer parameters and 30% less computation than recent alternatives, making it promising for future real-time edge deployment. The code and datasets are available: this https URL.
[CV-48] Robot Learning from Human Videos: A Survey
【速读】:该论文旨在解决具身人工智能(Embodied AI)与机器人领域中因数据规模受限而难以实现技能扩展的关键瓶颈问题。其核心解决方案在于利用人类活动视频数据作为学习资源,通过从大量可获取的人类示范中被动地提取和迁移操作技能,从而推动通用型机器人系统的可扩展学习。关键创新点在于构建了一个分层的分类体系,将人类视频转化为机器人技能的路径划分为任务导向、观测导向和动作导向三种方式,并系统分析了不同数据配置与学习范式之间的耦合关系,同时梳理了常用的人类视频数据集及生成方法的发展趋势,为未来研究提供了清晰的方向指引。
链接: https://arxiv.org/abs/2604.27621
作者: Junyi Ma,Erhang Zhang,Haoran Yang,Ditao Li,Chenyang Xu,Guangming Wang,Hesheng Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper list: this https URL
Abstract:A critical bottleneck hindering further advancement in embodied AI and robotics is the challenge of scaling robot data. To address this, the field of learning robot manipulation skills from human video data has attracted rapidly growing attention in recent years, driven by the abundance of human activity videos and advances in computer vision. This line of research promises to enable robots to acquire skills passively from the vast and readily available resource of human demonstrations, substantially favoring scalable learning for generalist robotic systems. Therefore, we present this survey to provide a comprehensive and up-to-date review of human-video-based learning techniques in robotics, focusing on both human-robot skill transfer and data foundations. We first review the policy learning foundations in robotics, and then describe the fundamental interfaces to incorporate human videos. Subsequently, we introduce a hierarchical taxonomy of transferring human videos to robot skills, covering task-, observation-, and action-oriented pathways, along with a cross-family analysis of their couplings with different data configurations and learning paradigms. In addition, we investigate the data foundations including widely-used human video datasets and video generation schemes, and provide large-scale statistical trends in dataset development and utilization. Ultimately, we emphasize the challenges and limitations intrinsic to this field, and delineate potential avenues for future research. The paper list of our survey is available at this https URL.
[CV-49] SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation ACM-MM2026
【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)中如何有效利用视觉语言模型(Vision-Language Models, VLMs)来增强代理在未见3D环境中的动态空间感知能力的问题。现有方法往往缺乏对导航动作因果推理(backward action reasoning,即“为什么执行该动作”)与未来状态预测(forward transition prediction,即“下一步会发生什么”)的联合建模,导致模型难以在复杂环境中进行长程推理和精准定位。解决方案的关键在于提出SpaAct训练框架,通过引入两个轻量级的空间激活任务——动作回溯(Action Retrospection)和未来帧选择(Future Frame Selection),分别监督模型对历史视觉变化反推动作序列的能力以及基于历史和当前动作预测下一帧视觉变化的能力,从而在VLM友好方式下激发其动态空间意识;同时设计TriPA(Tri-factor Progressive Adaptive)课程学习策略,按难度渐进式组织训练样本,使模型从基础移动技能逐步过渡到长视野导航推理,显著提升性能并达到当前最优水平。
链接: https://arxiv.org/abs/2604.27620
作者: Pengna Li,Kangyi Wu,Shaoqing Xu,Fang Li,Hanbing Li,Lin Zhao,Kailin Lyu,Long Chen,Zhi-Xin Yang,Nanning Zheng
机构: Xi’an Jiaotong University (西安交通大学); University of Macau (澳门大学); Xiaomi EV (小米汽车); Beijing Institute of Technology (北京理工大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submmited to ACM MM 2026
Abstract:Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring such awareness, namely backward action reasoning (why) and forward transition prediction~(how). Based on this insight, we propose SpaAct, a simple yet effective training framework that activates the dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection, which asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection, which forces the model to predict the visual transitions conditioned on history and action. These two objectives provide lightweight supervision on both backward action reasoning and forward transition prediction, encouraging the model to build dynamic spatial awareness in a VLM-friendly way. To further stabilize adaptation, we design TriPA, a Tri-factor Progressive Adaptive curriculum learning method that organizes training samples from easy to hard, allowing the model to gradually acquire navigation skills from basic locomotion to long-horizon reasoning. Experiments on standard VLN-CE benchmarks show that SpaAct consistently improves VLM-based navigation and achieves state-of-the-art performance. We will release the code and models to support future research.
[CV-50] Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection
【速读】:该论文旨在解决无人机(UAV)在桥梁结构健康监测中进行裂缝自动检测时面临的四大挑战:裂缝特征微弱、成像条件退化、类别严重不平衡以及实际巡检流程中计算资源受限的问题。其解决方案的关键在于提出一个统一的轻量化卷积神经网络框架,包含四个协同工作的组件:轻量级主干网络、用于通道与空间增强的卷积块注意力模块(CBAM)、基于巡检场景先验的定向鲁棒数据增强策略,以及针对类别不平衡问题的Focal Loss损失函数。该设计在保持高推理速度(825 FPS)和低参数量(11.21M)的同时显著提升了模型性能,F1分数和召回率分别提升2.51%和3.95%,并通过Grad-CAM可视化验证了注意力机制有效聚焦于裂缝轨迹而非散乱区域,实现了精度、速度与鲁棒性的良好平衡,适用于地面站辅助的实时部署场景。
链接: https://arxiv.org/abs/2604.27617
作者: Wei Li,Haisheng Li,Weijie Li,Jiandong Wang,Kaichen Ma,Luming Yang
机构: Guangdong Provincial Highway Construction Co., Ltd. (广东省公路建设有限公司); Guangdong AIHISUN Technology Co., Ltd. (广东艾希Sun科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic crack detection has become a major research focus. However, practical UAV inspections still face four key challenges: weak crack features, degraded imaging conditions, severe class imbalance, and limited computational resources for practical UAV inspection workflows. To address these issues, this paper proposes a unified lightweight convolutional neural network framework composed of four synergistic components: a lightweight backbone network, a Convolutional Block Attention Module (CBAM) for channel and spatial enhancement, a directed robust augmentation strategy based on inspection-scene priors, and Focal Loss for hard-sample learning under class imbalance. Experiments on the SDNET2018 bridge deck dataset show that the proposed method achieves an inference speed of 825 FPS with only 11.21M parameters and 1.82G FLOPs. Compared with the baseline model, the complete framework improves the F1-score by 2.51% and recall by 3.95%. In addition, Grad-CAM visualizations indicate that the introduced attention module shifts the model’s focus from scattered regions to precise tracking along crack trajectories. Overall, this study achieves a strong balance among accuracy, speed, and robustness, providing a practical solution for ground-station assisted real-time deployment in UAV bridge inspections. The source code is available at: this https URL .
[CV-51] ZAYAN: Disentangled Contrastive Transformer for Tabular Remote Sensing Data ICPR2026
【速读】:该论文旨在解决遥感与环境科学中表格数据(tabular data)表示学习的挑战,包括特征异质性、标签稀缺以及特征冗余等问题。其核心解决方案是提出ZAYAN(Zero-Anchor dYnamic feAture eNcoding),一个以特征为中心的自监督对比学习框架,关键在于在特征层面而非样本层面进行对比学习,从而避免显式锚点选择和对类别标签的依赖,同时通过动态扰动与掩码机制促进冗余最小化和解耦嵌入空间的构建。该方法由两个模块组成:ZAYAN-CL用于预训练特征嵌入,ZAYAN-T则基于这些嵌入进行下游分类任务,实验证明其在标签稀缺和分布偏移场景下均表现出更优的准确性、鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2604.27606
作者: Al Zadid Sultan Bin Habib,Tanpia Tasnim,Md. Ekramul Islam,Muntasir Tabasum
机构: West Virginia University (西弗吉尼亚大学); Green University of Bangladesh (孟加拉国绿色大学); Stamford University Bangladesh (孟加拉国斯坦福大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the 28th International Conference on Pattern Recognition (ICPR 2026) at Lyon, France. Code available at this https URL . PyPI package: pip install zayan
Abstract:Learning informative representations from tabular data in remote sensing and environmental science is challenging due to heterogeneity, scarce labels, and redundancy among features. We present ZAYAN (Zero-Anchor dYnamic feAture eNcoding), a self-supervised, feature-centric contrastive framework for tabular data. ZAYAN performs contrastive learning at the feature rather than sample level, removing the need for explicit anchor selection and any reliance on class labels, while encouraging a redundancy-minimized, disentangled embedding space. The framework has two modules: ZAYAN-CL, which pretrains feature embeddings via a zero-anchor contrastive objective with dynamic perturbations and masking, and ZAYAN-T, a Transformer that conditions on these embeddings for downstream classification. Across eight datasets, including six remote-sensing tabular benchmarks and two remote-sensing-driven flood-prediction tables from satellite and GIS products, ZAYAN achieves superior accuracy, robustness, and generalization over tabular deep learning baselines, with consistent gains under label scarcity and distribution shift. These results indicate that feature-level contrastive learning and dynamic feature encoding provide an effective recipe for learning from tabular sensing data.
[CV-52] Decoding Scientific Experimental Images: The SPUR Benchmark for Perception Understanding and Reasoning ACL2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在科学实验图像感知、理解与推理能力上的显著不足问题,尤其是在面对复杂科研图像时难以达到人类专家水平的挑战。其解决方案的关键在于构建了一个名为SPUR的综合性基准测试集,包含4,264个问答对,源自1,084张由专家精心筛选的科学图像,并从三个维度进行系统评估:(1) 面板级细粒度感知(数值、形态和信息定位),(2) 跨面板关系理解(平均每个样本含14.3个面板),以及(3) 专家级推理(涵盖五种实验范式,评估定性与定量推理能力)。实证结果显示,现有20种MLLMs及四种多模态思维链(MCoT)方法均未达到专家水平,揭示了AI for Science(AI4S)研究中的关键瓶颈。
链接: https://arxiv.org/abs/2604.27604
作者: Junpeng Ding,Zichen Tang,Haihong E,Mengyuan Ji,Yang Liu,Haolin Tian,Haiyang Sun,Pengqi Sun,Yang Xu,Yichen Liu,Haocheng Gao,Zijie Xi,Ruomeng Jiang,Peizhi Zhao,Rongjin Li,Yuanze Li,Jiacheng Liu,Zhongjun Yang,Jintong Chen,Siying Lin
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
备注: Accepted to ACL 2026 Main Conference
Abstract:We introduce SPUR, a comprehensive benchmark for scientific experimental image perception, understanding, and reasoning, comprising 4,264 question-answering (QA) pairs derived from 1,084 expert-curated images. SPUR features three key innovations: (1) Panel-Level Fine-Grained Perception: evaluating the visual perception of multimodal large language models (MLLMs) across three dimensions (numerical, morphological, and information localization) on six fine-grained panel types; (2) Cross-Panel Relation Understanding: utilizing complex images with an average of 14.3 panels per sample to evaluate MLLMs’ ability to decipher intricate cross-panel relations; (3) Expert-Level Reasoning: assessment of qualitative and quantitative reasoning across five experimental paradigms to determine if models can infer conclusions from evidence as human experts do. Comprehensive evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods reveals that current models fall significantly short of the expert-level requirements for scientific image interpretation, underscoring a critical bottleneck in AI for Science (AI4S) research.
[CV-53] SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning CVPR2026
【速读】:该论文旨在解决开放世界半监督学习(Open-world Semi-supervised Learning, OWSSL)中模型无法直接从候选文本标签集中准确预测语义相关标签的问题。现有方法因缺乏对新类别(novel classes)的显式监督信号,且未能有效提取和对齐跨模态语义表示,导致预测标签与候选文本标签之间缺乏语义对应关系。解决方案的关键在于提出SEmantic Capture for Open-world Semi-supervised learning (SECOS),其通过引入外部知识来提取并对齐已知类和新类别的多模态语义表示,从而为新类别提供显式的监督信号,使模型能够直接预测候选文本标签而无需后处理,满足实际应用中对精确分类的需求。
链接: https://arxiv.org/abs/2604.27596
作者: Hezhao Liu,Jiacheng Yang,Junlong Gao,Mengke Li,Yiqun Zhang,Shreyank N Gowda,Yang Lu
机构: Xiamen University (厦门大学); Shenzhen University (深圳大学); Guangdong University of Technology (广东工业大学); University of Nottingham (诺丁汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:In open-world semi-supervised learning (OWSSL), a model learns from labeled data and unlabeled data containing both known and novel classes. In practical OWSSL applications, models are expected to perform rigorous classification by directly selecting the most semantically relevant label from a candidate set for each sample. Existing OWSSL methods fail to achieve this because novel samples are trained without explicit supervision, and these methods lack mechanisms to extract latent semantic information, resulting in predicted labels that have no semantic correspondence to candidate textual labels. To address this, we introduce SEmantic Capture for Open-world Semi-supervised learning (SECOS), which directly predicts textual labels from the candidate set without post-processing, meeting the requirements of practical OWSSL applications. SECOS leverages external knowledge to extract and align semantic representations across modalities for both known and novel classes, providing explicit supervisory signals for training novel classes. Extensive experiments demonstrate that even when existing OWSSL methods are evaluated under the more lenient post-hoc matching setting, SECOS still surpasses them by up to 5.4% without such assistance, highlighting its superior effectiveness. Code is available at this https URL.
[CV-54] ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval
【速读】:该论文旨在解决视频片段检索(video moment retrieval)任务中,现有模型因仅在片段级别(snippet-level)计算视觉-语言相似度而忽略多个匹配片段间语义关系的问题,导致模型易受上下文干扰且难以排除无关片段。其解决方案的关键在于提出ClipTBP框架,通过引入片段级对齐损失(clip-level alignment loss)显式建模多个答案片段间的语义关联,并结合主边界损失与辅助边界损失实现更精准的时序边界预测,从而提升模型在模糊查询场景下的鲁棒性与性能。
链接: https://arxiv.org/abs/2604.27591
作者: Ji-Hyeon Kim,Ho-Joong Kim,Seong-Whan Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages
Abstract:Video moment retrieval is the task of retrieving specific segments of a video corresponding to a given text query. Recent studies have been conducted to improve multimodal alignment performance through visual-linguistic similarity learning at the snippet-level and transformer-based temporal boundary regression. However, existing models do not calculate similarity by considering the relationships between multiple answer segments that match the query. Therefore, existing models are easily influenced by visually similar segments in the surrounding context. Existing models calculate similarity at the snippet-level and ignore the relationships between multiple answer segments corresponding to a single query. Therefore, they struggle to exclude segments irrelevant to the query. To address this issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on boundary-aware learning. ClipTBP introduces a clip-level alignment loss for explicitly learning the semantic relationship between answer segments. ClipTBP also predicts accurate temporal boundaries by applying both main boundary loss and auxiliary boundary loss. ClipTBP consistently improves performance when applied to various existing models and demonstrates more robust boundary prediction performance even in ambiguous query scenarios.
[CV-55] Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering ICPR2026
【速读】:该论文旨在解决3D伪造内容(3D fake content)的检测问题,即如何识别由3D重建与神经渲染技术(如3D Gaussian Splatting)生成的、具有高度视觉真实性的篡改场景。当前主流的伪造检测方法主要局限于2D空间,难以有效区分原始图像与经过几何、外观或空间布局操控后的3D渲染图像。解决方案的关键在于提出Fake3DGS数据集和一种面向3D感知的检测方法:前者提供了基于3D Gaussian Splatting的可控篡改场景及其多视角渲染结果,后者利用多视角一致性以及从高斯点云表示中提取的特征来增强对3D篡改痕迹的判别能力,从而显著提升对修改后3D内容的识别准确率,验证了超越传统2D证据的必要性。
链接: https://arxiv.org/abs/2604.27590
作者: Davide Di Nucci,Riccardo Catalini,Guido Borghi,Roberto Vezzani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICPR 2026. Code and data: this https URL
Abstract:Recent advances in 3D reconstruction and neural rendering,particularly 3D Gaussian Splatting, make it feasible and simple to edit 3D scenes and re-render them as highly realistic images. Therefore, security concerns arise regarding the authenticity of 3D content. Despite this threat, 3D fake detection remains largely unexplored in the literature, and most existing work is limited to 2D space. Therefore, in this paper, we formalize the concept of 3D fake detection and introduce Fake3DGS, a dataset of 3D Gaussian splatting scenes and corresponding rendered views, where fake images are produced by controlled manipulations of geometry, appearance, and spatial layout, while preserving high visual realism. Using this benchmark, we demonstrate that current state-of-the-art 2D detectors struggle to distinguish between original and 3D manipulated images. To bridge this gap, we introduce a 3D-aware detection method that leverages multi-view coherence and features derived from the Gaussian splatting representation. Experimental results demonstrate a substantial improvement in recognizing modified 3D content, underscoring the validity of the new dataset and the necessity for authenticity assessment techniques that extend beyond 2D evidence. Code and data are publicly released for future investigations.
[CV-56] Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark
【速读】:该论文旨在解决胰腺导管腺癌(pancreatic ductal adenocarcinoma, PDAC)术前血管侵犯(vascular invasion, VI)评估中因图像边界模糊和专家间一致性差导致的AI模型可靠性不足问题。其核心解决方案是构建CURVAS-PDACVI数据集与挑战赛,提供每例扫描五名独立专家标注的密集标注数据,并提出一种多指标评价框架,不仅关注空间重叠度,还引入概率校准和VI判读能力作为关键评估维度。实验表明,单纯优化全局体积重叠的方法在高复杂度、低共识区域表现不佳,而能建模专家分歧的不确定性感知概率模型则展现出更强鲁棒性,从而推动了面向临床决策的预后评估从传统分割精度向局部边界可靠性转变。
链接: https://arxiv.org/abs/2604.27582
作者: M. Riera-Marín,O. K. Sikha,J. Rodríguez-Comas,M. S. May,T. Kirscher,X. Coubez,P. Meyer,S. Faisan,Z. Pan,X. Zhou,X. Liang,C. Hémon,V. Boussot,J.-L. Dillenseger,J.-C. Nunes,K.-C. Kahl,C. Lüth,J. Traub,P.-H. Conze,M. M. Duh,A. Aubanell,R. de Figueiredo Cardoso,S. Egger-Hackenschmidt,J. García-López,M. A. González-Ballester,A. Galdran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical resection remains the only potentially curative treatment for pancreatic ductal adenocarcinoma (PDAC), and eligibility depends on accurate assessment of vascular invasion (VI), i.e., tumor extension into adjacent critical vessels. Despite its importance for preoperative staging and surgical planning, computational VI assessment remains underexplored. Two major challenges are the lack of public datasets and the diagnostic ambiguity at the tumor-vessel interface, which leads to substantial inter-rater variability even among expert radiologists. To address these limitations, we introduce the CURVAS-PDACVI Dataset and Challenge, an open benchmark for uncertainty-aware AI in PDAC staging based on a densely annotated dataset with five independent expert annotations per scan. We also propose a multi-metric evaluation framework that extends beyond spatial overlap to include probabilistic calibration and VI assessment. Evaluation of six state-of-the-art methods shows that strong global volumetric overlap does not necessarily translate into reliable performance at clinically critical tumor-vessel interfaces. In particular, methods optimized for binary segmentation perform competitively on average overlap metrics, but often degrade in high-complexity cases with low expert consensus, either collapsing in volume or overextending at uncertain boundaries. In contrast, methods that model inter-rater disagreement produce better calibrated probabilistic maps and show greater robustness in these ambiguous cases. The benchmark highlights the limitations of volumetric accuracy as a proxy for localized surgical utility, motivating uncertainty-aware probabilistic models for preoperative decision-making.
[CV-57] World2Minecraft: Occupancy-Driven Simulated Scenes Construction
【速读】:该论文旨在解决现有高保真仿真环境在支持感知与决策任务时存在的数据污染(data contamination)和灵活性不足的问题,以及当前3D语义占据预测(3D semantic occupancy prediction)模型因数据稀缺和泛化能力差而导致重建质量受限的挑战。其解决方案的关键在于提出World2Minecraft框架,该框架基于3D语义占据预测将真实世界场景转化为结构化的Minecraft环境,并构建了一个低成本、自动化且可扩展的数据采集管道,用于生成定制化的占据数据集MinecraftOcc——该数据集包含156个丰富室内场景的100,165张图像,显著提升了占据预测性能并为个性化具身智能(embodied intelligence)研究提供了可编辑、可定制的平台。
链接: https://arxiv.org/abs/2604.27578
作者: Lechao Zhang,Haoran Xu,Jingyu Gong,Xuhong Wang,Yuan Xie,Xin Tan
机构: East China Normal University (华东师范大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Embodied intelligence requires high-fidelity simulation environments to support perception and decision-making, yet existing platforms often suffer from data contamination and limited flexibility. To mitigate this, we propose World2Minecraft to convert real-world scenes into structured Minecraft environments based on 3D semantic occupancy prediction. In the reconstructed scenes, we can effortlessly perform downstream tasks such as Vision-Language Navigation(VLN). However, we observe that reconstruction quality heavily depends on accurate occupancy prediction, which remains limited by data scarcity and poor generalization in existing models. We introduce a low-cost, automated, and scalable data acquisition pipeline for creating customized occupancy datasets, and demonstrate its effectiveness through MinecraftOcc, a large-scale dataset featuring 100,165 images from 156 richly detailed indoor scenes. Extensive experiments show that our dataset provides a critical complement to existing datasets and poses a significant challenge to current SOTA methods. These findings contribute to improving occupancy prediction and highlight the value of World2Minecraft in providing a customizable and editable platform for personalized embodied AI research. Project page:this https URL.
[CV-58] RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation ALT
【速读】:该论文旨在解决放射学报告生成(Radiology Report Generation, RRG)中图像与文本之间细粒度对齐不足的问题,尤其是现有方法将报告视为扁平序列而忽略其段落、句子和词语层级结构,导致跨模态对齐不精确,进而影响生成报告的准确性。解决方案的关键在于提出一种端到端的RIHA(Report-Image Hierarchical Alignment Transformer)框架,通过引入视觉特征金字塔(Visual Feature Pyramid, VFP)和文本特征金字塔(Text Feature Pyramid, TFP),实现图像与报告在段落、句子和词三个层次上的多级对齐;并通过交叉模态层次对齐(Cross-modal Hierarchical Alignment, CHA)模块结合最优传输(optimal transport)机制,强化不同粒度下的跨模态映射能力,同时采用相对位置编码(Relative Positional Encoding, RPE)增强解码器对token间空间与语义关系的建模,从而显著提升生成报告的自然语言质量和临床有效性。
链接: https://arxiv.org/abs/2604.27559
作者: Yucheng Chen,Yang Yu,Yufei Shi,Conghao Xiong,Xulei Yang,Si Yong Yeo
机构: MedVisAI Lab, Lee Kong Chian School of Medicine, Nanyang Technological University (NTU), and Centre of AI in Medicine, Singapore; Department of Machine Intellection, Institute for Infocomm Research (I2R), Agency for Science, Technology and Research, (A*STAR), Singapore; Department of Computer Science and Engineering, the Chinese University of Hong Kong, Hong Kong SAR, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by Journal of Biomedical and Health Informatics (JBHI)
Abstract:Radiology report generation (RRG) has emerged as a promising approach to alleviate radiologists’ workload and reduce human errors by automatically generating diagnostic reports from medical images. A key challenge in RRG is achieving fine-grained alignment between complex visual features and the hierarchical structure of long-form radiology reports. Although recent methods have improved image-text representation learning, they often treat reports as flat sequences, overlooking their structured sections and semantic hierarchies. This simplification hinders precise cross-modal alignment and weakens RRG accuracy. To address this challenge, we propose RIHA (Report-Image Hierarchical Alignment Transformer), a novel end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. This hierarchical alignment enables more precise cross-modal mapping, essential for capturing the nuanced semantics embedded in clinical narratives. Specifically, RIHA introduces a Visual Feature Pyramid (VFP) to extract multi-scale visual features and a Text Feature Pyramid (TFP) to represent multi-granularity textual structures. These components are integrated through a Cross-modal Hierarchical Alignment (CHA) module, leveraging optimal transport to effectively align visual and textual features across various levels. Furthermore, we incorporate Relative Positional Encoding (RPE) into the decoder to model spatial and semantic relationships among tokens, enhancing the token-level alignment between visual features and generated text. Extensive experiments on two benchmark chest X-ray datasets, IU-Xray and MIMIC-CXR, demonstrate that RIHA outperforms existing state-of-the-art models in both natural language generation and clinical efficacy metrics.
[CV-59] Revealing the Impact of Visual Text Style on Attribute-based Descriptions Produced by Large Visual Language Models ICMR2026
【速读】:该论文旨在解决视觉文本风格(如功能性字体与装饰性字体)是否以及如何影响大型视觉语言模型(LVLM)对概念属性描述的问题。研究发现,即使LVLM能够准确识别图像中文本所指的概念,文本的视觉风格仍会显著干扰其生成的属性描述,表明存在非平凡的风格泄露现象。解决方案的关键在于揭示这种风格对语义推理的隐式影响,并提出需在评估和优化阶段引入风格感知机制,以提升LVLM在多媒体系统中的鲁棒性和语义准确性。
链接: https://arxiv.org/abs/2604.27553
作者: Xiaomeng Wang,Martha Larson,Zhengyu Zhao
机构: Radboud University (奈梅亨大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICMR 2026. Code is available at this https URL
Abstract:When the visual style of text is considered, a wide variety can be observed in font, color, and size. However, when a word is read, its meaning is independent of the style in which it has been written or rendered. In this paper, we investigate whether, and how, the style in which a word is visualized in an image impacts the description that a Large Visual Language Model (LVLM) provides for the concept to which that word refers. Specifically, we investigate how functional text styles (readability-oriented, e.g., black sans-serif) versus decorative styles (display-oriented, e.g., colored cursive/script) affect LVLMs’ descriptions of a concept in terms of the attributes of that concept. Our experiments study the situation in which the LVLM is able to correctly identify the concept referred to by a visual text, i.e., by a word or words rendered as an image, and in which the visual text style should not influence the attribute-based description that the LVLM produces. Our experimental results reveal that even when the concept is correctly identified, text style influences the model’s attribute-based descriptions of the concept. Our findings demonstrate non-trivial style leakage from text style into semantic inference and motivate style-aware evaluation and mitigation for LVLM-based multimedia systems.
[CV-60] Residual Gaussian Splatting for Ultra Sparse-View CBCT Reconstruction
【速读】:该论文旨在解决在超稀疏视角条件下,基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)的锥形束计算机断层成像(cone-beam computed tomography, CBCT)重建中因光度优化固有的谱偏差(spectral bias)导致的过度平滑和高频解剖细节丢失问题。其解决方案的关键在于提出残差高斯泼溅(Residual Gaussian Splatting, RGS),通过构建一种频谱解耦的高斯表示:将体素场分解为几何基础分量与残差细节分量,从而将显式的高频拟合转化为物理一致的隐式残差补偿任务;同时设计频谱-空间协同优化策略,协调几何锚定与纹理精化之间的相互作用,有效抑制频谱串扰,实现复杂骨小梁和血管结构中伪影抑制与细节保留之间的平衡。
链接: https://arxiv.org/abs/2604.27552
作者: Jian Lin,Jiancheng Fang,Shaoyu Wang,Changan Lai,Yikun Zhang,Yang Chen,Qiegen Liu
机构: Nanchang University (南昌大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While 3D Gaussian splatting (3DGS) offers explicit and efficient scene representations for cone-beam computed tomography reconstruction, conventional photometric optimization inherently suffers from spectral bias under ultra sparse-view conditions, leading to over-smoothing and a loss of high-frequency anatomical details. Since wavelet transforms provide rich high-frequency information and have been widely utilized to enhance sparse reconstruction, this work integrates wavelet multi-resolution analysis with 3DGS. To circumvent the mathematical mismatch between the strict non-negativity of physical X-ray attenuation and the bipolar nature of high-frequency wavelet coefficients, we propose Residual Gaussian Splatting (RGS). Methodologically, we introduce a spectrally-decoupled Gaussian representation that stratifies the volumetric field into a geometric base component and a residual detail component. This decomposition systematically transforms explicit high-frequency fitting into a physically consistent, implicit residual compensation task. Furthermore, we devise a spectral-spatial collaborative optimization strategy to coordinate the interplay between geometric anchoring and texture refinement, effectively preventing spectral crosstalk. Extensive experiments on clinical datasets demonstrate that RGS enables the reconstructed images to capture highly refined geometric textures. It successfully resolves the trade-off between artifact suppression and detail preservation, yielding superior visual fidelity in complex trabecular and vascular structures compared to existing neural rendering baselines.
[CV-61] Self-Supervised Learning of Plant Image Representations
【速读】:该论文旨在解决植物图像识别中因依赖专家标注数据而导致的监督学习局限性问题,提出通过自监督学习(Self-supervised Learning, SSL)构建可扩展的植物物种识别模型。其解决方案的关键在于:首先,识别并摒弃传统SSL中对植物图像有害的增强策略(如高斯模糊、灰度化和太阳化),这些操作会破坏细粒度识别所需的细微判别特征;其次,引入更适合植物图像特性的替代增强方法(如仿射变换和海报化),从而提升表示学习效果;最后,强调使用领域特定数据集(如iNaturalist 2021 Plantae子集)训练SSL模型的重要性,相较于ImageNet-1K,能显著提升下游任务性能,尤其在少样本场景下优于强监督基线模型(如Pl@ntCLEF和BioCLIP)。
链接: https://arxiv.org/abs/2604.27538
作者: Ilyass Moummad,Kawtar Zaher,Hervé Goëau,Jean-Christophe Lombardo,Pierre Bonnet,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated plant recognition plays a crucial role in biodiversity monitoring and conservation, yet current approaches rely heavily on supervised learning, which is limited by the availability of expert-labeled data. Self-supervised learning (SSL) offers a scalable alternative, but existing methods and training protocols are largely designed for coarse-grained visual tasks and may not transfer well to fine-grained domains such as plant species recognition. In this work, we investigate SSL for plant image representation learning. We show that commonly used augmentations in SSL pipelines - such as Gaussian blur, grayscale conversion, and solarization - are detrimental in the context of plant images, as they remove subtle discriminative cues essential for fine-grained recognition. We instead identify alternative transformations, including affine and posterization, that are better suited to this domain. We further demonstrate that training SimDINOv2 on the iNaturalist 2021 Plantae subset yields significantly stronger representations than training on ImageNet-1K, highlighting the importance of domain-specific data for SSL. Our findings are consistent across both ViT-Base and ViT-Large architectures. Moreover, our models achieve competitive performance and sometimes outperform strong supervised baselines Pl@ntCLEF and BioCLIP on downstream plant recognition tasks in few-shot settings. Overall, our results highlight the critical importance of domain-adapted augmentation strategies and dataset selection in self-supervised learning, and provide practical guidelines for building scalable models for biodiversity monitoring.
[CV-62] Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)可解释性领域中一个长期未被验证的核心假设——即视觉编码器通过抑制背景像素、分类器仅从清理后的特征池中选择来实现分类的“空间漏斗假说”(Spatial Funnel Hypothesis)。现有可视化工具因存在空间幻觉(spatial hallucinations),无法提供像素级证据支持或反驳该假说。论文提出了一种无幻觉的图像重构框架,其关键在于引入幅度-相位解耦(magnitude-phase decoupling)与局部伴随校正器(Local Adjoint Correctors),数学上保证每个重构的空间梯度支持严格来自真实激活通道。基于此几何探测工具,作者首次在像素层面揭示了视觉编码器中强烈的叠加效应(strong superposition):单通道重构呈现均匀全息特性(holographic),正负权重重建在视觉和能量上不可区分,但其代数和集中于前景区域,从而证明分类机制依赖于破坏性干涉(destructive interference)——分类器权重在像素空间中抵消共享背景方向并构造类判别残差,直接证伪了空间漏斗假说。该干涉模型进一步将可容许干涉子空间的体积定义为控制通道需求的几何量,并证明该体积与全局平均池化(GAP)协方差行列式对偶,由此导出具有(1−1/e)近似保证的协方差-体积通道选择算法,且能定量揭示分布外(OOD)失败源于干涉体积塌缩。该框架无需重新训练即可无缝扩展至基于注意力的模块。
链接: https://arxiv.org/abs/2604.27529
作者: Kaixiang Shu
机构: Chongqing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A foundational assumption in CNN interpretability – that deep encoders suppress background pixels while classifiers merely select from a cleaned feature pool (the Spatial Funnel Hypothesis) – remains untested due to spatial hallucinations in existing visualization tools. We address this by introducing a hallucination-free inversion framework built on magnitude-phase decoupling and Local Adjoint Correctors. Our method mathematically guarantees that the spatial gradient support of every reconstruction stems strictly from genuinely active channels. Using this framework as a geometric probe, we uncover the first pixel-level evidence of strong superposition in vision encoders. We show that per-channel inversions are uniformly holographic: positive and negative weight reconstructions are visually and energetically indistinguishable. However, their algebraic sum sharply concentrates on the foreground. This proves classification operates via destructive interference – classifier weights cancel a shared background direction in pixel space and constructively assemble class-discriminative residuals, directly falsifying the Spatial Funnel Hypothesis. This interference model identifies the volume of the admissible interference subspace as the geometric quantity governing channel requirements. We prove this volume is dual to the GAP covariance determinant, yielding a covariance-volume channel selection algorithm with a (1-1/e) approximation guarantee. This algorithm mathematically reveals out-of-distribution (OOD) failure as a measurable collapse of the covariance volume essential for interference-based classification. Our framework extends seamlessly to attention-based heads without retraining. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.27529 [cs.CV] (or arXiv:2604.27529v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.27529 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-63] FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在数据统计异构性(statistical heterogeneity)下性能下降的问题。现有聚类方法通常依赖原始数据统计量、模型参数或启发式相似性度量,难以捕捉跨异构域的类别级语义结构,且常需迭代协调,效率低下。其解决方案的关键在于提出一种“一次-shot”、面向类别的客户端聚类框架FMCL,利用预训练基础模型(foundation model)提取每个客户端的类别级嵌入原型(class-level embedding prototypes),通过余弦距离计算类感知表示间的相似性进行聚类;该方法在训练前一次性完成聚类,不引入额外通信开销,且对下游模型架构无感,从而在非独立同分布(non-IID)数据划分下显著提升联邦性能与聚类稳定性。
链接: https://arxiv.org/abs/2604.27510
作者: Mahad Ali,Laura J. Brattain
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 2 figures
Abstract:Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, yet its performance deteriorates under statistical heterogeneity. Clustered Federated Learning addresses this challenge by grouping similar clients and training separate models per cluster. However, existing clustering strategies often rely on raw data statistics, model parameters, or heuristic similarity measures that fail to capture class-level semantic structure across heterogeneous domains and frequently require iterative coordination. We propose FMCL, a one-shot, class-aware client clustering framework that leverages foundation model representations to construct semantic client signatures. Using a frozen foundation model, FMCL computes class-level embedding prototypes for each client and measures similarity via cosine distance between their class-aware representations. Clustering is performed once prior to training, introducing no additional communication during federated optimization and remaining agnostic to the downstream model architecture. Extensive experiments across heterogeneous benchmarks demonstrate that FMCL improves federated performance and yields more stable clustering behavior compared to existing clustering-based methods under non-identically distributed data partitioning.
[CV-64] Leverag ing Verifier-Based Reinforcement Learning in Image Editing
【速读】:该论文旨在解决图像编辑任务中缺乏通用且可靠的奖励模型(reward model)的问题,现有方法通常仅提供整体评分而忽略具体指令要求,导致奖励偏差。其解决方案的关键在于从简单的评分器转向基于链式思维(Chain-of-Thought, CoT)的推理验证器(reasoning verifier),提出 Edit-R1 框架,构建一种基于推理的奖励模型(Reasoning Reward Model, RRM)。该 RRM 将编辑指令分解为不同原则,逐项评估图像是否符合这些原则,并聚合细粒度检查结果生成可解释的奖励信号,从而提升图像编辑任务的准确性和可控性。
链接: https://arxiv.org/abs/2604.27505
作者: Hanzhong Guo,Jie Wu,Jie Liu,Yu Gao,Zilyu Ye,Linxiao Yuan,Xionghui Wang,Yizhou Yu,Weilin Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a ``cold-start’’ to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
[CV-65] REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement CVPR2026
【速读】:该论文旨在解决从二维图像生成具有丰富体积信息的三维资产(3D assets)这一难题,尤其是在输入图像缺乏足够三维线索时,现有生成模型难以恢复真实感的三维结构。其核心解决方案是提出一个两阶段、可插拔的流水线REVIVE 3D:第一阶段通过膨胀前景轮廓(foreground silhouette)构建“膨胀先验”(Inflated Prior),以恢复全局体积并融合局部结构细节;第二阶段利用3D潜在空间精炼(3D Latent Refinement)技术,在先验的潜在表示中注入高斯噪声并进行去噪,从而借助预训练模型的几何先验知识实现高质量三维重建。该方法显著提升了在挑战性数据集上的三维生成性能,并支持图像条件下的三维编辑。
链接: https://arxiv.org/abs/2604.27504
作者: Hankyeol Lee,Wooyeol Baek,Seongdo Kim,Jongyoo Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior’s latent and then denoises it, using the prior’s geometric cues to leverage the backbone’s pretrained 3D knowledge. Furthermore, our framework supports image-conditioned 3D editing. To quantify volume and surface flatness, we propose Compactness and Normal Anisotropy. We validate Compactness and Normal Anisotropy through a user study, showing that these metrics align with human perception of volume and quality. We show that REVIVE 3D achieves state-of-the-art performance on a challenging flat image dataset, based on extensive qualitative and quantitative evaluations.
[CV-66] owards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark
【速读】:该论文旨在解决非结构化道路夜间自主驾驶中可见光感知不可靠的问题,以及现有单帧方法因帧间不一致性导致的时序自由空间检测性能受限问题。其关键解决方案是提出首个大规模全天候红外时序自由空间检测数据集IRON,并基于此构建了无需光流(flow-free)的新型框架IRONet,通过记忆注意力机制聚合历史上下文信息并设计精细化掩码解码器,有效缓解了帧间不一致问题,在实时推理下实现82.93% IoU和90.66% F1分数,且具备向RGB模态在ORFD和Rellis数据集上的良好泛化能力。
链接: https://arxiv.org/abs/2604.27499
作者: Shuo Wang,Jilin Mei,Wenfei Guan,Shuai Wang,Yan Xing,Chen Min,Yu Hu
机构: Research Center for Intelligent Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); Beijing Institute of Control Engineering (北京控制工程研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Off-road nighttime autonomous driving suffers from unreliable visible-light perception, making infrared modality crucial for accurate freespace detection. However, progress remains limited due to the scarcity of annotated infrared off-road datasets and the inter-frame inconsistencies inherent to current single-frame methods. To address these gaps, we present the IRON dataset, which, to our knowledge, is the first large-scale infrared dataset for off-road temporal freespace detection under all-day conditions, with strong support for nighttime perception. The dataset comprises 24,314 densely annotated infrared images with synchronized RGB images in diverse scenes and different light conditions. Building upon this dataset, we propose IRONet, a novel flow-free framework for temporal freespace detection that addresses inter-frame inconsistencies by aggregating historical context via a memory-attention mechanism and a carefully designed mask decoder. On our IRON dataset, IRONet achieves state-of-the-art performance, reaching 82.93%(+1.19%) IoU and 90.66%(+0.71%) F1 score at real-time inference. Remarkably, IRONet also exhibits robust generalization to RGB modalities on ORFD and Rellis datasets. Overall, our work establishes a foundation for reliable all-day off-road autonomous driving and future research in infrared temporal perception. The code and IRON dataset are available at this https URL.
[CV-67] Uni-HOI:A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction
【速读】:该论文旨在解决4D人体-物体交互(4D Human-Object Interaction, HOI)建模问题,即如何在统一框架下处理多种条件输入(如文本、人体运动和物体运动),以支持多样化的HOI相关任务。现有方法通常依赖于特定任务的架构,缺乏对多模态数据联合建模的能力。其解决方案的关键在于提出Uni-HOI框架,通过引入大型语言模型(Large Language Models, LLMs)与两个针对运动特性的矢量量化变分自编码器(Vector Quantized Variational Autoencoders, VQ-VAEs),将异构的运动数据转化为适配LLM输入的标记序列,从而实现文本、人体运动与物体运动三者的无缝集成与联合建模。此外,采用两阶段训练策略:第一阶段在大规模HOI数据集上进行多任务学习以捕捉三者间的潜在关联,第二阶段针对具体任务微调,显著提升了模型在多个HOI任务上的性能表现。
链接: https://arxiv.org/abs/2604.27491
作者: Mengfei Zhang,Jinlu Zhang,Zhigang Tu
机构: Wuhan University (武汉大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Modeling 4D human-object interaction (HOI) is a compelling challenge in computer vision and an essential technology powering virtual and mixed-reality applications. While existing works have achieved promising results on specific HOI tasks-such as text-conditioned HOI generation and human motion generation from object motion, they typically rely on task-specific architectures and lack a unified framework capable of handling diverse conditional inputs. Building on this, we propose Uni-HOI, a unified framework that learns the joint distribution among text, human motion, and object motion. By leveraging large language models (LLMs) and two motion-specific vector quantized variational autoencoders (VQ-VAEs), we convert heterogeneous motion data into token sequences compatible with LLM inputs, enabling seamless integration and joint modeling of all three modalities. We introduce a two-stage training strategy: the first stage performs multi-task learning on a large-scale HOI dataset to capture the underlying correlations among the three modalities, while the second stage fine-tunes the model on specific tasks to further enhance performance. Extensive experiments demonstrate that Uni-HOI achieves remarkable performances on multiple HOI-related tasks including text-driven HOI generation, object motion-driven human motion generation (optionally with text) and human motion-driven object motion prediction within a unified framework.
[CV-68] EdgeFM: Efficient Edge Inference for Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在工业边缘部署中面临的低延迟确定性要求与资源受限环境下的性能瓶颈问题,以及现有框架因依赖臃肿通用设计或封闭式硬件专有生态而导致的硬件锁定和跨平台适应性差的问题。其解决方案的关键在于提出EdgeFM——一个轻量级、由AI代理驱动的VLM/LLM推理框架,通过移除非必要功能以降低单请求延迟,并将代理优化的低级内核配置封装为可复用的模块化技能库,从而避免对闭源实现的依赖,显著缩小与专有工具链之间的性能差距;该框架原生支持x86和NVIDIA Orin SoC等主流平台,并首次实现了国产Horizon Journey平台上的端到端视觉语言代理(Vision-Language Agent, VLA)部署,验证了其优异的跨平台可移植性和推理性能,在多数场景下优于传统厂商特定工具链,最高可达TensorRT-Edge-LLM的1.49倍加速比。
链接: https://arxiv.org/abs/2604.27476
作者: Mengling Deng,Yuanpeng Chen,Sheng Yang,Wei Tao,Wenhai Zhang,Hui Song,Linyuanhao Qin,Kai Zhao,Xiaojun Ye,Shanhui Mo,Jingli Fan,Shuang Zhang,Bei Liu,Tiankun Zhao,Xiangjing An
机构: Go Further. AI; Fudan University; RUYi Dynamics Co. Ltd; Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technique Report version
Abstract:Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in limitation and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to 1.49 times speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.
[CV-69] LA-Pose: Latent Action Pretraining Meets Pose Estimation
【速读】:该论文旨在解决相机位姿估计(camera pose estimation)中对大量3D标注数据依赖的问题,当前主流方法多采用全监督训练,成本高昂且难以扩展。其解决方案的关键在于引入基于逆动力学(inverse-dynamics)的自监督预训练机制,通过学习视频帧间的隐式动作表示(latent action representations),将这些特征作为相机位姿估计器的输入,并在少量高质量3D标注数据上进行微调(fine-tune)。该方法被称为LA-Pose,不仅实现了高精度和强泛化能力,还保持了前向推理效率,实验表明在Waymo和PandaSet等驾驶基准上显著优于现有前馈方法,且所需标注数据量减少数个数量级。
链接: https://arxiv.org/abs/2604.27448
作者: Zhengqing Wang,Saurabh Nair,Prajwal Chidananda,Pujith Kachana,Samuel Li,Matthew Brown,Yasutaka Furukawa
机构: Wayve; Simon Fraser University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.
[CV-70] Context as Prior: Bayesian-Inspired Intent Inference for Non-Speaking Agents with a Household Cat Testbed CVPR2026
【速读】:该论文旨在解决非语言具身智能体(如家猫、前语言期婴儿等)在复杂环境中的意图推断问题,其核心挑战在于行为观测数据往往存在噪声或信息不足,而环境上下文虽能提供强先验信息,若使用不当则易导致模型依赖表面关联(shortcut learning),从而在模糊情境下产生错误预测。解决方案的关键在于提出一种受贝叶斯启发的多模态概率推理框架CatSignal,该框架将空间上下文建模为类似先验的约束,并将姿态动态和声学线索视为证据,通过上下文门控的专家产品(context-gated Product-of-Experts)机制联合融合多源信息,从而生成后验式的意图分布。实验表明,该方法在家庭猫多模态数据集上实现了77.72%的最高准确率,显著优于传统特征拼接和晚期融合基线,尤其在抑制上下文驱动的短路失败方面表现突出。
链接: https://arxiv.org/abs/2604.27445
作者: Wenqian Zhang,Zehao Wang
机构: University of California, Riverside (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the CVPR 2026 Animal Workshop
Abstract:Many agents in real-world environments cannot reliably communicate their goals through language, including household pets, pre-verbal infants, and other non-speaking embodied agents. In such settings, intent must be inferred from incomplete behavioral observations in context-rich environments. This creates a core ambiguity: observable behavior is often noisy or underspecified, while context provides strong prior information but can also induce brittle shortcut predictions if used naively. We present CatSignal, a Bayesian-inspired probabilistic framework for multimodal intent inference that models spatial context as a prior-like constraint and behavioral observations as evidence. Rather than treating context as an ordinary input feature, our method uses a context-gated Product-of-Experts formulation to compute posterior-like intent distributions from context, pose dynamics, and acoustic cues. We instantiate this formulation in a household cat setting as a focused proof-of-concept for intent inference in non-speaking agents. Under Leave-One-Video-Out evaluation on a multimodal domestic cat dataset, the proposed prior-guided fusion achieves the best overall accuracy of 77.72%, outperforming feature concatenation (71.83%) and stronger late-fusion baselines. More importantly, it substantially reduces context-driven shortcut failures in ambiguous cases. While simpler fusion strategies remain competitive in Macro-F1 and selective prediction, the proposed model provides the strongest overall accuracy and the best suppression of context-based shortcut collapse. Comments: Accepted to the CVPR 2026 Animal Workshop Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.27445 [cs.CV] (or arXiv:2604.27445v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.27445 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-71] Softmax-GS: Generalized Gaussians Learning When to Blend or Bound
【速读】:该论文旨在解决3D高斯点绘(3D Gaussian Splatting, 3D GS)在新视角合成中面临的两大核心问题:一是由于高斯分布之间存在重叠时导致的视图不一致性(view inconsistency),二是高斯函数固有的扩散边界特性对物体锐利边缘重建精度的限制。解决方案的关键在于提出Softmax-GS,通过引入基于softmax的竞争机制,在重叠区域强制两个高斯之间进行可学习的相互竞争,从而实现从平滑色彩混合到清晰边界定义的连续过渡;该方法显式保持任意两个重叠高斯间的顺序不变性,并确保输出透射率不受重叠程度影响,有效避免渲染结果中的不连续现象,显著提升重建质量和参数效率。
链接: https://arxiv.org/abs/2604.27437
作者: Chen Ziwen,Peng Wang,Hao Tan,Zexiang Xu,Li Fuxin
机构: Adobe Research(Adobe研究院); Tripo AI(Tripo AI); Hillbot(Hillbot); Oregon State University(俄勒冈州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3D GS) is widely adopted for novel view synthesis due to its high training and rendering efficiency. However, its efficiency relies on the key assumption that Gaussians do not overlap in the 3D space, which leads to noticeable artifacts and view inconsistencies. In addition, the inherently diffuse boundaries of Gaussians hinder accurate reconstruction of sharp object edges. We propose Softmax-GS, a unified solution that addresses both the view-inconsistency and the diffuse-boundary problem by enforcing a softmax-based competition in overlapping regions between two Gaussians. With learnable parameters controlling the strength of the competition, it enables a continuous spectrum from smooth color blending to crisp, well-defined boundaries. Our formulation explicitly preserves order invariance for any two overlapping Gaussians and ensures that the output transmittance remains unchanged irrespective of the extent of overlapping, preventing undesirable discontinuities in the rendered output. Ablation experiments on simple geometries demonstrate the effectiveness of each component of Softmax-GS, and evaluations on real-world benchmarks show that it achieves state-of-the-art performance, improving both reconstruction quality and parameter efficiency.
[CV-72] Sparse-View 3D Gaussian Splatting in the Wild
【速读】:该论文旨在解决在存在干扰物(distractors)的非受限真实场景中,从稀疏视图集合进行高质量三维重建与渲染的问题。现有方法或局限于约束条件下稀疏图像的视图合成,或依赖密集图像集来增强三维表示,难以应对实际环境中稀疏且无约束的数据。其解决方案的关键在于两个创新:一是引入基于扩散模型(diffusion model)的参考引导视图精化机制,利用瞬态掩码(transient mask)和参考图像优化3D表示并减少渲染伪影;二是通过伪视图生成结合稀疏感知的高斯复制策略,在高斯场稀疏区域增强表示能力,从而提升整体渲染质量。实验表明,该方法在PSNR、SSIM和LPIPS指标上显著优于现有方法,为无需人工密集采集数据的真实场景三维重建提供了可行路径。
链接: https://arxiv.org/abs/2604.27422
作者: Wongi Park,Jordan A. James,Myeongseok Nam,Minjae Lee,Soomok Lee,Sang-Hyun Lee,William J. Beksi
机构: Ajou Univerity(ajou大学); University of Texas at Arlington(德克萨斯大学阿灵顿分校); GenGenAI; Seoul National University(首尔国立大学); Kennesaw State University(肯尼索州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 14 figures, and 14 tables
Abstract:We propose a 3D novel sparse-view synthesis framework for unconstrained real-world scenarios that contain distractors. Unlike existing methods that primarily perform novel-view synthesis from a sparse set of constrained images without transient elements or leverage unconstrained dense image collections to enhance 3D representation in real-world scenarios, our method not only effectively tackles sparse unconstrained image collections, but also shows high-quality 3D rendering results. To do this, we introduce reference-guided view refinement with a diffusion model using a transient mask and a reference image to enhance the 3D representation and mitigate artifacts in rendered views. Furthermore, we address sparse regions in the Gaussian field via pseudo-view generation along with a sparsity-aware Gaussian replication strategy to amplify Gaussians in the sparse regions. Extensive experiments on publicly available datasets demonstrate that our methodology consistently outperforms existing methods (e.g., PSNR - 17.2%, SSIM - 10.8%, LPIPS - 4.0%) and provides high-fidelity 3D rendering results. This advancement paves the way for realizing unconstrained real-world scenarios without labor-intensive data acquisition. Our project page is available at \hrefthis https URLhere
[CV-73] Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis
【速读】:该论文旨在解决生成式 AI(Generative AI)在自动驾驶场景中使用视觉语言模型(Vision-Language Models, VLMs)时,其对物理对抗攻击的鲁棒性不足的问题,特别是不同VLM架构之间是否存在对抗样本的迁移性,这可能带来实际部署中的安全风险。解决方案的关键在于开展系统性的跨架构对抗迁移性研究,通过在道路基础设施上放置物理可实现的扰动贴片(patches),在人行横道和高速公路两种典型场景下评估三种代表性VLM架构(Dolphins、OmniDrive 和 LeapVAD)的攻击效果,结果表明攻击在不同架构间具有高迁移率(平均迁移率为0.815–0.833),且即使未针对目标模型优化,仍能在64.7%–79.4%的关键决策窗口内持续干扰模型输出,揭示了当前VLM驱动系统在物理空间中的脆弱性与广泛性风险。
链接: https://arxiv.org/abs/2604.27414
作者: David Fernandez,Pedram MohajerAnsari,Amir Salarpour,Mert D. Pese
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 9 pages, 2 figures. Accepted at SAE WCX 2026
Abstract:Vision-language models (VLMs) are increasingly used in autonomous driving because they combine visual perception with language-based reasoning, supporting more interpretable decision-making, yet their robustness to physical adversarial attacks, especially whether such attacks transfer across different VLM architectures, is not well understood and poses a practical risk when attackers do not know which model a vehicle uses. We address this gap with a systematic cross-architecture study of adversarial transferability in VLM-based driving, evaluating three representative architectures (Dolphins, OmniDrive, and LeapVAD) using physically realizable patches placed on roadside infrastructure in both crosswalk and highway scenarios. Our transfer-matrix evaluation shows high cross-architecture effectiveness, with transfer rates of 73-91% (mean TR = 0.815 for crosswalk and 0.833 for highway) and sustained frame-level manipulation over 64.7-79.4% of the critical decision window even when patches are not optimized for the target model.
[CV-74] COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理交错式图文上下文(interleaved image-text contexts)时缺乏系统性评估基准的问题。现有基准大多聚焦于单图或多图理解,难以衡量模型在真实场景(如文档阅读)中对细粒度图文对应关系的识别与推理能力。解决方案的关键在于提出COHERENCE基准,该基准涵盖四个代表性领域,包含6,161个高质量问题,专门用于量化评估MLLMs在交错式多模态上下文中恢复细粒度图文对应关系的能力,并通过六类错误分析实现对模型失败原因的精细化归因,从而揭示当前模型在特定能力上的缺失。
链接: https://arxiv.org/abs/2604.27389
作者: Bingli Wang,Huanze Tang,Haijun Lv,Zhishan Lin,Lixin Gu,Lei Feng,Qipeng Guo,Kai Chen
机构: Southeast University (东南大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodel contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts based on contextual this http URL, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.
[CV-75] VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching
【速读】:该论文旨在解决现有图像修饰(photo retouching)方法中依赖非可微分外部软件导致的优化障碍、参数冗余及泛化能力有限的问题。其关键解决方案在于提出一个轻量级且完全可微分的多任务图像修饰框架VeraRetouch:首先利用0.5B参数规模的视觉语言模型(Vision-Language Model, VLM)根据指令和场景语义生成修饰策略;其次设计了一个完全可微的修饰渲染器(Retouch Renderer),通过解耦控制潜变量实现光照、全局色彩和特定色彩调整的端到端像素级训练,替代传统外部工具;同时构建了首个百万级专业修饰数据集AetherRetouch-1M+以缓解数据稀缺问题,并引入DAPO-AE强化学习后训练策略提升自主美学认知能力,从而在保持模型轻量化的同时实现卓越的性能与移动端部署可行性。
链接: https://arxiv.org/abs/2604.27375
作者: Yihong Guo,Youwei Lyu,Jiajun Tang,Yizhuo Zhou,Hongliang Wang,Jinwei Chen,Changqing Zou,Qingnan Fan
机构: Zhejiang University (浙江大学); vivo BlueImage Lab; University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at this https URL.
[CV-76] DOT-Sim: Differentiable Optical Tactile Simulation with Precise Real-to-Sim Physical Calibration ICRA2026
【速读】:该论文旨在解决光学触觉传感器(Optical Tactile Sensor)在物理仿真中因高柔性和复杂光学特性带来的建模难题。传统模拟器依赖简化的变形模型,难以准确再现真实传感器的非线性大变形行为及光学响应。其解决方案的关键在于提出DOT-Sim:一种基于材料点法(Material Point Method, MPM)的可微分光学触觉仿真框架,通过将软体传感器建模为弹性材料实现物理层面的精确模拟,并引入残差图像学习机制来高效生成与真实场景一致的光学输出。该方法可在数分钟内完成小样本演示驱动的快速校准,显著优于现有基线方法,并在零样本仿真到现实迁移任务中验证了其物理和视觉真实性。
链接: https://arxiv.org/abs/2604.27367
作者: Yang You,Won Kyung Do,Aiden Swann,Rika Antonova,Monroe Kennedy,Leonidas Guibas
机构: Stanford University (斯坦福大学); University of Cambridge (剑桥大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted at ICRA 2026
Abstract:Simulating optical tactile sensors presents significant challenges due to their high deformability and intricate optical properties. To address these issues and enable a physically accurate simulation, we propose DOT-Sim: Differentiable Optical Tactile Simulation. Unlike prior simulators that rely on simplified models of deformable sensors, DOT-Sim accurately captures the physical behavior of soft sensors by modeling them as elastic materials using the Material Point Method (MPM). DOT-Sim enables rapid calibration of optical tactile sensor simulation using a small number of demonstrations within minutes, which is substantially faster than existing methods. Compared to current baselines, our approach supports much larger and non-linear deformations. To handle the optical aspect, we propose a novel approach to simulating optical responses by learning a residual image relative to the real-world idle state. We validate the physical and visual realism of our method through a series of zero-shot sim-to-real tasks. Our experiments show that DOT-Sim (1) accurately replicates the physical dynamics of a DenseTact optical tactile sensor in reality, (2) generates realistic optical outputs in contact-rich scenarios, (3) enables direct deployment of simulation-trained classifiers in the real world, achieving 85% classification accuracy on challenging objects and 90% accuracy in embedded tumor-type detection, and (4) allows precise trajectory following with a policy trained from demonstrations in simulation, with an average error of less than 0.9 mm.
[CV-77] Judge Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving
【速读】:该论文旨在解决当前基于视觉-语言-动作(Vision-Language-Action, VLA)模型的自动驾驶方法在复杂闭环场景中性能受限的问题,其核心在于未能充分利用VLA模型在其他大语言模型(Large Language Model, LLM)领域已被验证的“评判能力”(critic capability)来优化驾驶决策。解决方案的关键在于提出一个理论驱动的两阶段框架CriticVLA:第一阶段生成初步轨迹,第二阶段通过基于VLA的批评者(critic)对轨迹进行多模态评估与单步优化,从而实现更高品质的驾驶行为。该方法依赖于一个包含1290万条标注轨迹的大规模合成数据集,显著提升了批评者的推理与修正能力,最终在Bench2Drive基准上的闭环实验中实现了73.33%的总成功率,并在挑战性场景中获得约30%的性能提升。
链接: https://arxiv.org/abs/2604.27366
作者: Lijin Yang,Jianing Huang,Zhongzhan Huang,Shu Liu,Hao Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint
Abstract:Recent advances in vision language action (VLA) models have shown remarkable potential for autonomous driving by directly mapping multimodal inputs to control signals. However, previous VLA-based methods have not explicitly exploited the critic capability of VLAs to refine driving decisions, even though such capability has been well demonstrated in other LLM-based domains, thereby limiting their performance in complex closed-loop scenarios. In this work, we present a theoretically inspired two-stage framework, CriticVLA, which extends the role of VLAs from acting to judging. CriticVLA first generates a rough trajectory and then refines it through multimodal evaluation and single-step optimization guided by a VLA-based critic, yielding higher-quality driving behaviors. To support this process, we construct a large-scale synthetic dataset of 12.9 million annotated trajectories covering diverse driving scenarios, which enhances the critic’s reasoning and refinement abilities. Extensive closed-loop experiments on the Bench2Drive benchmark show that CriticVLA significantly surpasses state-of-the-art baselines, achieving a 73.33% total success rate and delivering about 30% improvement in challenging scenarios.
[CV-78] Hyperspectral Image Classification via Efficient Global Spectral Supertoken Clustering
【速读】:该论文旨在解决高光谱图像分类中空间一致性与边界精确性难以兼顾的问题。现有基于超像素的方法存在固有矛盾:聚类将相似像素聚合为区域,但后续分类器仍以像素为单位进行预测,导致区域层面的一致性无法保障,进而影响边界对齐的准确性。解决方案的关键在于提出双阶段谱约束聚类分类器(DSCC),其核心是通过显式解耦聚类与分类过程,首先基于光谱相似性和空间邻近性生成谱级超令牌(spectral supertokens),再在令牌级别进行预测;同时引入多准则特征距离计算与局部感知分配正则化,确保生成的超令牌保留边界信息,并通过密度-隔离中心选择机制获得代表性且分离良好的聚类中心,从而提升对尺度变化的鲁棒性与冗余抑制能力。此外,设计软标签方案编码类别比例,增强对混合地物类型的适应性。
链接: https://arxiv.org/abs/2604.27364
作者: Peifu Liu,Tingfa Xu,Jie Wang,Huan Chen,Huiyan Bai,Jianan Li
机构: Beijing Institute of Technology (北京理工大学); Beijing Institute of Technology Chongqing Innovation Center (北京理工大学重庆创新中心); Key Laboratory of Photoelectronic Imaging Technology and System, Ministry of Education of China (光电成像技术与系统教育部重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISPRS JPRS 2026. This manuscript version is made available under the CC-BY-NC-ND 4.0 license
Abstract:Hyperspectral image classification demands spatially coherent predictions and precise boundary delineation. Yet prevailing superpixel-based methods face an inherent contradiction: clustering aggregates similar pixels into regions, but the subsequent classifier operates pixel-wise, undermining regional consistency. Consequently, existing approaches do not guarantee region-level, boundary-aligned classification. To address this limitation, we propose the Dual-stage Spectrum-Constrained Clustering-based Classifier (DSCC), an end-to-end framework that explicitly decouples clustering from classification by first grouping spectral similar and spatially proximate pixels into spectral supertokens and then performing token-level prediction. At its core, DSCC computes an image-level multi-criteria feature distance between pixels and centers, followed by a locality-aware assignment regularization, enabling the generation of boundary-preserving spectral supertokens. A density-isolation based center selection further yields representative, well-separated centers, reducing redundancy and improving robustness to scale variation. To accommodate mixed land-cover compositions within each token, we introduce a soft-label scheme that encodes class proportions and improves robustness for mixed-class tokens. DSCC attains a CF1 of 0.728 at 197.75 FPS on the WHU-OHS dataset, offering a superior accuracy-efficiency trade-off compared with state-of-the-art methods. Extensive experiments further validate the effectiveness and generality of the proposed dual-stage paradigm for hyperspectral image classification. The source code is available at this https URL.
[CV-79] CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling
【速读】:该论文旨在解决3D室内场景生成中因数据稀缺导致的全局建筑约束难以满足与局部语义一致性难以保证的问题,同时克服现有方法在处理密集关系图时引入冗余生成误差的局限性。其解决方案的关键在于提出一种级联扩散框架CasLayout,通过将联合场景生成任务分解为四个具有明确物理和语义角色的条件子阶段:(1)预测家具数量与类别,(2)优化物体尺寸与特征嵌入,(3)在潜在空间中建模空间关系,(4)生成定向边界框(Oriented Bounding Boxes, OBBs)。该架构显著降低数据需求,并支持灵活集成大语言模型(Large Language Models, LLMs)和视觉语言模型(Vision Language Models, VLMs)以实现零样本图像到场景生成;此外,通过显式建模墙体、门、窗等建筑元素作为条件约束,并采用稀疏关系图结构结合双向变分自编码器(Bidirectional Variational Autoencoder, VAE)编码至紧凑潜在空间,有效提升了布局的功能组织合理性与关系可控性。
链接: https://arxiv.org/abs/2604.27361
作者: Yingrui Wu,Youkang Kong,Mingyang Zhao,Weize Quan,Dong-Ming Yan,Yang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: SIGGARPH 2026 (Journal Track), Code: this https URL
Abstract:Synthesizing realistic 3D indoor scenes remains challenging due to data scarcity and the difficulty of simultaneously enforcing global architectural constraints and local semantic consistency. Existing approaches often overlook structural boundaries or rely on fully connected relation graphs that introduce redundant generation errors. Inspired by human design cognition, we present CasLayout, a cascaded diffusion framework that decomposes the joint scene generation task into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and enables flexible integration of Large Language Models (LLMs) and Vision Language Models (VLMs) for zero-shot tasks such as image-to-scene generation. To maintain physical validity within complex floor plans, we explicitly model building elements (e.g., walls, doors, and windows) as conditional constraints. Furthermore, to address the high entropy of dense relation graphs, we introduce a sparse relation graph formulation aligned with human spatial descriptions. By encoding these sparse graphs into a compact latent space using a bidirectional Variational Autoencoder (VAE), the proposed framework provides enhanced relational controllability, allowing generated layouts to better respect functional organization. Experiments demonstrate that CasLayout achieves state-of-the-art performance in fidelity and diversity while enabling improved controllability in practical applications.
[CV-80] AG-TAL: Anatomically-Guided Topology-Aware Loss for Multiclass Segmentation of the Circle of Willis Using Large-Scale Multi-Center Datasets
【速读】:该论文旨在解决脑血管疾病管理中Circle of Willis (CoW)多类别分割的准确性问题,其挑战主要源于血管拓扑结构复杂及形态变异大,现有深度学习方法常出现血管断裂和类间误分类,且传统拓扑损失函数在三维多类别场景下计算成本过高。解决方案的关键在于提出一种解剖引导的拓扑感知损失函数(Anatomically-Guided Topology-Aware Loss, AG-TAL),该损失函数由三部分组成:基于半径感知的Dice损失以缓解小血管类不平衡问题;基于断点感知的clDice损失利用组卷积高效保持局部连通性;以及基于邻接关系的共现损失,通过引入解剖先验强制相邻动脉间的边界清晰分离。实验表明,AG-TAL在多个独立数据集上均显著提升小血管分割性能,平均Dice分数达80.85%,并展现出良好的泛化能力和临床应用潜力。
链接: https://arxiv.org/abs/2604.27357
作者: Jialu Liu,Yue Cui,Shan Yu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology (脑认知与脑启发智能技术国家重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); School of Future Technology, University of Chinese Academy of Sciences (中国科学院大学未来技术学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, submitted to IEEE JBHI
Abstract:Accurate multiclass segmentation of the Circle of Willis (CoW) is essential for neurovascular disease management but remains challenging due to complex vascular topology and variable morphology. Existing deep learning methods often suffer from vascular discontinuities and inter-class misclassification, while current topological loss functions incur prohibitive computational costs in 3D multiclass settings. To address these limitations, we propose an Anatomically-Guided Topology-Aware Loss (AG-TAL) and introduce a large-scale, multi-center CoW dataset with unified annotations to facilitate robust model training. AG-TAL specifically integrates a radius-aware Dice loss to address class imbalance in small vessels, a breakage-aware clDice loss that utilizes group convolutions to efficiently preserve local connectivity, and an adjacency-aware co-occurrence loss that leverages anatomical priors to enforce distinct boundaries between neighboring arteries. Evaluated using 5-fold cross-validation, AG-TAL achieved an average Dice score of 80.85% for all CoW arteries, with small arteries notably higher by 1.05-3.09% compared to state-of-the-art methods. Across six independent datasets, the performance of AG-TAL achieved Dice scores ranging from 74.46% to 81.17% for all CoW arteries, with improvements of 2.20% to 9.98% for small arteries compared to other methods. This study demonstrates the superiority of AG-TAL in identifying multiclass CoW arteries and its ability to generalize well to multiple independent datasets. Furthermore, reliability analyses and clinical applications in an Alzheimer’s disease cohort validate the AG-TAL’s robustness and its potential for discovering imaging-based morphological biomarkers.
[CV-81] Gait Recognition via Deep Residual Networks and Multi-Branch Feature Fusion
【速读】:该论文旨在解决现有步态识别方法在面对视角变化、衣物更换和携带状态等协变量干扰时,难以全面捕捉和利用人体运动中蕴含的丰富生物特征信息的问题。其解决方案的关键在于提出一种基于深度残差学习的多分支架构,通过高分辨率网络(HRNet)实现鲁棒的骨骼关键点估计,并构建体形比例、步态速度与骨骼运动三个互补特征分支;进一步设计了受通道注意力机制启发的多分支特征融合(MFF)模块,动态分配各分支贡献权重,从而有效整合异构特征流,显著提升识别精度,在CASIA-B跨视角多条件基准上实现了94.52%的Rank-1准确率,优于当前基于骨架的方法。
链接: https://arxiv.org/abs/2604.27353
作者: Yabo Luo,Xiaoyun Wang,Cunrong Li
机构: Osh State University (奥什州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gait recognition has emerged as a compelling biometric modality for surveillance and security applications, offering inherent advantages such as non-intrusiveness, resistance to disguise, and long-range identification capability. However, prevailing approaches struggle to comprehensively capture and exploit the rich biometric cues embedded in human locomotion, particularly under covariate interference including viewpoint variation, clothing change, and carrying conditions. In this paper, we present a high-precision gait recognition framework that deeply extracts and synergistically fuses gait dynamics with body shape characteristics through a multi-branch architecture grounded in deep residual learning. Specifically, we first employ the High-Resolution Network (HRNet) to perform robust skeletal keypoint estimation, preserving fine-grained spatial information even under low-resolution inputs. We then construct three complementary feature branches – body proportion, gait velocity, and skeletal motion – from the extracted pose sequences. A 50-layer Residual Network (ResNet-50) backbone is leveraged within a deep feature extraction module to capture hierarchically rich and discriminative representations. To effectively integrate heterogeneous feature streams, we design a Multi-Branch Feature Fusion (MFF) module inspired by channel-wise attention mechanisms, which dynamically allocates contribution weights across branches through learned activation parameters. Extensive experiments on the cross-view multi-condition CASIA-B benchmark demonstrate that our method achieves a Rank-1 accuracy of 94.52% under normal walking, with the best recognition performance among skeleton-based methods for the coat-wearing condition.
[CV-82] JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification
【速读】:该论文旨在解决现有皮肤病变分类系统过度依赖皮肤镜图像(dermoscopic images),而忽视临床实践中常规获取的多模态证据(如临床照片和结构化患者元数据)的问题。其解决方案的关键在于提出一种名为JI-ADF的三模态深度学习框架,通过联合多模态表征学习、模态特定的辅助监督以及自适应决策融合机制,在样本层面动态校准各模态贡献;同时引入多模态融合注意力(multimodal fusion attention, MMFA)模块以增强跨模态推理并保留模态特异性证据,从而实现更可靠且临床意义明确的皮肤病变分类。
链接: https://arxiv.org/abs/2604.27343
作者: Phan Nguyen,Dat Cao,Quang Hien Kha,Hien Chu,Minh H. N. Le,Trang Quoc Thao Pham,Nguyen Quoc Khanh Le
机构: KAIST(韩国科学技术院); Taipei Medical University(台北医学大学); Yale University(耶鲁大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Skin lesion classification is essential for early dermatological diagnosis, yet many existing computer-aided systems rely primarily on dermoscopic images and underutilize the multimodal evidence routinely available in clinical practice. To address this gap, we propose \textbfJI-ADF, a trimodal deep learning framework that integrates dermoscopic images, clinical photographs, and structured patient metadata for clinically grounded skin lesion classification. The proposed architecture combines joint multimodal representation learning with modality-specific auxiliary supervision and an adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis. To enhance cross-modal reasoning while preserving modality-specific evidence, we further introduce a multimodal fusion attention (MMFA) module. We evaluate JI-ADF on the large-scale MILK10k benchmark, which reflects real-world clinical acquisition conditions and severe class imbalance. The proposed method demonstrates strong and well-balanced performance across lesion categories, improving sensitivity and Dice score while maintaining high specificity and good calibration. Extensive analyses, including modality ablation, calibration evaluation, and Grad-CAM visualization, further confirm the robustness and clinically meaningful behavior of the model. These results indicate that JI-ADF provides a reliable and practical foundation for multimodal skin lesion classification in real-world clinical settings.
[CV-83] Iterative Definition Refinement for Zero-Shot Classification via LLM -Based Semantic Prototype Optimization CVPR
【速读】:该论文旨在解决嵌入式零样本(zero-shot)网页内容分类在实际应用中因类别定义质量不佳而导致系统性误分类的问题。随着现代网络内容的动态演化,传统依赖标注数据的分类方法难以适应,而现有基于嵌入空间的零样本方法虽无需训练数据,但其性能高度依赖于类别描述词的质量——模糊或重叠的定义会在嵌入空间中引发语义混淆,进而降低分类准确性。论文提出一种无需训练、可迭代优化类别定义的自适应框架,核心创新在于不更新模型参数,而是利用大语言模型(LLM)作为反馈驱动的定义优化器,通过三种策略(示例引导、混淆感知和历史感知)对类别描述进行结构化迭代优化,从而提升分类性能。实验证明,该方法在13种主流嵌入基础模型上均能稳定改进效果,凸显了定义质量作为嵌入式系统关键因素的重要性。
链接: https://arxiv.org/abs/2604.27335
作者: Naeem Rehmat,Muhammad Saad Saeed,Ijaz Ul Haq,Khalid Malik
机构: University of Michigan-Flint (密歇根大学弗林特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR NeXD Workshop (2026)
Abstract:Web filtering systems rely on accurate web content classification to block cyber threats, prevent data exfiltration, and ensure compliance. However, classification is increasingly difficult due to the dynamic and rapidly evolving nature of the modern web. Embedding-based zero-shot approaches map content and category descriptions into a shared semantic space, enabling label assignment without labeled training data, but remain highly sensitive to definition quality. Poorly specified or ambiguous definitions create semantic overlap in the embedding space, leading to systematic misclassification. In this paper, we propose a training-free, adaptive iterative definition refinement framework that improves zero-shot web content classification by progressively optimizing category definitions rather than updating model parameters. Using LLMs as feedback-driven definition optimizers, we investigate three refinement strategies namely example-guided, confusion-aware, and history-aware, each refining class descriptions using structured signals from misclassified instances. Furthermore, we introduce a human-labeled benchmark of 10 URL categories with 1,000 samples per class and evaluate across 13 state-of-the-art embedding foundation models. Results demonstrate that iterative definition refinement consistently improves classification performance across diverse architectures, establishing definition quality as a critical and underexplored factor in embedding-based systems. The dataset is available at this https URL. Comments: Accepted at CVPR NeXD Workshop (2026) Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.27335 [cs.CV] (or arXiv:2604.27335v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.27335 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-84] SQuadGen: Generating Simple Quad Layouts via Chart Distance Fields SIGGRAPH2026
【速读】:该论文旨在解决3D形状在扫描、重建或AI生成后缺乏简单四边形网格布局(simple quad mesh layouts)的问题,而此类布局对于高效编辑和建模至关重要。现有四边形重网格化(quad-remeshing)方法通常生成包含复杂不规则环(irregular loops)的布局,导致人工清理繁琐且算法调优耗时。解决方案的关键在于提出SQuadGen,一种基于扩散机制(diffusion-based)的生成框架,其核心创新是引入Chart Distance Fields(CDF)——一种连续的基于表面的表示方法,以克服网格拓扑离散性对学习的阻碍;同时定义了环感知的简洁性度量(loop-aware simplicity metrics),并构建了一个大规模高质量四边形布局数据集,通过稳健的四边形恢复流水线从公开3D资源中提取。实验表明,SQuadGen能稳定生成艺术家友好的简单四边形布局,显著优于现有方法。
链接: https://arxiv.org/abs/2604.27329
作者: Youkang Kong,Yang Liu,Yue Dong,Xin Tong,Heung-Yeung Shum
机构: Tsinghua University (清华大学); Microsoft Research Asia (微软亚洲研究院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH 2026 (Journal Track), project page: this https URL
Abstract:3D shapes from scanning, reconstruction, or AI-generated content often lack simple quad mesh layouts – critical for efficient editing and modeling. Existing quad-remeshing techniques typically produce complex layouts with irregular loops, leading to tedious manual cleanup and extensive algorithm tuning. We introduce SQuadGen, a diffusion-based generative framework that leverages Chart Distance Fields (CDF) to synthesize simple quad layouts on 3D shapes. Our approach addresses two key challenges: (1) the discrete nature of mesh connectivity, which hinders learning, and (2) the scarcity of large-scale datasets with simple quad meshes. To overcome the first, we propose CDF, a continuous surface-based representation enabling effective learning and synthesis of quad layouts. To address the second, we define loop-aware simplicity metrics and construct a large-scale dataset of high-quality quad layouts recovered from public 3D repositories through a robust quad-recovery pipeline. Extensive evaluations across diverse 3D inputs show that SQuadGen consistently outperforms existing methods, producing robust, artist-friendly simple quad layouts.
[CV-85] YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal CVPR2026
【速读】:该论文旨在解决基于扩散 Transformer (Diffusion Transformer, DiT) 的视频对象移除技术中存在的推理延迟过高问题,尤其是在处理仅局部遮罩区域时仍需对整个时空标记空间进行密集计算的低效现象。解决方案的关键在于提出 YOSE(You Only Select Essential Tokens)框架,其核心创新包括两个组件:一是可微分的动态索引操作符 Batch Variable-length Indexing (BVI),它根据遮罩信息自适应选择必要标记,实现跨样本的变长标记处理;二是扩散过程模拟模块 Diffusion Process Simulator (DiffSim),通过近似未遮罩区域在 DiT 自注意力机制中的影响,保持遮罩区域内语义一致性。该设计使推理时间与遮罩区域大小呈近似线性关系,显著优于传统全标记扩散方法的恒定计算量,从而在多数情况下实现最高达 2.5 倍的速度提升,同时保持与基线相当的视觉质量。
链接: https://arxiv.org/abs/2604.27322
作者: Chenyang Wu,Lina Lei,Fan Li,Chun-Le Guo,Dehong Kong,Xinran Qin,Zhixin Wang,Ming-Ming Cheng,Chongyi Li
机构: Nankai University (南开大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR2026
Abstract:Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10FPS, primarily due to dense computations over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE, You Only Select Essential Tokens, an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and Diffusion Process Simulator (DiffSim) Module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the masked regions, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5X speedup in 70% of cases while maintaining visual quality comparable to the baseline. Code is available at: this https URL.
[CV-86] PINN-Cast: Exploring the Role of Continuous-Depth NODE in Transformers and Physics Informed Loss as Soft Physical Constraints in Short-term Weather Forecasting CCS2026
【速读】:该论文旨在解决传统物理驱动的数值天气预报(Numerical Weather Prediction, NWP)计算成本高、流程复杂,以及现有基于Transformer的数据驱动模型缺乏物理一致性的问题。其关键解决方案是提出一种连续深度Transformer编码器(continuous-depth transformer encoder),在每个编码器块中引入神经微分方程(Neural Ordinary Differential Equation, Neural ODE)动力学,以替代离散层更新,从而更好地建模平滑的潜在动态;同时设计了一个双分支注意力模块,融合常规的patch-wise自注意力与对注意力logits施加导数算子的辅助分支,增强对变化敏感的交互信号;此外,通过定制化的物理信息训练目标(physics-informed training objective)将物理一致性作为软约束嵌入优化过程,提升模型在短时天气预报中的准确性与物理合理性。
链接: https://arxiv.org/abs/2604.27313
作者: Hira Saleem,Flora Salim,Cormac Purcell
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 Figures, Accepted in 26th International Conference on Computational Science (ICCS 2026)
Abstract:Operational weather prediction has long relied on physics-based numerical weather prediction (NWP), whose accuracy comes at the cost of substantial compute and complex simulation workflows. Recent transformer-based forecasters offer efficient data-driven alternatives, however transformers are physics-agnostic models. Additionally, standard transformer encoders evolve representations through discrete layer updates that may be less suited to modeling smooth latent dynamics. In this work, we propose a continuous-depth transformer encoder for weather forecasting that integrates Neural Ordinary Differential Equation (Neural ODE) dynamics within each encoder block. Specifically, we replace discrete residual updates with ODE-based updates solved using adaptive numerical integration. We also introduce a two-branch attention module that combines conventional patch-wise self-attention with an auxiliary branch that applies a derivative operator to attention logits, providing an additional change-sensitive interaction signal. To further align forecasts with governing principles, we propose a customized physics-informed training objective that enforces physical consistency as a soft constraint. We evaluate the proposed method against a standard discrete transformer baseline and an existing continuous-time Neural ODE forecasting variant, demonstrating the importance of PINN-Cast in short term weather forecasting.
[CV-87] Student Classroom Behavior Recognition Based on Improved YOLOv8s
【速读】:该论文旨在解决真实课堂场景中学生行为识别面临的挑战,包括目标密度高、小物体多、遮挡频繁以及类别分布不均衡等问题。其解决方案的关键在于提出一种改进的YOLOv8s模型(ALC-YOLOv8s),通过引入SPPF-LSKA增强上下文特征提取能力,采用CFC-CRB与SFC-G2优化多尺度特征融合机制,并结合ATFLoss提升对少数类和难样本的学习能力,从而显著提升了复杂课堂环境下学生行为识别的准确率与鲁棒性。
链接: https://arxiv.org/abs/2604.27293
作者: Xiang Gao,Shuai Hang
机构: Shaanxi Normal University (陕西师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:In classroom teaching, student behavior can reflect their learning state and classroom participation, which is of great significance for teaching quality analysis. To address the problems of dense student targets, numerous small objects, frequent occlusions, and imbalanced class distribution in real classroom scenes, this paper proposes an improved student classroom behavior recognition model named ALC-YOLOv8s based on YOLOv8s. The model introduces SPPF-LSKA to enhance contextual feature extraction, employs CFC-CRB and SFC-G2 to optimize multi-scale feature fusion, and incorporates ATFLoss to improve the learning ability for minority classes and hard samples. Experimental results show that compared with the baseline model, the improved model achieves increases of 1.8% in mAP50 and 2.1% in mAP50-95. Compared with several mainstream detection methods, the proposed model can well meet the requirements of automatic student behavior recognition in complex classroom scenarios.
[CV-88] BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
【速读】:该论文旨在解决当前基于学习的脑部磁共振成像(MRI)分析方法普遍存在的任务特异性问题,以及对大量标注数据的高度依赖。为应对这一挑战,作者提出了一种基于自监督学习的统一表征框架——BrainDINO,其关键在于通过在约660万张未标注的轴向脑部MRI切片上进行自蒸馏训练,构建一个可跨多种神经影像任务通用的基础模型。该模型采用冻结编码器配合轻量级任务头的方式,在无需体积重建预训练或全网络微调的前提下,实现了从肿瘤分割到脑龄估计、卒中后时间预测、分子状态预测等多个下游任务上的高效迁移,尤其在标签稀缺场景下表现显著优于自然图像和MRI专用的自监督基线方法。
链接: https://arxiv.org/abs/2604.27277
作者: Yizhou Wu,Shansong Wang,Yuheng Li,Mojtaba Safari,Mingzhe Hu,Chih-Wei Chang,Harini Veeraraghavan,Xiaofeng Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 5 figures
Abstract:Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, neurodegenerative and neurodevelopmental conditions classification, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis.
[CV-89] VTBench: A Multimodal Framework for Time-Series Classification with Chart-Based Representations
【速读】:该论文旨在解决时间序列分类(Time-series Classification, TSC)中现有深度学习模型仅依赖原始数值输入、忽视可视化表示潜力的问题。当前虽有基于纹理的编码方法(如Gramian Angular Fields, GAF 和 Recurrence Plots, RP),但其预处理复杂且可解释性弱;而图表类可视化虽具直观优势,却缺乏系统评估与有效融合机制。解决方案的关键在于提出VTBench框架,通过生成轻量级、人类可解释的图表(线图、面积图、柱状图和散点图)与原始序列进行多模态融合,构建模块化架构支持单图表视觉-数值融合、多图表视觉融合及全模态融合策略,从而在31个UCR数据集上验证了不同图表类型组合与融合方式对性能的影响,为可解释且高效的时序分类提供了统一基础。
链接: https://arxiv.org/abs/2604.27259
作者: Madhumitha Venkatesan,Xuyang Chen,Dongyu Liu
机构: University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages main text
Abstract:Time-series classification (TSC) has advanced significantly with deep learning, yet most models rely solely on raw numerical inputs, overlooking alternative representations. While texture-based encodings such as Gramian Angular Fields (GAF) and Recurrence Plots (RP) convert time series into 2D images, they often require heavy preprocessing and yield less intuitive representations. In contrast, chart-based visualizations offer more interpretable alternatives and show promise in specific domains; however, their effectiveness remains underexplored, with limited systematic evaluation across chart types, visual encoding choices, and datasets. In this work, we introduce VTBench, a systematic and extensible framework that re-examines TSC through multimodal fusion of raw sequences and chart-based visualizations. VTBench generates lightweight, human-interpretable plots – line, area, bar, and scatter, providing complementary views of the same signal. We develop a modular architecture supporting multiple fusion strategies, including single-chart visual-numerical fusion, multi-chart visual fusion, and full multimodal fusion with raw inputs. Through experiments across 31 UCR datasets, we show that: (1) chart-only models are competitive in selected settings, particularly on smaller datasets; (2) combining multiple chart types can improve accuracy by capturing complementary visual cues; and (3) multimodal models improve or maintain performance when visual features provide non-redundant information, but may degrade accuracy when they introduce redundancy. We further distill practical guidelines for selecting chart types, fusion strategies, and configurations. VTBench establishes a unified foundation for interpretable and effective multimodal time-series classification.
[CV-90] owards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany
【速读】:该论文旨在解决线状木本植被(linear woody features)在大尺度、系统性制图中面临的可迁移性和可复用性方法论挑战,尤其是在不同传感器类型、空间分辨率、数据获取条件及复杂景观异质性背景下。其解决方案的关键在于提出一个模块化工作流:第一部分是灵活的输入数据接口,能够将异构的地球观测数据统一转化为二值木本植被掩膜;第二部分是经过训练的深度神经网络,用于从掩膜中区分线状与非线状形态结构。该设计实现了仅用单一训练模型即可处理多种遥感数据源并生成国家尺度制图结果,且无需重新训练,从而提升了方法的通用性与扩展性。
链接: https://arxiv.org/abs/2604.27247
作者: Thorsten Hoeser,Verena Huber-Garcia,Sarah Asam,Ursula Gessner,Claudia Kuenzer
机构: German Aerospace Center (DLR), Earth Observation Center (EOC); University of Wuerzburg, Institute for Geography and Geology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 14 figures
Abstract:Hedges and other linear woody features provide valuable ecosystem services, particularly within intensively managed agricultural landscapes. They are key elements for climate adaptation and biodiversity amongst others not only due to a largely varying flora, but also as a feeding-, resting-, and nesting place for many animals and insects including valuable pollinators. Therefore, they require dedicated management, preservation, and attention. Thus, systematic and large-scale mapping of these features from Earth observation data is of high importance. However, transferable and reusable workflows for linear woody feature mapping remain a key methodological challenge, given the diversity of sensor types, spatial resolutions, data acquisition conditions, and complex landscape variability encountered across study areas. We introduce a modular workflow built around two independently optimizable components. Firstly, a flexible input data interface that consolidates heterogeneous Earth observation data into a binary woody vegetation mask, and secondly, a deep neural network trained to separate linear from non-linear shapes within these masks. We demonstrate the workflow by deriving three national-scale linear woody feature maps for all of Germany from three input sources by using a single trained model without retraining. Evaluation against refined reference data from four federal state biotope mapping campaigns and comparison with two existing linear woody feature maps demonstrate that the workflow produces competitive results across all evaluation sites on a national level. The modular design and its demonstrated applicability at national scale provide a foundation for scalable and generalizable linear woody feature mapping beyond Germany.
[CV-91] AttriBE: Quantifying Attribute Expressivity in Body Embeddings for Recognition and Identification
【速读】:该论文旨在解决人像再识别(Person Re-identification, ReID)系统在非受控场景下因性别、姿态和身体质量指数(Body Mass Index, BMI)等属性干扰而导致的公平性和泛化能力不足的问题。解决方案的关键在于引入“表达性”(expressivity)这一概念,通过构建一个辅助神经网络来量化学习特征与特定属性之间的互信息(mutual information),从而系统地分析Transformer-based ReID模型中不同属性在各层次嵌入中的编码强度。研究发现,BMI在深层特征中表达性最强,且跨光谱(如短波、中波和长波红外)条件下,姿态(pitch)的表达性显著增强,表明模型对结构线索的依赖随模态差异增大而提升,揭示了Transformer嵌入中隐含属性的层级编码机制。
链接: https://arxiv.org/abs/2604.27218
作者: Basudha Pal,Siyuan Huang,Anirudh Nanduri,Zhaoyang Wang,Rama Chellappa
机构: Johns Hopkins University (约翰霍普金斯大学); University of Maryland (马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Person re-identification (ReID) systems that match individuals across images or video frames are essential in many real-world applications. However, existing methods are often influenced by attributes such as gender, pose, and body mass index (BMI), which vary in unconstrained settings and raise concerns related to fairness and generalization. To address this, we extend the notion of expressivity, defined as the mutual information between learned features and specific attributes, using a secondary neural network to quantify how strongly attributes are encoded. Applying this framework to three transformer-based ReID models on a large-scale visible-spectrum dataset, we find that BMI consistently shows the highest expressivity in deeper layers. Attributes in the final representation are ranked as BMI Pitch Gender Yaw, and expressivity evolves across layers and training epochs, with pose peaking in intermediate layers and BMI strengthening with depth. We further extend the analysis to cross-spectral person identification across infrared modalities including short-wave, medium-wave, and long-wave infrared. In this setting, pitch becomes comparable to BMI and attribute trends increase monotonically across depth, suggesting increased reliance on structural cues when bridging modality gaps. Overall, the results show that transformer-based ReID embeddings encode a hierarchy of implicit attributes, with morphometric information persistently embedded and pose contributing more strongly under cross-spectral conditions.
[CV-92] HQ-UNet: A Hybrid Quantum-Classical U-Net with a Quantum Bottleneck for Remote Sensing Image Segmentation
【速读】:该论文旨在解决遥感图像语义分割中传统深度学习模型(如U-Net)因参数量庞大而难以高效建模复杂空间关系的问题,同时探索在近中期量子硬件约束下,如何将量子机器学习(Quantum Machine Learning, QML)有效应用于高维图像数据。其解决方案的关键在于提出一种混合量子-经典U-Net架构(HQ-UNet),在经典U-Net的瓶颈层集成一个紧凑的可训练量子电路,并引入非池化量子卷积模块,在编码器压缩特征后进行增强,从而在保持量子组件浅层和参数高效的同时,提升特征表示能力。实验表明,该方法在遥感图像分割任务上实现了优于经典U-Net的性能,验证了紧凑量子瓶颈在地球观测密集预测任务中的潜力。
链接: https://arxiv.org/abs/2604.27206
作者: Md Aminur Hossain,Ayush V. Patel,Ikshwaku Vanani,Biplab Banerjee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages
Abstract:Semantic segmentation in remote sensing is commonly addressed using classical deep learning architectures such as U-Net, which require a large number of parameters to model complex spatial relationships. Quantum machine learning (QML) provides an alternative representation paradigm by mapping classical features into quantum states, but its direct application to high-dimensional images remains challenging under near-term quantum hardware constraints. In this work, we propose HQ-UNet, a hybrid quantum-classical U-Net architecture that integrates a compact parameterized quantum circuit at the bottleneck of a classical U-Net. The proposed design uses a non-pooling quantum convolutional module to enrich highly compressed encoder features before decoding, while keeping the quantum component shallow and parameter-efficient. Experiments on the this http URL dataset show that HQ-UNet achieves a mean IoU of 0.8050 and an overall accuracy of 94.76%, outperforming the classical U-Net baseline. These results suggest that compact quantum bottlenecks can enhance feature representation for remote sensing image segmentation under near-term quantum constraints. This highlights the potential of hybrid quantum-classical designs as a promising direction for parameter-efficient dense prediction in Earth observation.
[CV-93] Energy-Efficient Plant Monitoring via Knowledge Distillation
【速读】:该论文旨在解决大规模视觉表征学习模型在植物物种识别与病害识别任务中计算资源消耗高、难以部署于移动端或边缘设备的问题,从而限制了自动化生物多样性监测和精准农业系统的可扩展性。其解决方案的关键在于采用知识蒸馏(Knowledge Distillation)技术,将大型预训练模型(如视觉Transformer或多模态基础模型)的表征能力迁移至更小、高效的架构中,实验证明该方法可在保持较高性能的同时显著降低计算成本,为实际环境应用中的高效、可扩展植物识别系统提供了可行路径。
链接: https://arxiv.org/abs/2604.27178
作者: Ilyass Moummad,Reda Bensaid,Kawtar Zaher,Hervé Goëau,Jean-Christophe Lombardo,Joseph Salmon,Pierre Bonnet,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in large-scale visual representation learning have significantly improved performance in plant species and plant disease recognition tasks. However, state-of-the-art models, often based on high-capacity vision transformers or multimodal foundation models, remain computationally expensive and difficult to deploy in resource-constrained environments such as mobile or edge devices. This limitation hinders the scalability of automated biodiversity monitoring and precision agriculture systems, where efficiency is as critical as accuracy. In this work, we investigate knowledge distillation as an effective approach to transfer the representational capacity of large pretrained models into smaller, more efficient architectures. We focus on plant species and disease recognition, and conduct an extensive empirical study on two challenging benchmarks: Pl@ntNet300K-v2 and Deep-Plant-Disease. We evaluate four representative architectures, including two ConvNeXt models and two vision transformers, under multiple training regimes: from-scratch training and pretrained initialization, each with and without distillation. In total, we train and evaluate 70 models. Our results show that knowledge distillation consistently improves performance across tasks and architectures. Distilled models are able to match the performance of significantly larger models while maintaining substantially lower computational cost. These findings demonstrate the potential of knowledge distillation techniques to enable efficient and scalable deployment of plant recognition systems in real-world environmental applications.
[CV-94] Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics
【速读】:该论文旨在解决生成式 AI (Generative AI) 在个体动物监测中的高GPU内存消耗问题,这一问题限制了其在商品化边缘加速器上的部署。解决方案的关键在于通过三阶段知识蒸馏策略压缩基于SAM 3的感知模型:首先采用基于TinyViT-21M-512构建的特征金字塔网络(Feature Pyramid Network)作为学生编码器;其次设计包含四项损失项的方向-尺度蒸馏损失函数以优化跨层特征对齐;最后引入骨干替换推理与滑动窗口会话剪枝机制,有效控制流式处理过程中的显存增长。最终实现模型参数减少7.77倍、峰值显存降低至6.49GB(原19.52GB),并可在NVIDIA Jetson Orin NX 16GB设备上运行,为边缘端长期动物行为识别与疾病关联分析提供可行路径。
链接: https://arxiv.org/abs/2604.27128
作者: Haiyu Yang,Miel Hostens
机构: Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Foundation-model pipelines for individual-level livestock monitoring – combining open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings – have raised the accuracy ceiling of precision livestock farming (PLF), but their GPU memory budgets exceed the envelope of commodity edge accelerators. To close this gap, the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 is distilled into a 40.66M-parameter multi-scale student through three mechanisms: a Feature Pyramid Network student encoder built on TinyViT-21M-512, a four-term direction-then-scale distillation loss, and backbone-substitution inference with sliding-window session pruning that bounds streaming GPU memory growth. The DINOv3 family includes a pre-distilled ViT-S/16 variant (21.6M parameters) released alongside a 6716M-parameter ViT-7B teacher; the ViT-S (21M) variant is adopted as the per-individual embedder. On the Edinburgh Pig dataset, the compressed pipeline reaches 92.29% MOTA and 96.15% IDF1 against the SAM 3 teacher (1.68- and 0.84-percentage-point losses), achieves a 7.77-fold reduction in system-level parameters and a 3.01-fold reduction in peak VRAM (19.52GB - 6.49GB), and reaches 97.34% top-1 accuracy with 91.67% macro-F1 on nine-class pig behaviour classification. The pipeline fits inside an NVIDIA Jetson Orin NX 16GB envelope with 4.9GB of headroom, supporting a proposed – but not yet empirically validated – on-device embedding-pool re-identification mechanism whose per-individual footprint of approximately 94MB per animal per year produces a longitudinal visual record amenable to retrospective association with disease, lameness, reproductive, and growth outcome labels.
[CV-95] InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification
【速读】:该论文旨在解决文本到图像行人重识别(Text-to-image person re-identification, TI-ReID)中模型决策缺乏可解释性的问题。现有方法依赖于槽注意力(slot-attention)来突出关注区域,但无法可靠地将视觉区域与语义概念绑定,导致解释局限于受限词汇的定性可视化。其解决方案的关键在于提出InterPartAbility框架,通过引入一个轻量级、开放词汇的补丁-短语交互模块(patch-phrase interaction module, PPIM),在训练阶段提供基于概念的细粒度指导,使模型能够进行显式的部件级匹配并实现短语-区域对齐(phrase-region grounding)。此外,该方法约束CLIP ViT自注意力机制以生成空间集中的补丁激活图,从而获得可量化的解释性热力图,并设计了一套基于扰动的可解释性评估协议(如反事实区域掩码),量化说明移除高解释性区域后检索性能的下降程度,实验证明其在CUHK-PEDES和ICFG-PEDES等基准上实现了SOTA的可解释性表现,同时保持了竞争力的检索准确率。
链接: https://arxiv.org/abs/2604.27122
作者: Shakeeb Murtaza,Aryan Shukla,Rajarshi Bhattacharya,Maguelonne Heritier,Eric Granger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image person re-identification (TI-ReID) relies on natural-language text description to retrieve top matching individuals from a large gallery of images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. A new open-vocabulary, lightweight supervision, patch-phrase interaction module (PPIM) is proposed to train a standard TI-ReID model with concept-level guidance. Concept-based part phrases provide evidence that encourages the model to attend to corresponding image regions. InterPartAbility further constrains CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. A quantitative interpretability protocol for TI-ReID is introduced by adapting perturbation-based evaluation metrics, including counterfactual region masking that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results\footnoteOur code is included in the supplementary materials and will be made public. on challenging benchmarks like CUHK-PEDES and ICFG-PEDES show that InterPartAbility achieves state-of-the-art (SOTA) interpretability performance under these metrics, while sustaining competitive retrieval accuracy.
[CV-96] Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations
【速读】:该论文旨在解决从稀疏观测(如单张或多张RGB-D图像)中准确重建复杂多物体场景的问题,这是计算机视觉中的核心挑战,也是实现机器人领域可扩展、可靠仿真系统的关键步骤。其解决方案的关键在于提出RecGen框架,该框架通过联合概率估计物体及其部件的形状与位姿,在遮挡和部分可见条件下仍能保持鲁棒性;其创新性地结合了组合式合成场景生成策略与强三维形状先验,从而在多样化的物体类型和真实环境中实现良好泛化能力,并在几何形状质量、纹理重建和位姿估计上显著优于此前最优方法SAM3D,且训练数据量减少近80%。
链接: https://arxiv.org/abs/2604.27106
作者: Andrii Zadaianchuk,Leonardo Barcellona,Lennard Schuenemann,Christian Gumbsch,Zehao Wang,Muhammad Zubair Irshad,Fabien Despinoy,Rahaf Aljundi,Stratis Gavves,Sergey Zakharov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Website: this https URL
Abstract:Accurately reconstructing complex full multi-object scenes from sparse observations remains a core challenge in computer vision and a key step toward scalable and reliable simulation for robotics. In this work, we introduce RecGen, a generative framework for probabilistic joint estimation of object and part shapes, as well as their pose under occlusion and partial visibility from one or multiple RGB-D images. By leveraging compositional synthetic scene generation and strong 3D shape priors, RecGen generalizes across diverse object types and real-world environments. RecGen achieves state-of-the-art performance on complex, heavily occluded datasets, robustly handling severe occlusions, symmetric objects, object parts, and intricate geometry and texture. Despite using nearly 80% fewer training meshes than the previous state of the art SAM3D, RecGen outperforms it by 30.1% in geometric shape quality, 9.1% in texture reconstruction, and 33.9% in pose estimation.
[CV-97] Automated Detection of Mutual Gaze and Joint Attention in Dual-Camera Settings via Dual-Stream Transformers
【速读】:该论文旨在解决多摄像头实验室环境中自动检测亲子互动中的相互凝视(mutual gaze, MG)与共同注意(joint attention, JA)的计算挑战问题,传统方法依赖人工编码效率低下且难以扩展。其解决方案的关键在于提出一种高效的双流Transformer架构,通过冻结预训练的注视感知骨干网络(GazeLLE)提取丰富的视觉先验信息,并设计定制化的token融合机制来建模交互对之间空间与语义关系,从而实现高精度的MG和JA识别。该方法在真实生态情境的数据集上表现优异,显著优于卷积基线模型和当前先进的多模态大语言模型(multimodal Large Language Model, LLM),并开源模型与预训练权重以支持行为科学家在不同实验场景下的灵活微调。
链接: https://arxiv.org/abs/2604.27105
作者: Jakub Kosmydel,Paweł Gajewski,Arkadiusz Białek
机构: AGH University (AGH大学); Jagiellonian University (雅盖隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Analyzing mutual gaze (MG) and joint attention (JA) is critical in developmental psychology but traditionally relies on labor-intensive manual coding. Automating this process in multi-camera laboratory settings is computationally challenging due to complex cross-camera relational dynamics. In this paper, we propose a highly efficient dual-stream Transformer architecture for detecting MG and JA from synchronized dual-camera recordings. Our approach leverages frozen gaze-aware backbones (GazeLLE) to extract rich visual priors, combined with a custom token fusion mechanism to map the spatial and semantic relationships between interacting dyads. Evaluated on an ecologically valid dataset of caregiver-infant interactions, our model exhibits good performance, significantly outperforming both a convolutional baseline and a state-of-the-art multimodal Large Language Model (LLM). By open-sourcing our model and pre-trained weights, we provide behavioral scientists with a scalable tool that can be fine-tuned to diverse laboratory environments, effectively bridging the gap between computational modeling and applied interaction research.
[CV-98] Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices
【速读】:该论文旨在解决在资源受限的边缘设备上部署深度学习目标检测模型时,如何权衡模型精度(如平均精度均值 mAP)、推理速度与能耗的问题。其关键解决方案在于系统性地评估多种主流轻量化目标检测模型(包括 YOLOv8、EfficientDet Lite 和 SSD 系列)在不同边缘硬件平台(如 Raspberry Pi 3/4/5 及 Jetson Orin Nano)上的综合性能表现,明确指出低精度模型(如 SSD MobileNet V1)通常更具能效优势,而高精度模型(如 YOLOv8 Medium)虽推理较慢且耗能较高,但在使用 TPU 加速器时可显著改善性能;同时发现 Jetson Orin Nano 在请求处理效率方面最优,尽管其空闲功耗最高,为边缘智能应用中的模型-硬件协同优化提供了实证依据和选择指南。
链接: https://arxiv.org/abs/2409.16808
作者: Daghash K. Alqahtani,Aamir Cheema,Adel N. Toosi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:
Abstract:Modern applications, such as autonomous vehicles, require deploying deep learning algorithms on resource-constrained edge devices for real-time image and video processing. However, there is limited understanding of the efficiency and performance of various object detection models on these devices. In this paper, we evaluate state-of-the-art object detection models, including YOLOv8 (Nano, Small, Medium), EfficientDet Lite (Lite0, Lite1, Lite2), and SSD (SSD MobileNet V1, SSDLite MobileDet). We deployed these models on popular edge devices like the Raspberry Pi 3, 4, and 5 with/without TPU accelerators, and Jetson Orin Nano, collecting key performance metrics such as energy consumption, inference time, and Mean Average Precision (mAP). Our findings highlight that lower mAP models such as SSD MobileNet V1 are more energy-efficient and faster in inference, whereas higher mAP models like YOLOv8 Medium generally consume more energy and have slower inference, though with exceptions when accelerators like TPUs are used. Among the edge devices, Jetson Orin Nano stands out as the fastest and most energy-efficient option for request handling, despite having the highest idle energy consumption. These results emphasize the need to balance accuracy, speed, and energy efficiency when deploying deep learning models on edge devices, offering valuable guidance for practitioners and researchers selecting models and devices for their applications.
[CV-99] Culture-inspired Multi-modal Color Palette Generation and Colorization: A Chinese Youth Subculture Case
【速读】:该论文旨在解决现有算法在色彩调色板生成与着色过程中忽视文化语境的问题,特别是针对中国青年亚文化(Chinese Youth Subculture, CYS)这一具有鲜明审美与语义特征的文化群体,其色彩使用规律与通用色彩理论存在显著差异。解决方案的关键在于构建一个受CYS启发的独特色彩数据集,并开发一种交互式多模态生成框架,该框架能够基于人类反馈(human-in-the-loop)持续优化模型输出,从而实现对图像的自动化CYS风格着色,有效融合文化语义与视觉美学。
链接: https://arxiv.org/abs/2102.05231
作者: Yufan Li,Jinggang Zhuo,Ling Fan,Harry Jiannan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted by the 3rd IEEE Workshop on Artificial Intelligence for Art Creation
Abstract:Color is an essential component of graphic design, acting not only as a visual factor but also carrying cultural implications. However, existing research on algorithmic color palette generation and colorization largely ignores the cultural aspect. In this paper, we contribute to this line of research by first constructing a unique color dataset inspired by a specific culture, i.e., Chinese Youth Subculture (CYS), which is an vibrant and trending cultural group especially for the Gen Z population. We show that the colors used in CYS have special aesthetic and semantic characteristics that are different from generic color theory. We then develop an interactive multi-modal generative framework to create CYS-styled color palettes, which can be used to put a CYS twist on images using our automatic colorization model. Our framework is illustrated via a demo system designed with the human-in-the-loop principle that constantly provides feedback to our algorithms. User studies are also conducted to evaluate our generation results.
[CV-100] Physically-Informed Fuzzy Clustering of Vertical Sounding Ionograms
【速读】:该论文旨在解决电离层垂直探测(vertical sounding)中离子图(ionogram)自动分轨问题,即如何从复杂的离子图中自动识别并分离出多个传播路径(track),以支持后续的物理参数提取与解释。传统方法往往依赖人工干预或预设轨道数量,难以应对电离层扰动条件下轨道数未知的情况。解决方案的关键在于提出一种物理信息引导的模糊聚类模型(physically-informed fuzzy clustering),其核心包括:1)基于期望最大化(Expectation-Maximization, EM)算法进行聚类,并引入参数化曲线作为轨道模型(接近抛物线电离层层模型);2)每条轨道由六个参数描述(含临界频率、底层边界、半宽及三个用于考虑下层影响的附加参数);3)通过最小化改进的贝叶斯信息准则(modified Bayesian Information Criterion, BIC)自动确定最优轨道数;4)结合DBSCAN与高斯混合模型实现自适应噪声滤波,并在无硬件区分O模和X模的条件下预先去除X模式点,从而提升聚类质量。
链接: https://arxiv.org/abs/2604.27721
作者: Oleg I.Berngardt,Sergey N.Ponomarchuk
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an); Space Physics (physics.space-ph)
备注: 31 pages, 8 figures
Abstract:This paper presents a physically-informed fuzzy clustering of vertical sounding ionograms for automatically separating the ionogram into tracks suitable for further interpretation and determining their optimal number. The model is designed for use not only in conditions where the number of tracks is known, but also in disturbed ionospheric conditions where the number of tracks is preliminary unknown. The method is based on an expectation-maximization algorithm, used for clustering, and on parametrically specified distributions of distances from points to parametrically specified curves. The curves used as track models are close to model tracks in the parabolic ionospheric layer model. The resulting model of each track has six parameters: three standard ones (the critical frequency, the lower boundary of the layer, and its half-width), and three additional ones to take into account possible underlying layer effects. By sequentially increasing the number of tracks and optimizing their parameters, the model finds the optimal number of tracks on the ionogram by minimizing the modified Bayesian information criterion. The Sequential Least Squares Quadratic Programming algorithm is used to find the parameters of a single track. The width of each single track is assumed to be unknown constant found during fitting process. To improve the quality of ionogram clustering, automatic adaptive noise filtering is performed before clustering. This filtering is based on a combination of the DBSCAN and Gaussian Mixture algorithms. Also, to improve clustering quality on an ionosonde without hardware separation of the ordinary and extraordinary components, a preliminary approximate removal of points belonging to the extraordinary mode is performed. Comments: 31 pages, 8 figures Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an); Space Physics (physics.space-ph) Cite as: arXiv:2604.27721 [physics.ao-ph] (or arXiv:2604.27721v1 [physics.ao-ph] for this version) https://doi.org/10.48550/arXiv.2604.27721 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-101] An Extended Evaluation Split for DeepSpaceYoloDataset
【速读】:该论文旨在解决当前深空天体(Deep Sky Objects)检测模型在实际应用中因数据集多样性不足而导致的泛化能力有限的问题。针对这一问题,解决方案的关键在于对已有的DeepSpaceYoloDataset进行扩展,新增了一个名为test2026的测试集分割,该分割专门设计用于评估检测模型在更广泛图像多样性下的性能表现,从而提升模型在电子辅助天文观测(Electronically Assisted Astronomy)场景中的实用性和鲁棒性。
链接: https://arxiv.org/abs/2604.27593
作者: Olivier Parisot
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures
Abstract:Recent technological advances in astronomy, particularly the growing popularity of smart telescopes for the general public, make it possible to develop highly effective detection solutions that are accessible to a wide audience, rather than being reserved for major scientific observatories. Published in 2023, DeepSpaceYoloDataset is a collection of annotated images created to train YOLO-based models for detecting Deep Sky Objects, particularly suited for Electronically Assisted Astronomy. In this paper, we present an update to DeepSpaceYoloDataset with the addition of a new split, test2026, designed to evaluate detection models with a greater diversity of images.
[CV-102] A Real-time Scale-robust Network for Glottis Segmentation in Nasal Transnasal Intubation
【速读】:该论文旨在解决鼻气管插管(Nasotracheal Intubation, NTI)过程中视觉辅助导航中因复杂解剖环境、光照条件不佳及声门(glottis)尺度变化大导致的检测精度低与实时性差的问题。其解决方案的关键在于提出了一种专为NTI优化的轻量化声门分割框架:首先设计了一个多感受野(multi-receptive field)特征提取模块,有效减少类内差异并提升对尺度变化的鲁棒性;其次引入改进的标签分配策略并重新定义样本数量,进一步增强在复杂场景下的分割准确性;最终构建的网络在保持仅19 MB模型体积的同时,实现了超过170帧/秒的推理速度和92.9%的mDice分割指标,显著优于现有方法。
链接: https://arxiv.org/abs/2604.27383
作者: Yang Zhou,Chaoyong Zhang,Ruoyi Hao,Huilin Pan,Yang Zhang,Hongliang Ren
机构: Huazhong University of Science and Technology (华中科技大学); Hubei University of Technology (湖北工业大学); Nanjing University (南京大学); The Chinese University of Hong Kong (香港中文大学); Shun Hing Institute of Advanced Engineering (香港中文大学); National University of Singapore (新加坡国立大学); NUS (Suzhou) Research Institute (新加坡国立大学苏州研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures
Abstract:Nasotracheal intubation (NTI) is a critical clinical procedure for establishing and maintaining patient airway patency. Machine-assisted NTI has emerged as a pivotal approach for optimizing procedural efficiency and minimizing manual intervention. However, visual detection algorithms employed for NTI navigation encounter significant challenges, including complex anatomical environments and suboptimal illumination conditions surrounding the glottis. Additionally, the glottis presents considerable scale variability throughout the procedure, initially appearing as a small, difficult-to-capture structure before expanding to occupy nearly the entire field of view. Moreover, traditional visual detection methods often have high computational costs, making real-time, high-precision detection on portable devices challenging. To enhance NTI efficacy and address these challenges, this paper proposes a novel glottis segmentation framework optimized for vision-assisted NTI applications. First, we designed a lightweight, multi-receptive field feature extraction module to reduce intra-class differences, achieving robustness to scale variations of the glottis. This module was then stacked to form the backbone and neck of our network. Subsequently, we developed an advanced label assignment method and redefined the number of samples to further reduce intra-class differences and enhance accuracy in the complex NTI environment. Experiments on three distinct datasets demonstrate that our network surpasses state-of-the-art algorithms, achieving a segmentation mDice of 92.9% with a compact model size of 19 MB and an inference speed exceeding 170 frames per second. % Our code and datasets will be open-sourced on GitHub after the manuscript is accepted. Our code and datasets are available at this https URL.
[CV-103] Spectral Dynamic Attention Network for Hyperspectral Image Super-Resolution
【速读】:该论文旨在解决高光谱图像超分辨率(Hyperspectral Image Super-Resolution, HISR)中因光谱冗余导致的性能瓶颈以及传统前馈网络(Feed-Forward Network, FFN)非线性建模能力有限的问题。其解决方案的关键在于提出 Spectral Dynamic Attention Network (SDANet),该框架包含两个核心组件:一是动态通道稀疏注意力模块(Dynamic Channel Sparse Attention, DCSA),通过计算通道间相关性并基于数据自适应地进行稀疏化,选择性保留最具信息量的注意力响应以抑制冗余光谱交互;二是频域增强前馈网络(Frequency-Enhanced Feed-Forward Network, FE-FFN),联合建模空间域与频域特征表示,从而显著提升模型的非线性表达能力。实验表明,SDANet在保持高效性的同时实现了当前最优的HISR性能。
链接: https://arxiv.org/abs/2604.27326
作者: Tengya Zhang,Feng Gao,Lin Qi,Junyu Dong,Qian Du
机构: Sanya Oceanographic Institution, Ocean University of China (三亚海洋研究所,中国海洋大学); State Key Laboratory of Physical Oceanography, Ocean University of China (物理海洋教育部重点实验室,中国海洋大学); Department of Electrical and Computer Engineering, Mississippi State University (电气与计算机工程系,密西西比州立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE GRSL 2026
Abstract:Hyperspectral image super-resolution is essential for enhancing the spatial fidelity of HSI data, yet existing deep learning methods often struggle with substantial spectral redundancy and the limited non-linear modeling capacity of standard feed-forward networks (FFNs). To address these challenges, we propose Spectral Dynamic Attention Network (SDANet), a framework designed to adaptively suppress redundant spectral interactions. SDANet integrates two key components: 1) Dynamic Channel Sparse Attention (DCSA) module that computes channel-wise correlations and selectively preserves the most informative attention responses through dynamic and data-dependent sparsification. 2) Frequency-Enhanced Feed-Forward Network (FE-FFN) that jointly models spatial and frequency-domain representations to enhance non-linear expressiveness. Extensive experiments on two benchmark datasets demonstrate that SDANet achieves state-of-the-art HISR performance while maintaining competitive efficiency. The code will be made publicly available at this https URL.
[CV-104] Representative Spectral Correlation Network for Multi-source Remote Sensing Image Classification
【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)与合成孔径雷达(SAR)/激光雷达(LiDAR)数据在地表覆盖分类中融合困难的问题,主要挑战在于高维HSI中的光谱冗余以及多源数据间的异构特性。解决方案的关键在于提出一种名为代表性光谱相关网络(Representative Spectral Correlation Network, RSCNet)的新框架,其核心创新包括:(1)关键波段选择模块(Key Band Selection Module, KBSM),通过跨源引导自适应选择任务相关的光谱波段,缓解冗余并减少传统主成分分析(PCA)导致的信息损失,同时获得具有判别性的光谱结构;(2)跨源自适应融合模块(Cross-source Adaptive Fusion Module, CAFM),通过跨源注意力加权和局部-全局上下文优化增强多源特征交互,从而实现高效且精准的多源信息融合。
链接: https://arxiv.org/abs/2604.27323
作者: Chuanzheng Gong,Feng Gao,Junyan Lin,Junyu Dong,Qian Du
机构: Ocean University of China (中国海洋大学); The Hong Kong Polytechnic University (香港理工大学); Mississippi State University (密西西比州立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE TGRS 2026
Abstract:Hyperspectral image (HSI) and SAR/LiDAR data offer complementary spectral and structural information for land-cover classification. However, their effective fusion remains challenging due to two major limitations: The spectral redundancy in high-dimensional HSI and the heterogeneous characteristics between multi-source data. To this end, we propose Representative Spectral Correlation Network (RSCNet), a novel multi-source image classification framework specifically designed to address the above challenges through spectral selection and adaptive interaction. The network incorporates two key components: (1) Key Band Selection Module (KBSM) that adaptively selects task-relevant spectral bands from the original HSI under cross-source guidance, thereby alleviating redundancy and mitigating information loss from conventional PCA-based spectral reduction. Moreover, the learned band subset exhibits highly discriminative spectral structures that align with discriminative semantic cues, promoting compact yet expressive representations. (2) Cross-source Adaptive Fusion Module (CAFM) that performs cross-source attention weighting and local-global contextual refinement to enhance cross-source feature interaction. Experiments on three public benchmark datasets demonstrate that our RSCNet achieves superior performance compared with state-of-the-art methods, while maintaining substantially lower computational complexity. Our codes are publicly available at this https URL.
人工智能
[AI-0] Computing Equilibrium beyond Unilateral Deviation
【速读】:该论文旨在解决博弈论中传统均衡概念(如纳什均衡和相关均衡)无法有效抵御联盟协同偏离的问题,这类均衡仅保证单个玩家无法通过单边偏离获利,但对多玩家联合偏离缺乏约束力。尽管已有强纳什均衡和联盟鲁棒均衡等概念试图提供多边稳定性,但它们往往不存在。论文提出一种替代性解决方案:不强制要求联盟偏离激励完全消失,而是最小化联盟偏离的平均收益(或加权平均、最大收益),从而在保证解存在性的前提下实现近似稳定性。其关键创新在于将问题转化为优化目标函数——最小化联盟偏离收益,并证明了平均收益与最大收益目标的计算复杂度下界,同时设计出匹配该下界的算法;此外,该框架还被用于求解“可利用性福利前沿”(Exploitability Welfare Frontier, EWF),即在给定可利用性(最大单边偏离收益)约束下的最大社会福利。
链接: https://arxiv.org/abs/2604.28186
作者: Mingyang Liu,Gabriele Farina,Asuman Ozdaglar
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
备注:
Abstract:Most familiar equilibrium concepts, such as Nash and correlated equilibrium, guarantee only that no single player can improve their utility by deviating unilaterally. They offer no guarantees against profitable coordinated deviations by coalitions. Although the literature proposes solution concepts that provide stability against multilateral deviations (\emphe.g., strong Nash and coalition-proof equilibrium), these generally fail to exist. In this paper, we study an alternative solution concept that minimizes coalitional deviation incentives, rather than requiring them to vanish, and is therefore guaranteed to exist. Specifically, we focus on minimizing the average gain of a deviating coalition, and extend the framework to weighted-average and maximum-within-coalition gains. In contrast, the minimum-gain analogue is shown to be computationally intractable. For the average-gain and maximum-gain objectives, we prove a lower bound on the complexity of computing such an equilibrium and present an algorithm that matches this bound. Finally, we use our framework to solve the \emphExploitability Welfare Frontier (EWF), the maximum attainable social welfare subject to a given exploitability (the maximum gain over all unilateral deviations).
[AI-1] LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis IJCAI ECAI2026
【速读】:该论文旨在解决脑电图(Electroencephalogram, EEG)信号中固有噪声导致图结构构建冗余或无关边的问题,从而影响图表示质量和癫痫发作检测等下游任务的性能。其解决方案的关键在于提出一个两阶段框架:首先利用基于Transformer的边预测器与多层感知机(MLP)构建初始概率化图结构,随后引入大语言模型(Large Language Model, LLM)作为边集精炼器,结合节点对的文本描述和统计特征进行语义层面的验证与优化,有效去除冗余连接,提升图结构的可解释性与任务性能。
链接: https://arxiv.org/abs/2604.28178
作者: Lincan Li,Zheng Chen,Yushun Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper is accepted by the 35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026)
Abstract:Electroencephalogram (EEG) signals are vital for automated seizure detection, but their inherent noise makes robust representation learning challenging. Existing graph construction methods, whether correlation-based or learning-based, often generate redundant or irrelevant edges due to the noisy nature of EEG data. This significantly impairs the quality of graph representation and limits downstream task performance. Motivated by the remarkable reasoning and contextual understanding capabilities of large language models (LLMs), we explore the idea of using LLMs as graph edge refiners. Specifically, we propose a two-stage framework: we first verify that LLM-based edge refinement can effectively identify and remove redundant connections, leading to significant improvements in seizure detection accuracy and more meaningful graph structures. Building on this insight, we further develop a robust solution where the initial graph is constructed using a Transformer-based edge predictor and multilayer perceptron, assigning probability scores to potential edges and applying a threshold to determine their existence. The LLM then acts as an edge set refiner, making informed decisions based on both textual and statistical features of node pairs to validate the remaining connections. Extensive experiments on TUSZ dataset demonstrate that our LLM-refined graph learning framework not only enhances task performance but also yields cleaner and more interpretable graph representations.
[AI-2] Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
【速读】:该论文旨在解决现有科研基础设施以文献为中心、缺乏对方法演进关系显式建模的问题,尤其在生成式 AI (Generative AI) 驱动的研究代理(research agents)日益成为科学知识消费者的情境下,传统非结构化文本难以支持其可靠重构方法演化拓扑的局限性。解决方案的关键在于构建 Intern-Atlas——一个自动识别方法级实体、推断方法间谱系关系并捕捉推动连续创新瓶颈的方法演化图(methodological evolution graph),该图基于 103 万篇 AI 相关文献构建,包含 941 万条语义标记边,每条边均源自原始文本证据,形成可查询的因果网络;进一步提出自引导的时间树搜索算法以构造演化链,从而实现对方法随时间演进路径的精准追踪与分析。
链接: https://arxiv.org/abs/2604.28158
作者: Yujun Wu,Dongxu Zhang,Xinchen Li,Jinhang Xu,Yiling Duan,Yumou Liu,Jiabao Pan,Xuanhe Zhou,Jingxuan Wei,Siyuan Li,Jintao Chen,Conghui He,Cheng Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 5 figures, 8 tables
Abstract:Existing research infrastructure is fundamentally document-centric, providing citation links between papers but lacking explicit representations of methodological evolution. In particular, it does not capture the structured relationships that explain how and why research methods emerge, adapt, and build upon one another. With the rise of AI-driven research agents as a new class of consumers of scientific knowledge, this limitation becomes increasingly consequential, as such agents cannot reliably reconstruct method evolution topologies from unstructured text. We introduce Intern-Atlas, a methodological evolution graph that automatically identifies method-level entities, infers lineage relationships among methodologies, and captures the bottlenecks that drive transitions between successive innovations. Built from 1,030,314 papers spanning AI conferences, journals, and arXiv preprints, the resulting graph comprises 9,410,201 semantically typed edges, each grounded in verbatim source evidence, forming a queryable causal network of methodological development. To operationalize this structure, we further propose a self-guided temporal tree search algorithm for constructing evolution chains that trace the progression of methods over time. We evaluate the quality of the resulting graph against expert-curated ground-truth evolution chains and observe strong alignment. In addition, we demonstrate that Intern-Atlas enables downstream applications in idea evaluation and automated idea generation. We position methodological evolution graphs as a foundational data layer for the emerging automated scientific discovery.
[AI-3] FlexiTac: A Low-Cost Open-Source Scalable Tactile Sensing Solution for Robotic Systems
【速读】:该论文旨在解决机器人末端执行器在触觉感知方面存在的高成本、难扩展及难以集成的问题,特别是缺乏一种低成本、可扩展且易于部署的压阻式触觉传感解决方案。其关键解决方案是提出FlexiTac模块,该模块由两部分组成:一是采用密封三层层压结构(FPC-Velostat-FPC)的柔性触觉传感器垫,通过将电极图案直接集成到柔性印刷电路中,显著提升制造效率与一致性,同时保持机械柔顺性以适配刚性和软性夹爪;二是紧凑型多通道读出板,利用低成本通用元器件实现100 Hz同步数据流传输至主机计算机,支持实时控制和大规模数据采集。此设计使FlexiTac可在多种平台无重大机械改造下灵活部署,并兼容现代触觉学习流程,如三维视觉-触觉融合、跨体态技能迁移及真实-仿真-真实微调。
链接: https://arxiv.org/abs/2604.28156
作者: Binghao Huang,Yunzhu Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Website: this https URL
Abstract:We present FlexiTac, a low-cost, open-source, and scalable piezoresistive tactile sensing solution designed for robotic end-effectors. FlexiTac is a practical “plug-in” module consisting of (i) thin, flexible tactile sensor pads that provide dense tactile signals and (ii) a compact multi-channel readout board that streams synchronized measurements for real-time control and large-scale data collection. FlexiTac pads adopt a sealed three-layer laminate stack (FPC-Velostat-FPC) with electrode patterns directly integrated into flexible printed circuits, substantially improving fabrication throughput and repeatability while maintaining mechanical compliance for deployment on both rigid and soft grippers. The readout electronics use widely available, low-cost components and stream tactile signals to a host computer at 100 Hz via serial communication. Across multiple configurations, including fingertip pads and larger tactile mats, FlexiTac can be mounted on diverse platforms without major mechanical redesign. We further show that FlexiTac supports modern tactile learning pipelines, including 3D visuo-tactile fusion for contact-aware decision making, cross-embodiment skill transfer, and real-to-sim-to-real fine-tuning with GPU-parallel tactile simulation. Our project page is available at this https URL.
[AI-4] Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)代理在评估过程中存在的静态性和不可验证性问题,即现有基准测试通常冻结任务集并仅评价最终响应,难以反映代理对动态工作流需求的适应能力或验证其是否真正执行了任务。解决方案的关键在于提出 Claw-Eval-Live——一个实时基准测试框架,它将可更新的外部工作流需求信号层与可复现的时间戳发布快照分离,通过公开的工作流需求信号构建可控任务集,并利用确定性检查和结构化生成式 AI(Generative AI)判断相结合的方式进行评分,从而实现对代理在真实、动态工作流中行为的可靠验证与量化评估。
链接: https://arxiv.org/abs/2604.28139
作者: Chenxin Li,Zhengyang Tang,Huangxin Lin,Yunlong Lin,Shijue Huang,Shengyuan Liu,Bowen Ye,Rang Li,Lei Li,Benyou Wang,Yixuan Yuan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.
[AI-5] Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes
【速读】:该论文旨在解决自主代理(Autonomous Agents)在沙箱容器和微虚拟机(microVMs)中运行时,因状态跨越文件系统、进程和运行时资源而导致的检查点与恢复(Checkpoint and Restore, C/R)难题。现有方法要么仅依赖应用层恢复(如保留聊天历史),忽略操作系统层面的状态变化,导致恢复不完整;要么采用每轮全量检查点,虽正确但成本过高,尤其在高密度共置场景下不可持续。其根本原因在于代理框架与操作系统之间存在语义鸿沟:代理只感知工具调用,而OS无法判断哪些状态变更对恢复有意义。该论文提出Crab系统,通过eBPF驱动的检测器识别每轮操作中对恢复相关的OS级影响,协调器将检查点对齐至回合边界并利用LLM等待时间重叠执行C/R,主机级调度引擎优化多个沙箱间的检查点流量。关键创新在于透明地弥合代理-OS语义鸿沟,显著提升恢复准确性(从8%提升至100%),减少高达87%的检查点流量,并保持接近无故障执行的性能开销(<1.9%)。
链接: https://arxiv.org/abs/2604.28138
作者: Tianyuan Wu,Chaokun Chang,Lunxi Cao,Wei Gao,Wei Wang
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
备注: 15 pages, 21 figures
Abstract:Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching, and safe rollback-yet existing approaches fall into two extremes: application-level recovery preserves chat history but misses OS-side effects, while full per-turn checkpointing is correct but too expensive under dense co-location. The root cause is an agent-OS semantic gap: agent frameworks see tool calls but not their OS effects; the OS sees state changes but lacks turn-level context to judge recovery relevance. This gap hides massive sparsity: over 75% of agent turns produce no recovery-relevant state, so most checkpoints are unnecessary. Crab (Checkpoint-and-Restore for Agent SandBoxes) is a transparent host-side runtime that bridges this gap without modifying agents or C/R backends. An eBPF-based inspector classifies each turn’s OS-visible effects to decide checkpoint granularity; a coordinator aligns checkpoints with turn boundaries and overlaps C/R with LLM wait time; and a host-scoped engine schedules checkpoint traffic across co-located sandboxes. On shell-intensive and code-repair workloads, Crab raises recovery correctness from 8% (chat-only) to 100%, cuts checkpoint traffic by up to 87%, and stays within 1.9% of fault-free execution time.
[AI-6] Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
【速读】:该论文旨在解决多轮提示注入攻击(multi-turn prompt injection)中隐蔽性高、传统文本级防御易被绕过的问题,此类攻击通过“建立信任-转移-升级”三阶段路径实施,且每一轮交互在文本层面均表现为良性。其关键解决方案是识别并利用模型残差流(residual stream)中的激活层面特征——即“对抗躁动性”(adversarial restlessness),该特征表现为攻击路径各阶段切换时激活空间的显著位移,导致整体轨迹长度远超正常对话。研究发现五类捕捉此信号的标量轨迹特征可将对话级检测准确率从76.2%提升至93.8%,且该信号在四个不同规模(24B–70B参数)的模型家族中具有一致性,但探测器需针对具体架构定制。此外,实验表明训练数据来源对泛化能力至关重要:仅使用单一来源数据时检测效果有限,而融合合成数据、LMSYS-Chat-1M与SafeDialBench三源训练可实现89.4%的检测率与2.4%假阳性率,同时强调了细粒度的三阶段turn-level标签(良性/转移/对抗)对于降低误报的关键作用。
链接: https://arxiv.org/abs/2604.28129
作者: Prashant Kulkarni
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-turn prompt injection follows a known attack path – trust-building, pivoting, escalation but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model’s residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels(benign/pivoting/adversarial) unique to our synthetic dataset are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.
[AI-7] Do Sparse Autoencoders Capture Concept Manifolds?
【速读】:该论文试图解决的问题是:当前稀疏自编码器(Sparse Autoencoders, SAEs)在提取神经网络表示中的可解释特征时,普遍假设概念对应于独立的线性方向,但越来越多的证据表明,许多概念实际上沿低维流形(low-dimensional manifolds)组织,这些流形编码了连续的几何关系。论文旨在明确SAE如何捕捉这类流形结构、何时以及以何种方式实现这一目标。其解决方案的关键在于构建一个理论框架,揭示SAEs捕捉流形的两种根本不同机制:全局方式(通过一组紧凑的原子,其线性张成包含整个流形)和局部方式(将流形分布到多个特征中,每个特征仅选择性地覆盖几何空间的一个受限区域)。研究进一步发现,现有SAE架构通常以一种称为“稀释”(dilution)的碎片化方式混合这两种机制,导致连续结构难以在单个概念层面显现,从而推动后处理无监督发现方法关注原子组而非孤立方向,以提升对几何对象的整体可解释性建模能力。
链接: https://arxiv.org/abs/2604.28119
作者: Usha Bhalla,Thomas Fel,Can Rager,Sheridan Feucht,Tal Haklay,Daniel Wurgaft,Siddharth Boppana,Matthew Kowal,Vasudev Shyam,Jack Merullo,Atticus Geiger,Ekdeep Singh Lubana
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
[AI-8] DEFault: Automated Fault Detection Categorization and Diagnosis for Transformer Architectures
【速读】:该论文旨在解决生成式 AI(Generative AI)中 Transformer 模型因注意力机制、投影层等内部组件故障导致的行为退化问题,此类故障通常不会触发运行时错误,却难以定位。现有诊断方法针对通用深度神经网络设计,无法识别具体是 Transformer 的哪个组件引发异常。解决方案的关键在于提出 DEFault++,一种分层学习驱动的诊断技术:它通过构建基于 Transformer 架构的故障传播图(Fault Propagation Graph, FPG),在组件级别测量运行时行为,并利用原型匹配与监督对比学习实现可解释诊断;同时开发了 DEFault-bench 基准数据集(包含 3,739 个标签实例),支持模型训练与评估,最终在检测、分类和根因定位任务上分别达到 AUROC > 0.96 和 Macro-F1 > 0.85,显著提升开发者修复准确率(从 57.1% 提升至 83.3%)。
链接: https://arxiv.org/abs/2604.28118
作者: Sigma Jahan,Saurabh Singh Rajput,Tushar Sharma,Mohammad Masudur Rahman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 71 pages, 15 figures, 22 tables. Preprint; under preparation for journal submission. Standalone version of Chapter 7 of the lead author’s PhD thesis (Dalhousie University, 2026). Replication package: this https URL
Abstract:Transformer models are widely deployed in critical AI applications, yet faults in their attention mechanisms, projections, and other internal components often degrade behavior silently without raising runtime errors. Existing fault diagnosis techniques often target generic deep neural networks and cannot identify which transformer component is responsible for an observed symptom. In this article, we present DEFault++, a hierarchical learning-based diagnostic technique that operates at three level of abstraction: it detects whether a fault is present, classifies it into one of 12 transformer-specific fault categories (covering both attention-internal mechanisms and surrounding architectural components), and identifies the underlying root cause from up to 45 mechanisms. To facilitate both training and evaluation, we construct DEFault-bench, a benchmark of 3,739 labeled instances obtained through systematic mutation testing. These instances are created across seven transformer models and nine downstream tasks using DEForm, a transformer-specific mutation technique we developed for this purpose. DEFault++ measures runtime behavior at the level of individual transformer components. It organizes these measurements through a Fault Propagation Graph (FPG) derived from the transformer architecture. It then produces an interpretable diagnosis using prototype matching combined with supervised contrastive learning. On DEFault-bench, DEFault++ exceeds an AUROC of 0.96 for detection and a Macro-F1 of 0.85 for both categorization and root-cause diagnosis on encoder and decoder architectures. In a developer study with 21 practitioners, the accuracy of choosing correct repair actions increased from 57.1% without support to 83.3% when using DEFault++.
[AI-9] Splitting Argumentation Frameworks with Collective Attacks and Supports
【速读】:该论文旨在解决复杂论证形式化系统中因表达能力增强而带来的结构分割难题,特别是针对包含支持关系的双极集合论证框架(Bipolar Set-based Argumentation Frameworks, BSAFs)的分割问题。其关键解决方案在于提出三类新颖的分割策略:基于集体攻击的分割(扩展近期针对SETAFs的分割技术)、基于集体支持的分割,以及同时针对集体攻击与支持的联合分割;并通过构建合适的分割模式并证明其在主流论证语义下的正确性,实现了对BSAFs这一更通用形式化框架的有效分解与分析。
链接: https://arxiv.org/abs/2604.28112
作者: Matti Berthold,Lydia Blümel,Giovanni Buraglio,Anna Rapberger
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: Extended version of a paper presented at the 23rd International Conference on Principles of Knowledge Representation and Reasoning July 20-23, 2026 - Lisbon, Portugal, 27 pages
Abstract:This work proposes novel splitting techniques for argumentation formalisms that incorporate supports between defeasible elements. We base our studies on bipolar set-based argumentation frameworks (BSAFs) which generalize argumentation frameworks with collective attacks (SETAFs), as well as bipolar argumentation frameworks (BAFs), by incorporating both collective attacks and supports. Notably, BSAFs establish a crucial link to structured argumentation as they naturally capture general (potentially non-flat) assumption-based argumentation. The increase in expressiveness calls for diverse forms of splitting. We consider splits over collective attacks (thereby generalizing the recently proposed splitting techniques for SETAFs), splits over collective supports, as well as splits over both collective attacks and supports. We establish suitable splitting schemata and prove their correctness for the most common argumentation semantics.
[AI-10] What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial Difficult and Legible Evaluation Design
【速读】:该论文旨在解决当前终端代理基准测试(Terminal-agent benchmarks)中普遍存在的一类问题:即任务设计者将基准任务的编写方式等同于提示(prompt)设计,导致任务缺乏对抗性、可验证性和真实性,从而误导对大语言模型(Large Language Models, LLMs)编码与系统管理能力的评估。其解决方案的关键在于明确区分“提示”与“基准任务”的本质差异——提示以引导代理成功为目标,而基准任务应以识别代理真实能力为宗旨。作者提出,高质量的任务必须具备对抗性(adversarial)、难度适中且可读性强(legible),并系统归纳了包括AI生成指令、过度约束规格、文书性困难、隐含知识假设、错误验证逻辑和奖励漏洞(reward-hackable)在内的常见失败模式,强调真正的难度源于概念层面而非环境复杂度,并指出超过15%的主流终端代理基准任务存在奖励漏洞,呼吁基准维护者和研究者采用更严谨的任务设计方法。
链接: https://arxiv.org/abs/2604.28093
作者: Ivan Bercovich
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn’t. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes – AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments – are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence.
[AI-11] owards Neuro-symbolic Causal Rule Synthesis Verification and Evaluation Grounded in Legal and Safety Principles
【速读】:该论文旨在解决规则系统在安全关键领域中因目标错位(goal misspecification)、可扩展性差和脆弱性导致的奖励黑客(reward hacking)及形式化验证失败问题。其核心解决方案在于引入一个元层(meta-level layer),由目标/规则合成器(Goal/Rule Synthesizer)与规则验证引擎(Rule Verification Engine)组成,通过迭代方式从人类专家提供的自然语言目标和原则中提炼出形式化的规则理论。该方法利用大语言模型(LLMs)完成目标分解、语义整合、规则翻译与因果集构建,并结合语法校验、逻辑一致性分析与安全不变量检查等机制确保规则质量,从而实现基于法律与安全原则的可追溯、模块化且增量式规则合成。
链接: https://arxiv.org/abs/2604.28087
作者: Zainab Rehan,Christian Medeiros Adriano,Sona Ghahremani,Holger Giese
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:Rule-based systems remain central in safety-critical domains but often struggle with scalability, brittleness, and goal misspecification. These limitations can lead to reward hacking and failures in formal verification, as AI systems tend to optimize for narrow objectives. In previous research, we developed a neuro-symbolic causal framework that integrates first-order logic abduction trees, structural causal models, and deep reinforcement learning within a MAPE-K loop to provide explainable adaptations under distribution shifts. In this paper, we extend that framework by introducing a meta-level layer designed to mitigate goal misspecification and support scalable rule maintenance. This layer consists of a Goal/Rule Synthesizer and a Rule Verification Engine, which iteratively refine a formal rule theory from high-level natural-language goals and principles provided by human experts. The synthesis pipeline employs large language models (LLMs) to: (1) decompose goals into candidate causes, (2) consolidate semantics to remove redundancies, (3) translate them into candidate first-order rules, and (4) compose necessary and sufficient causal sets. The verification pipeline then performs (1) syntax and schema validation, (2) logical consistency analysis, and (3) safety and invariant checks before integrating verified rules into the knowledge base. We evaluated our approach with a proof-of-concept implementation in two autonomous driving scenarios. Results indicate that, given human-specified goals and principles, the pipeline can successfully derive minimal necessary and sufficient rule sets and formalize them as logical constraints. These findings suggest that the pipeline supports incremental, modular, and traceable rule synthesis grounded in established legal and safety principles.
[AI-12] Characterizing the Consistency of the Emergent Misalignment Persona
【速读】:该论文旨在解决生成式 AI(Generative AI)在微调过程中出现的“涌现偏差”(Emergent Misalignment, EM)问题,即模型在特定狭窄领域(如不安全代码、高风险金融建议等)微调后,会泛化出广泛偏离对齐目标的行为。其解决方案的关键在于通过系统性实验设计,对 Qwen 2.5 32B Instruct 模型在六个狭义偏差域上进行微调,并结合有害性评估、自我评估、选项选择、输出识别和分数预测等多项任务,揭示了 EM 人格的两种典型模式:一致人格模型(coherent-persona models)中危害行为与自报偏差强相关;而倒置人格模型(inverted-persona models)则表现出有害输出但自我宣称对齐的矛盾特征。这一发现表明 EM 并非单一稳定人格,而是具有复杂多样性,挑战了以往对 EM 一致性假设的认知。
链接: https://arxiv.org/abs/2604.28082
作者: Anietta Weckauff,Yuchen Zhang,Maksym Andriushchenko
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation between harmful behavior and self-assessment in emergently misaligned models, it remains unclear how consistent this correspondence is across tasks and whether it varies across fine-tuning domains. We characterize the consistency of the EM persona by fine-tuning Qwen 2.5 32B Instruct on six narrowly misaligned domains (e.g., insecure code, risky financial advice, bad medical advice) and administering experiments including harmfulness evaluation, self-assessment, choosing between two descriptions of AI systems, output recognition, and score prediction. Our results reveal two distinct patterns: coherent-persona models, in which harmful behavior and self-reported misalignment are coupled, and inverted-persona models, which produce harmful outputs while identifying as aligned AI systems. These findings reveal a more fine-grained picture of the effects of emergent misalignment, calling into question the consistency of the EM persona.
[AI-13] RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM -Generated Reward Hypotheses
【速读】:该论文旨在解决生成式奖励(LLM-generated rewards)在强化学习中部署时机的不确定性问题,即如何在策略优化过程中识别并验证何时可将生成的奖励作为可靠训练目标进行部署。传统方法多关注奖励候选者的生成或筛选,而忽视了其在不同策略能力水平和训练阶段下的有效性变化。解决方案的关键在于提出一种能力感知的验证与阶段感知的部署协议(RHyVE):通过短Horizon分叉验证(short-horizon fork verification)比较共享策略检查点上的少量奖励假设(reward hypotheses),从而动态判断当前策略能力是否足以使奖励具有信息价值。实验表明,低能力策略下奖励排名不可靠,但超过任务依赖阈值后变得有用;且奖励候选池存在相位依赖性赢家切换现象,说明固定预热调度并非普适最优。因此,RHyVE本质上是一种基于验证信息的部署协议,而非通用调度器,强调奖励生成与部署应作为耦合问题共同研究。
链接: https://arxiv.org/abs/2604.28056
作者: Feiyu Wu,Xu Zheng,Zhuocheng Wang,Yi ming Dai,Hui Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) make reward design in reinforcement learning substantially more scalable, but generated rewards are not automatically reliable training objectives. Existing work has focused primarily on generating, evolving, or selecting reward candidates, while paying less attention to when such candidates can be verified and deployed during policy optimization. We study this deployment-time problem by treating generated rewards as reward hypotheses whose utility depends on the competence of the current policy and the phase of training. We propose \textscRHyVE, a competence-aware verification and phase-aware deployment protocol that compares small sets of reward hypotheses from shared policy checkpoints using short-horizon fork verification. Our experiments show that reward rankings are unreliable at low competence but become informative after task-dependent thresholds. On a sparse manipulation task, phase-aware deployment improves peak and retained performance under a locked protocol. Updated LLM-generated reward-candidate experiments show candidate-family-dependent behavior: generated pools can exhibit phase-dependent winner changes, but no fixed warm-up schedule is universally optimal. Held-out schedule selection, conservative selector baselines, compute-matched controls, and scale controls further show that \textscRHyVE is best understood as a verification-informed deployment protocol rather than a universal scheduler. Dense and all-failure boundary experiments delimit the scope of the method. Together, these results suggest that reward generation and reward deployment should be studied as coupled problems: generated rewards must be verified and deployed under changing policy competence.
[AI-14] PROMISE-AD: Progression-aware Multi-horizon Survival Estimation for Alzheimers Disease Progression and Dynamic Tracking
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)个体化进展预测中面临的多重挑战,包括不规则就诊时间、数据截尾(censoring)、诊断信息泄露(diagnostic leakage)以及多时间窗风险校准问题。其核心解决方案是提出一种无泄漏的生存建模框架——PROMISE-AD,该框架将基线访视转化为包含标准化测量值、缺失掩码、纵向变化特征、时间归一化斜率、访视时机及非诊断类别属性的token表示,并通过时序Transformer融合全局、注意力池化和最新访视表征,输出进展评分与潜在离散时间混合风险模型。训练过程中结合生存似然、分时焦点风险损失、进展排序约束、风险平滑性正则和混合平衡正则,最终在验证集上采用等距校准(isotonic calibration)实现1年、2年、3年和5年风险的精准估计。该方法在CN→MCI和MCI→AD转换任务中均取得领先性能,尤其在MCI→AD预测中展现出接近天花板的5年判别能力(AUROC 0.997),证明了进展感知型生存建模在可解释多时间窗AD转化风险评估中的有效性。
链接: https://arxiv.org/abs/2604.28055
作者: Qing Lyu,Jeremy Hudson,Mohammad Kawas,Yuming Jiang,Chenyu You,Christopher T Whitlow
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Individualized Alzheimer’s disease (AD) progression prediction requires models that use irregular visits, account for censoring, avoid diagnostic leakage, and provide calibrated horizon risks. We propose PROgression-aware MultI-horizon Survival Estimation for Alzheimer’s Disease (PROMISE-AD), a leakage-safe survival framework for predicting conversion from cognitively normal (CN) to mild cognitive impairment (MCI) and from MCI to AD dementia using ADNI/TADPOLE tabular histories. PROMISE-AD converts pre-index visits into tokens with standardized measurements, missingness masks, longitudinal changes, time-normalized slopes, visit timing, and non-diagnostic categorical attributes. A temporal Transformer fuses global, attention-pooled, and latest-visit representations to estimate a progression score and latent discrete-time mixture hazards. Training combines survival likelihood, horizon-specific focal risk loss, progression ranking, hazard smoothness, and mixture-balance regularization, followed by validation-set isotonic calibration for 1-, 2-, 3-, and 5-year risks. In held-out testing across three seeds, PROMISE-AD achieved an integrated Brier score (IBS) of 0.085 \pm 0.012, C-index of 0.808 \pm 0.015, and mean time-dependent AUC of 0.840 \pm 0.081 for CN-to-MCI conversion, yielding the lowest IBS among compared methods. For MCI-to-AD conversion, PROMISE-AD achieved the highest C-index (0.894 \pm 0.018) and near-ceiling 5-year discrimination (AUROC 0.997 \pm 0.003; AUPRC 0.999 \pm 0.001), although some baselines had lower IBS. Ablations and interpretability supported longitudinal change features, fused temporal representations, mixture hazards, cognitive and functional measures, APOE4 status, and recent conversion-proximal visits. These findings suggest that progression-aware survival modeling can provide interpretable multi-horizon AD conversion risk estimates.
[AI-15] o Build or Not to Build? Factors that Lead to Non-Development or Abandonment of AI Systems
【速读】:该论文旨在解决当前负责任人工智能(Responsible AI)研究中对AI系统开发前期决策关注不足的问题,尤其是缺乏对AI非开发(non-development)和中途放弃(abandonment)现象的系统性理解。其关键解决方案在于通过整合学术文献、民间社会资源与灰色文献(如新闻报道和行业报告)进行主题分析,构建了一个包含六类因素的分类体系:伦理关切、利益相关者反馈、开发生命周期挑战、组织动态、资源限制及法律/监管问题;并进一步利用AI事件数据库与从业者调查收集实证数据,比较部署前与部署后导致AI项目终止的关键驱动因素。结果表明,除伦理风险外,诸多非伦理因素同样显著影响组织对AI开发的取舍,从而揭示了现有负责任AI研究在干预节点和实践支持上的盲区,并提出应拓展研究视野以覆盖更广泛的决策杠杆,提升对AI系统开发“不启动”或“中止”的机制性理解与支持能力。
链接: https://arxiv.org/abs/2604.28053
作者: Shreya Chappidi,Jatinder Singh
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted to ACM FAccT 2026
Abstract:Responsible AI research typically focuses on examining the use and impacts of deployed AI systems. Yet, there is currently limited visibility into the pre-deployment decisions to pursue building such systems in the first place. Decisions taken in the earlier stages of development shape which systems are ultimately released, and therefore represent potential, but underexplored, points for intervention. As such, this paper investigates factors influencing AI non-development and abandonment throughout the development lifecycle. Specifically, we first perform a scoping review of academic literature, civil society resources, and grey literature including journalism and industry reports. Through thematic analysis of these sources, we develop a taxonomy of six categories of factors contributing to AI abandonment: ethical concerns, stakeholder feedback, development lifecycle challenges, organizational dynamics, resource constraints, and legal/regulatory concerns. Then, we collect data on real-world case of AI system abandonment via an AI incident database and a practitioner survey to evidence and compare factors that drive abandonment both prior to and following system deployment. While academic responsible AI communities often emphasize ethical risks as reasons to not develop AI, our empirical analysis of these cases demonstrates the diverse, and often non-ethics-related, levers that motivate organizations to abandon AI development. Synthesizing evidence from our taxonomy and related case study analyses, we identify gaps and opportunities in current responsible AI research to (1) engage with the diverse range of levers that influence organizations to abandon AI development, and (2) better support appropriate (dis)engagement with AI system development.
[AI-16] Agent -Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems
【速读】:该论文旨在解决生产环境中文本到SQL(Text-to-SQL, T2SQL)模型评估缺乏有效方法的问题。现有基准测试通常依赖于真实查询(ground-truth queries)和结构化数据库模式(schema),而这些条件在实际部署中往往不可获得,导致T2SQL代理在生产中的质量难以持续监控与改进。解决方案的关键在于提出一种无模式(schema-agnostic)的评估框架STE F(Schema-agnostic Text-to-SQL Evaluation Framework),其仅基于用户自然语言问题、增强重述版本及生成的SQL语句进行评估,无需访问数据库模式或参考查询。STE F通过提取自然语言与SQL的语义规范、执行归一化特征对齐,并结合过滤器对齐、语义判断和评估置信度三个维度生成0–100的可解释准确率分数,从而实现无需依赖schema的持续生产级监控与反馈闭环。
链接: https://arxiv.org/abs/2604.28049
作者: Taslim Jamal Arif,Kuldeep Singh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies whether rule-based SQL matching or schema-dependent semantic parsers assume access to ground-truth queries and structured database schema, constraints that are rarely satisfied in real-world deployments. This disconnect leaves production T2SQL agents largely unevaluated beyond developer-time testing, creating silent quality degradation with no feedback mechanism for continuous improvement. We present STEF (Schema-agnostic Text-to-SQL Evaluation Framework), a production-native evaluation system that operates exclusively on natural language inputs the user question, an enriched reformulation, and the generated SQL without requiring database schema or reference queries. STEF extracts semantic specifications from both natural language and SQL representations, performs normalized feature alignment, and produces an interpretable 0 to 100 accuracy score via a composite metric that encompasses filter alignment, semantic verdict, and confidence of the evaluator. Key contributions include: enriched question quality validation as a first-class evaluation signal, configurable application-specific rule injection via prompt templating, and production-robust normalization handling GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops without schema dependency, making structured query evaluation viable at scale for the first time.
[AI-17] Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents Agents with Subject Matter Experts Developers and Helper Agents
【速读】:该论文旨在解决科学领域中大型语言模型(Large Language Model, LLM)代理开发过程中缺乏系统性和可复用性的问题,即传统试错式方法难以保证代理行为的规范性、可验证性和可维护性。其解决方案的关键在于提出一种结构化的工程方法——协作代理推理工程(Collaborative Agent Reasoning Engineering, CARE),通过三阶段门控流程(stage-gated phases)和可复用的工程产物(如交互需求、推理策略和评估标准),将领域专家(Subject-Matter Experts, SMEs)、开发者与LLM辅助代理协同工作,实现从非结构化领域意图到结构化可审查规格的转化,从而提升复杂查询处理能力和开发效率。
链接: https://arxiv.org/abs/2604.28043
作者: Rahul Ramachandran,Nidhi Jha,Muthukumaran Ramasubramanian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains. Unlike ad-hoc trial-and-error approaches, CARE specifies behavior, grounding, tool orchestration, and verification through reusable artifacts and systematic, stage-gated phases. The methodology employs a three-party workflow involving Subject-Matter Experts (SMEs), developers, and LLM-based helper agents. These helper agents function as facilitation infrastructure, transforming informal domain intent into structured, reviewable specifications for human approval at defined gates. CARE addresses the “jagged technological frontier”, characterized by uneven LLM performance, by bridging the gap between novice and expert analysts regarding domain constraints and verification practices. By generating concrete artifacts, including interaction requirements, reasoning policies, and evaluation criteria, CARE ensures agent behavior is specifiable, testable, and maintainable. Evaluation results from a scientific use case demonstrate that this stage-gated, artifact-driven methodology yields measurable improvements in development efficiency and complex-query performance.
[AI-18] SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在科学光谱理解任务中面临的挑战,尤其是光谱图像(Spectra)因其非结构化和领域特异性带来的高信息密度问题。解决方案的关键在于提出一个名为SpecVQA的专业科学图像基准,涵盖7种代表性光谱类型,并包含由专家标注的620幅图与3100个问答对,以同时评估直接信息提取和领域特定推理能力;此外,为降低token长度并保留关键曲线特征,作者设计了一种光谱数据采样与插值重构方法,实验证明该方法显著提升了模型性能。
链接: https://arxiv.org/abs/2604.28039
作者: Jialu Shen,Han Lyu,Suyang Zhong,Hanzheng Li,Haoyi Tao,Nan Wang,Changhong Chen,Xi Fang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we introduce SpecVQA, a professional scientific-image benchmark for evaluating multimodal models on scientific spectral understanding, covering 7 representative spectrum types with expert-annotated question-answer pairs. The aim comprises two aspects: spectra scientific QA evaluation and corresponding underlying task evaluation. SpecVQA contains 620 figures and 3100 QA pairs curated from peer-reviewed literature, targeting both direct information extraction and domain-specific reasoning. To effectively reduce token length while preserving essential curve characteristics, we propose a spectral data sampling and interpolation reconstruction approach. Ablation studies further confirm that the approach achieves substantial performance improvements on the proposed benchmark. We test the capability of prominent MLLMs in scientific spectral understanding on our benchmark and present a leaderboard. This work represents an essential step toward enhancing spectral understanding in multimodal large models and suggests promising directions for extending visual-language models to broader scientific research and data analysis.
[AI-19] MIFair: A Mutual-Information Framework for Intersectionality and Multiclass Fairness
【速读】:该论文旨在解决机器学习中公平性评估与缓解的挑战,特别是由于伦理复杂性、缺乏统一定义以及需要情境特定的偏差度量所导致的问题。现有方法在处理交叉性(intersectionality)、多类别场景及灵活性和通用性方面仍存在局限。其解决方案的关键在于提出MIFair框架,该框架基于互信息(mutual information)构建统一的公平性评估与缓解机制,通过定义预测变量与敏感属性之间的统计独立性来实现群体公平性,并借助正则化训练策略最小化选定度量下的偏差。MIFair的优势在于其高度灵活性和一致性:能够将多种公平性需求整合为单一框架,支持交叉性、复杂子群结构和多分类任务,从而简化实际应用并提升基准测试的一致性。
链接: https://arxiv.org/abs/2604.28030
作者: Jeanne Monnier,Thomas George,Frédéric Guyard,Christèle Tarnec,Marios Kountouris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Theory (cs.IT)
备注:
Abstract:Fairness in machine learning remains challenging due to its ethical complexity, the absence of a universal definition, and the need for context-specific bias metrics. Existing methods still struggle with intersectionality, multiclass settings, and limited flexibility and generality. To address these gaps, we introduce MIFair, a unified framework for bias assessment and mitigation based on mutual information. MIFair provides a flexible metric template and an in-processing mitigation method inspired by the Prejudice Remover, defining group fairness as statistical independence between prediction-derived variables and sensitive attributes. We further strengthen its information-theoretic foundation by establishing equivalences with widely used fairness notions such as independence and separation. MIFair naturally supports intersectionality, complex subgroup structures, and multiclass classification and employs regularization-based training to reduce bias according to the selected metric. Its key advantage is its versatility: it consolidates diverse fairness requirements into a single coherent framework, enabling consistent benchmarking and simplifying practical use. Experiments on real-world tabular and image datasets show that MIFair effectively reduces bias, including previously unaddressed multi-attribute scenarios, while maintaining strong predictive performance across the evaluated settings.
[AI-20] Design Structure Matrix Modularization with Large Language Models
【速读】:该论文旨在解决设计结构矩阵(Design Structure Matrix, DSM)模块化问题,即如何将系统元素高效地划分为具有高内聚性与低耦合性的模块,这是工程设计中的一个基础组合优化挑战。传统方法通常将其视为纯图优化问题,忽略了嵌入在系统中的工程上下文信息。论文提出了一种基于大语言模型(Large Language Model, LLM)的新型解决方案,通过扩展先前用于DSM排序的LLM方法至五个案例和三种主流LLM架构,在无需编写专门优化代码的情况下,仅用30次迭代即可达到接近参考解的质量。其关键创新在于识别出:在复杂DSM中,领域知识反而会损害性能,这归因于LLM的功能先验与纯粹结构优化目标之间的语义错位,并据此提出了可验证的“语义对齐假设”来解释知识有效性边界。此外,消融实验进一步明确了最优输入表示、目标函数设定及解池设计策略,为LLM在工程设计优化中的实际部署提供了实证依据。
链接: https://arxiv.org/abs/2604.28018
作者: Shuo Jiang,Jianxi Luo
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:
Abstract:Design Structure Matrix (DSM) modularization, the task of partitioning system elements into cohesive modules, is a fundamental combinatorial challenge in engineering design. Traditional methods treat modularization as a pure graph optimization, without access to the engineering context embedded in the system. Building on prior work on LLM-based combinatorial optimization for DSM sequencing, this paper extends the method to modularization across five cases and three backbone LLMs. Our method achieves near-reference quality within 30 iterations without requiring specialized optimization code. Counterintuitively, domain knowledge, beneficial in sequencing, consistently impairs performance on more complex DSMs. We attribute this to semantic misalignment between the LLM’s functional priors and the purely structural optimization objective, and propose the semantic-alignment hypothesis as a testable condition governing knowledge effectiveness with LLMs. Ablation studies identify the most effective input representation, objective formulation, and solution pool design for practical deployment. These findings offer practical guidance for deploying LLMs in engineering design optimization.
[AI-21] Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care
【速读】:该论文旨在解决临床AI系统在实际应用中因医生对AI建议的否决(override)而难以持续优化的问题,尤其关注如何利用这些否决行为作为隐式偏好数据来改进模型性能。其核心挑战在于传统偏好学习方法无法充分捕捉临床决策中的多维因素(如患者状态、组织环境和医生能力),且易受“抑制偏差”(suppression bias)影响——即当医生执行能力不足时,系统会系统性忽略正确但困难的推荐。解决方案的关键在于提出一个扩展的偏好学习框架:首先构建五类否决分类体系以映射不同类型的干预目标;其次引入以患者状态 s、组织上下文 c 和医生能力 κ(分解为执行能力 κexec 与对齐能力 κalign)为条件的偏好建模;最后设计一种双学习架构,通过交替优化奖励模型和能力模型,避免抑制偏差并实现与患者长期轨迹一致的奖励函数学习。该框架特别适用于基于结果支付的慢性病管理场景,因其具备纵向数据密度高、决策空间集中、结果标签明确及自然能力变异等优势。
链接: https://arxiv.org/abs/2604.28010
作者: Prabhjot Singh,Abhishek Gupta,Chris Betz,Abe Flansburg,Brett Ives,Sudeep Lama,Jung Hoon Son
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 2 tables, 1 figure
Abstract:We reframe clinician overrides of clinical AI recommendations as implicit preference data - the same signal structure exploited by reinforcement learning from human feedback (RLHF), but richer: the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable. We present a formal framework extending standard preference learning with three contributions: a five-category override taxonomy mapping override types to distinct model update targets; a preference formulation conditioned on patient state s, organizational context c, and clinician capability kappa, where kappa decomposes into execution capability kappa-exec and alignment capability kappa-align; and a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, preventing a failure mode we term suppression bias-the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold. We argue that chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties-longitudinal density, concentrated decision space, outcome labels, and natural capability variation-and that training environments combining longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics. This framework emerged from operational work to improve clinician capability in a live value-based care deployment.
[AI-22] A Pattern Language for Resilient Visual Agents
【速读】:该论文旨在解决将多模态基础模型(multimodal foundation models)集成到企业生态系统时面临的软件架构挑战,核心矛盾在于视觉语言动作模型(VLA)的高延迟与非确定性难以满足企业控制回路对确定性和实时性能的要求。解决方案的关键在于提出一种面向视觉代理(visual agents)的架构模式语言(architectural pattern language),通过分离快速、确定性的反射行为与慢速、概率性的监督机制,实现系统质量属性的平衡;其核心包含四个设计模式:混合可操作性集成(Hybrid Affordance Integration)、自适应视觉锚定(Adaptive Visual Anchoring)、视觉层次合成(Visual Hierarchy Synthesis)以及语义场景图(Semantic Scene Graph)。
链接: https://arxiv.org/abs/2604.28001
作者: Habtom Kahsay Gidey,Alexander Lenz,Alois Knoll
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted to the 23rd International Conference on Software Architecture (ICSA 2026), New and Emerging Ideas Track. 5 pages, 1 figure
Abstract:Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and non-determinism of vision language action (VLA) models versus the strict determinism and real-time performance required by enterprise control loops. In this study, we propose an architectural pattern language for visual agents that separates fast, deterministic reflexes from slow, probabilistic supervision. It consists of four architectural design patterns: (1) Hybrid Affordance Integration, (2) Adaptive Visual Anchoring, (3) Visual Hierarchy Synthesis, and (4) Semantic Scene Graph.
[AI-23] ITS-Mina: A Harris Hawks Optimization-Based All-MLP Framework with Iterative Refinement and External Attention for Multivariate Time Series Forecasting
【速读】:该论文旨在解决多变量时间序列预测中Transformer架构计算复杂度高、参数冗余的问题,同时提升模型对跨样本全局依赖关系的建模能力与自适应正则化效果。其解决方案的关键在于提出一种全MLP(多层感知机)框架ITS-Mina,包含三项核心创新:(1) 通过迭代精炼机制(iterative refinement mechanism)重复应用共享参数的残差混合器堆栈,在不显著增加参数量的前提下逐步增强时间表示;(2) 引入外部注意力模块(external attention module),以可学习的记忆单元替代传统自注意力机制,实现线性复杂度下的跨样本全局依赖捕捉;(3) 利用哈里斯鹰优化算法(Harris Hawks Optimization, HHO)自动调优Dropout率,实现针对不同数据集的自适应正则化策略。
链接: https://arxiv.org/abs/2604.27981
作者: Pourya Zamanvaziri,Amirhossein Sadr,Aida Pakniyat,Dara Rahmati
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 2 figures, 3 tables, 4 algorithms
Abstract:Multivariate time series forecasting plays a pivotal role in numerous real-world applications, including financial analysis, energy management, and traffic planning. While Transformer-based architectures have gained popularity for this task, recent studies reveal that simpler MLP-based models can achieve competitive or superior performance with significantly reduced computational cost. In this paper, we propose ITS-Mina, a novel all-MLP framework for multivariate time series forecasting that integrates three key innovations: (1) an iterative refinement mechanism that progressively enhances temporal representations by repeatedly applying a shared-parameter residual mixer stack, effectively deepening the model’s computational capacity without multiplying the number of distinct parameters; (2) an external attention module that replaces traditional self-attention with learnable memory units, capturing cross-sample global dependencies at linear computational complexity; and (3) a Harris Hawks Optimization (HHO) algorithm for automatic dropout rate tuning, enabling adaptive regularization tailored to each dataset. Extensive experiments on six widely-used benchmark datasets demonstrate that ITS-Mina achieves state-of-the-art or highly competitive performance compared to eleven baseline models across multiple forecasting horizons.
[AI-24] D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
【速读】:该论文旨在解决科学数据驱动发现领域中语言模型与智能体(Agent)能力提升受限于缺乏可验证环境的问题。现有研究虽在生成式 AI(Generative AI)和智能体方面取得进展,但因缺少真实科学场景下具备明确评估逻辑的可执行环境,导致模型训练与评测缺乏可靠依据。解决方案的关键在于提出 D3-Gym——首个自动构建的、包含可验证环境的科学数据驱动发现基准数据集,其核心特征包括:565 个来自 239 个真实科学仓库的跨学科任务,每个任务均配有自然语言指令、预装依赖的可执行环境、输入数据集与结果预览、参考代码解法及自动合成的评估脚本。实证表明,这些评估脚本与人工标注黄金标准一致率达 87.5%,且在领域特定逻辑上高度对齐,从而为科学智能体提供了高质量的验证信号,并显著提升了 Qwen3 系列模型在 ScienceAgentBench 上的表现。
链接: https://arxiv.org/abs/2604.27977
作者: Hanane Nour Moussa,Yifei Li,Zhuoyang Li,Yankai Yang,Cheng Tang,Tianshu Zhang,Nesreen K. Ahmed,Ali Payani,Ziru Chen,Huan Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific this http URL fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at this https URL.
[AI-25] From Mirag e to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在将电路图(circuit diagram)转化为寄存器传输级(Register-Transfer-Level, RTL)代码时存在的“幻影缺陷”(Mirage)问题,即模型在面对空白图像时仍能保持高准确率,其根本原因在于模型依赖模块头中的标识符语义而非视觉输入进行推理,导致生成结果缺乏真正的视觉感知能力。解决方案的关键在于提出VeriGround(4B参数规模),通过三重机制实现可靠视觉接地:1)训练阶段对标识符进行匿名化处理以消除语义偏倚;2)引入拒绝增强(refusal augmentation)提升模型在无法确定时的拒绝能力;3)采用决策导向的ORPO(D-ORPO)偏好对齐方法,强化生成或拒绝决策点上的关键token权重。实验表明,VeriGround在正常模式下功能性通过率(Functional Pass@1)达46.11%,匿名模式下仍保持42.51%,显著优于其他基线模型,且虚假拒绝率极低,验证了其对视觉输入的真实依赖性与可靠性。
链接: https://arxiv.org/abs/2604.27969
作者: Guang Yang,Xing Hu,Xiang Chen,Xin Xi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) are increasingly used to translate visual artifacts into code, from UI mockups into HTML to scientific plots into Python scripts. A circuit diagram can be viewed as a visual domain-specific language for hardware: it encodes timing, topology, and bit level semantics that are invisible to casual inspection yet safety critical once fabricated in silicon. Translating such diagrams into register-transfer-level(RTL) code therefore represents an extreme reliability test for vision-to-code generation. We reveal a phenomenon we call Mirage: replacing a circuit diagram with a blank image leaves Pass@k unchanged or even higher, because models bypass the visual input and instead exploit identifier semantics in the module header to retrieve canonical RTL templates. This constitutes a new, highly covert class of defect in AI-assisted code generation that directly undermines MLLMs’ trustworthiness. To quantify the effect, we construct C2VEVAL and evaluate eight MLLMs under a paired Normal/Anony protocol in which Anony mode anonymizes all identifiers in both the diagram and the module header; Anony-mode scores drop sharply across all models, confirming that high Normal-mode accuracy is largely a Mirage. We then propose VeriGround (4B), trained with identifier anonymization, refusal augmentation, and D-ORPO (Decision-Focused ORPO) preference alignment that up-weights pivotal generate-or-refuse tokens. VeriGround achieves Functional Pass@1 of 46.11%/42.51%(Normal/Anony) with a False Refusal Rate of only 1.20%/0.00%, while maintaining 92% Refusal Rate on blank images. With only 4B parameters, VeriGround performs on par with GPT-5.4 under Normal and significantly outperforms all baselines under Anony, confirming genuine visual grounding.
[AI-26] Splitting Assumption-Based Argumentation Frameworks KR2025
【速读】:该论文旨在解决假设基论证(Assumption-Based Argumentation, ABA)中核心推理任务的高计算复杂性问题,尤其是在ABA框架(ABAFs)被实例化为基于图的论证形式如Dung的论证框架(AFs)和带有集体攻击的论证框架(SETAFs)时,这种复杂性进一步加剧。传统方法如分割(splitting)虽在AFs中成功应用,但其在ABAFs中的直接应用可能因实例化导致的指数级增长而失效。论文的关键解决方案是将分割策略从图结构层面转移到知识库层面,并提出适用于ABAFs的参数化分割方法,从而在不依赖图实例化的情况下实现更高效的推理优化。
链接: https://arxiv.org/abs/2604.27964
作者: Giovanni Buraglio,Wolfgang Dvorak,Stefan Woltran
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at KR 2025
Abstract:Assumption-Based Argumentation (ABA) is a well-established formalism for modelling and reasoning over debates, with a wide range of applications. However, the high computational complexity of core reasoning tasks in ABA poses a significant challenge for its applicability. This issue is further aggravated when ABA frameworks (ABAFs) are instantiated into graph-based argumentation formalisms, such as Dung’s Argumentation Frameworks (AFs) and Argumentation Frameworks with Collective Attacks (SETAFs). In knowledge representation and reasoning, a key strategy to address computational intractability is to optimise reasoning over a given knowledge base through divide-and-conquer algorithms. A paradigmatic example of this approach is splitting, where extensions of a given framework are computed incrementally, by restricting the search space to sub-frameworks only, and then combining the obtained results. This approach has been successfully applied to AFs, for which also a parametrised version has been introduced under stable semantics. However, the exponential growth produced by the instantiation might undermine the usefulness of splitting on the argument graphs induced by ABAFs. To address this issue, our work investigates the concept of splitting on the knowledge base rather than on its graph-based instantiation. Furthermore, we generalise splitting to its parametrised version for ABAFs.
[AI-27] LLM s as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中面临的高计算成本、逻辑不一致性和性能急剧下降的问题。现有基于单调逻辑(如SMT)的神经符号方法无法建模人类认知中至关重要的默认推理(defeasible reasoning),限制了其在现实场景中的应用。解决方案的关键在于提出“LLM+ASP”框架,该框架将自然语言自动转化为答案集编程(Answer Set Programming, ASP)——一种基于稳定模型语义(stable model semantics)的非单调形式化系统,从而天然支持默认规则与例外处理;同时引入自动化自我修正循环机制,利用ASP求解器提供的结构化反馈实现迭代优化,无需任何任务特定工程设计即可在多种推理任务上取得显著性能提升。
链接: https://arxiv.org/abs/2604.27960
作者: Adam Ishay,Joohyung Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages
Abstract:Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these issues by coupling LLMs with symbolic reasoners, existing approaches typically rely on monotonic logics (e.g., SMT) that cannot represent defeasible reasoning – essential components of human cognition. We present “LLM+ASP,” a framework that translates natural language into Answer Set Programming (ASP), a nonmonotonic formalism based on stable model semantics. Unlike prior “LLM+ASP” approaches that require manually authored knowledge modules, domain-specific prompts, or evaluation restricted to single problem classes, our framework operates without any per-task engineering and applies uniformly across diverse reasoning tasks. Our system utilizes an automated self-correction loop where structured feedback from the ASP solver enables iterative refinement. Evaluating across six diverse benchmarks, we demonstrate that: (1) stable model semantics allow LLMs to naturally express default rules and exceptions, outperforming SMT-based alternatives by significant margins on nonmonotonic tasks; (2) iterative self-correction is the primary driver of performance, effectively replacing the need for handcrafted domain knowledge; (3) compact in-context reference guides substantially outperform verbose documentation, revealing a “context rot” phenomenon where excessive context hinders constraint adherence.
[AI-28] Attractor FCM
【速读】:该论文旨在解决传统模糊认知图(Fuzzy Cognitive Map, FCM)在建模动态系统时收敛性差、误差优化不精确以及缺乏物理约束等问题。其解决方案的关键在于提出一种基于梯度下降、受物理规律约束的雅可比(Jacobian)版本FCM,该模型通过引入残差记忆(residual memory)、时间反向传播(back propagation through time)和固定点锚定机制(fixed point anchor),实现权重的递归更新与系统记忆保持;同时采用牛顿法求解系统的固定点吸引子,并结合自适应项动态调整梯度下降路径,以避免因Sigmoid饱和导致的局部极小值早熟收敛;此外,利用因果掩码(causal mask)对更新过程进行过滤,确保网络遵循初始专家知识所定义的物理规则,从而高效降低目标误差。
链接: https://arxiv.org/abs/2604.27947
作者: Alexis Kafantaris
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:
Abstract:In this paper an attractor FCM is created, tested, and analyzed. This FCM is neither a hebbian based nor agentic, nor a hybrid; it rather is a gradient descent based, physics constrained, Jacobian version of an FCM. Moreover, this model has several quirks; it uses residual memory, back propagation through time, and a fixed point anchor that is recursively implemented to update its weights. The residuals update the recursive part without losing the system memory. The model’s anchor enables it to converge in a fixed point for which back propagation through time unrolls it and ensures that the error minimization is for an accurate gradient. Furthermore, a new learning algorithm is utilized. The Newton’s method finds the system’s fixed point attractor and then gradient descend is adaptively changing the landscape; an adaptive term is used to directly manipulate the weights through the attractor dynamics. As the adaptive term changes, the descent through the landscape is constantly adjusting according to sigmoid saturation, and that prevents premature convergence to a local minimum. Lastly, the updates are filtered by causal mask that informs the network about the physics, respecting the initial expert based opinions, for which model reduces the error to the target in an efficient way.
[AI-29] A Collective Variational Principle Unifying Bayesian Inference Game Theory and Thermodynamics
【速读】:该论文试图解决多智能体系统中集体智能(Collective Intelligence)的统一理论框架问题,即如何在无中心协调的情况下,从个体自适应行为中涌现出协同决策与博弈均衡。其解决方案的关键在于提出“博弈论自由能原理”(Game-Theoretic Free Energy Principle),通过将个体局部自由能最小化与随机博弈(Stochastic Game)结构相联系,证明在有限理性与局部信息约束下,集体自由能的稳态点对应于近似纳什均衡(Nash Equilibrium)。进一步地,该框架揭示了合作博弈可被表述为变分形式,其中均衡状态表现为联盟上的吉布斯分布(Gibbs Distribution),从而建立起贝叶斯推断与策略互动之间的桥梁,并引入哈桑尼红利(Harsanyi Dividend)的自由能表述以刻画高阶多智能体协同效应,最终提供了一个可验证的关于感知精度与代理影响力非单调关系的预测模型,适用于神经、生物和人工多智能体系统。
链接: https://arxiv.org/abs/2604.27942
作者: Djamel Bouchaffra,Faycal Ykhlef,Mustapha Lebbah,Hanane Azzag
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to Nature. 21 pages, 4 figures. Code and data available at this https URL
Abstract:Collective intelligence emerges across biological, physical, and artificial systems without central coordination, yet a unifying principle governing such behaviour remains elusive. The Free Energy Principle explains how individual agents adapt through variational inference, while game theory formalises strategic interactions. Here we introduce the Game-Theoretic Free Energy Principle, a unified framework showing that multi-agent systems performing local free-energy minimisation implicitly implement a stochastic game. We prove that, under bounded rationality and local information constraints, stationary points of collective free energy correspond to approximate Nash equilibria of an induced game. Conversely, a broad class of cooperative games admits a variational representation in which equilibria arise as Gibbs distributions over coalitions, establishing a bridge between Bayesian inference and strategic interaction. To characterise higher-order effects, we introduce a free-energy formulation of the Harsanyi dividend, isolating irreducible multi-agent synergy. This yields a predictive theory of cooperation, including a falsifiable non-monotonic relationship between sensory precision and agent influence. We validate this prediction across neural, biological, and artificial multi-agent systems. These results identify a common variational principle underlying inference, thermodynamics, and game-theoretic equilibrium.
[AI-30] aming the Centaur(s) with LAPITHS: a framework for a theoretically grounded interpretation of AI performances
【速读】:该论文旨在解决当前人工智能研究中过度依赖行为主义范式,将基于Transformer的生成式AI(Generative AI)在人类水平任务上的表现误认为是具备类人认知能力的问题。其解决方案的关键在于提出LAPITHS框架,该框架通过两个定量评估手段提供理论与实证依据:一是“最小认知网格”(Minimal Cognitive Grid),用于量化评估人工系统认知合理性;二是行为对比实验,证明类似CENTAUR模型的结果可由不满足认知结构约束且缺乏独立解释力的其他系统复现,从而质疑现有模型的认知真实性。
链接: https://arxiv.org/abs/2604.27927
作者: Matteo Da Pelo,Alessio Donvito,Claudio Frongia,Pietro Salis,Antonio Lieto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages
Abstract:We introduce a framework called LAPITHS (Language model Analysis through Paradigm grounded Interpretations of Theses about Human likenesS) and use it to show that several major claims advanced by models such as CENTAUR, proposed as an artificial Unified Model of Cognition, are not theoretically or empirically justified. LAPITHS provides a principled reference point for counteracting the current behaviouristic tendency in AI research to interpret the human level performances of transformer based language models as evidence of human like underlying computation and, by extension, as signs of cognitive abilities. The novelty of LAPITHS lies in making explicit the arguments grounded in two quantitative assessments: (i) the Minimal Cognitive Grid, a theoretically motivated method for estimating the cognitive plausibility of artificial systems, and (ii) a behavioural comparison showing that results similar to those reported for CENTAUR like models can be reproduced by other systems that do not satisfy the structural constraints typically associated with cognitive plausibility, and whose outputs do not provide independent explanatory insight into human cognition.
[AI-31] Simulating clinical interventions with a generative multimodal model of human physiology
【速读】:该论文旨在解决医学中个体健康状态随时间变化的机制不明确,以及对干预措施响应存在个体差异的核心挑战。其解决方案的关键在于提出HealthFormer——一种基于Transformer架构的生成式模型,通过在超过15,000名深度表型个体的多访视数据上训练,将每位参与者跨667项生理指标(涵盖血液生物标志物、体成分、睡眠生理学、连续葡萄糖监测、肠道微生物组、可穿戴设备衍生生理参数及行为与药物暴露等七个领域)的时间轨迹建模为序列token,并以生成式目标进行预测。该模型无需任务特定微调即可迁移至多个独立队列,在27/30种新发疾病和死亡终点预测中表现优于现有临床风险评分;同时支持无创模拟干预效果,例如在个性化营养试验中准确预测个体六个月内生物标志物变化(如舒张压的皮尔逊相关系数r=0.78),并在41个随机对照试验结果中全部匹配干预效应方向,且30例预测均值落入报告95%置信区间。因此,HealthFormer被定位为首个健康世界模型,使预测、风险分层和干预条件模拟成为统一查询任务,为构建临床数字孪生奠定基础。
链接: https://arxiv.org/abs/2604.27899
作者: Guy Lutsker,Gal Sapir,Jordi Merino,Smadar Shilo,Anastasia Godneva,Eli Meirom,Shie Mannor,Hagai Rossman,Gal Chechik,Eran Segal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding how human health changes over time, and why responses to interventions vary between individuals, remains a central challenge in medicine. Here we present HealthFormer, a decoder-only transformer that models the human physiological trajectory generatively, by training on data from the Human Phenotype Project, a multi-visit cohort of over 15,000 deeply phenotyped individuals. We tokenise each participant’s health trajectory across 667 measurements spanning seven domains: blood biomarkers, body composition, sleep physiology, continuous glucose monitoring, gut microbiome, wearable-derived physiology, and behaviour and medication exposure. We train HealthFormer to forecast individual physiological trajectories across these domains, and from this single generative objective a range of clinically relevant tasks can be expressed as queries on the model. We show that, without task-specific training, HealthFormer transfers to four independent cohorts and improves prediction for 27 of 30 incident-disease and mortality endpoints, exceeding established clinical risk scores in every comparison. We further show that the model can simulate interventions in silico: in a held-out personalised-nutrition trial, intervention-conditioned predictions recover individual six-month biomarker changes (e.g., Pearson r = 0.78 for diastolic blood pressure). Across 41 randomised intervention-outcome comparisons drawn from published trials, our results show that the predicted direction of effect agrees in every case, and the predicted mean falls within the reported 95% confidence interval in 30 cases. We position HealthFormer as an initial health world model, from which forecasting, risk stratification, and intervention-conditioned simulation arise as queries, providing a basis for clinical digital twins.
[AI-32] Graph World Models: Concepts Taxonomy and Future Directions
【速读】:该论文旨在解决传统基于扁平张量的世界模型(world models)在噪声敏感性、误差累积以及推理能力薄弱等方面的局限性。其解决方案的关键在于提出并系统化地统一了图世界模型(Graph World Models, GWMs)这一新兴研究范式,通过引入关系归纳偏置(relational inductive biases, RIB)对GWMs进行分类:(1) 空间RIB用于拓扑抽象,(2) 物理RIB用于动态模拟,(3) 逻辑RIB用于因果与语义推理。该方法通过将环境结构化为实体节点与交互边的图结构,在更符合物理和认知规律的空间中建模虚拟环境,从而提升预测准确性与规划效率。
链接: https://arxiv.org/abs/2604.27895
作者: Jiawei Liu,Senqiao Yang,Mingjun Wang,Yu Wang,Bei Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As one of the mainstream models of artificial intelligence, world models allow agents to learn the representation of the environment for efficient prediction and planning. However, classical world models based on flat tensors face several key problems, including noise sensitivity, error accumulation and weak reasoning. To address these limitations, many recent studies use graph structure to decompose the environment into entity nodes and interactive edges, and model virtual environments in a structured space. This paper systematically formalizes and unifies these emerging graph-based works under the concept of graph world models (GWMs). To the best of our knowledge, GWMs have not yet been explicitly defined and surveyed as a unified research paradigm. Furthermore, we propose a taxonomy based on relational inductive biases (RIB), categorizing GWMs by the specific structural priors they inject: (1) spatial RIB for topological abstraction; (2) physical RIB for dynamic simulation; and (3) logical RIB for causal and semantic reasoning. For each model category, we outline the key design principles, summarize representative models, and conduct comparative analyses. We further discuss open challenges and future directions, including dynamic graph adaptation, probabilistic relational dynamics, multi-granularity inductive biases, and the need for dedicated benchmarks and evaluation metrics for GWMs.
[AI-33] In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks
【速读】:该论文旨在解决多轮对话中任务执行效率与准确性的问题,特别是针对需要遵循固定流程的程序性任务(procedural tasks),传统外部调度框架(如LangGraph、CrewAI等)因引入复杂的状态跟踪和路由逻辑而可能降低性能。其解决方案的关键在于采用“上下文内”(in-context)的方法,即将整个任务流程直接嵌入系统提示(system prompt)中,使大语言模型(LLM)能够自主完成任务调度与决策,无需额外的外部协调器。实验证明,在旅行预订、Zoom技术支持和保险理赔处理三个场景下,该方法在五项质量指标上均优于使用LangGraph等外部调度框架的版本,且错误率显著更低,表明当前前沿大模型已具备足够的推理能力来替代传统的分层调度架构。
链接: https://arxiv.org/abs/2604.27891
作者: Simon Dennis,Michael Diamond,Rivaan Patil,Kevin Shabahang,Hao Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages
Abstract:Agent orchestration frameworks – LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, and others – place an external orchestrator above the LLM, tracking state and injecting routing instructions at every turn. We present a controlled comparison showing that for procedural tasks, this architecture is dominated by a simpler alternative: putting the entire procedure in the system prompt and letting the model self-orchestrate. Across three domains – travel booking (14 nodes), Zoom technical support (14 nodes), and insurance claims processing (55 nodes) – we evaluate 200 conversations per condition using LLM-as-judge scoring on five quality criteria. The in-context approach scores 4.53–5.00 on a 5-point scale while a LangGraph orchestrator using the same model scores 4.17–4.84. The orchestrated system fails on 24% of travel, 9% of Zoom, and 17% of insurance conversations, compared to 11.5%, 0.5%, and 5% for the in-context baseline. While external orchestration may have been necessary for earlier models, advances in frontier model capabilities have made it unnecessary for multi-turn conversations following a defined procedure.
[AI-34] Modeling Clinical Concern Trajectories in Language Model Agents
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在临床环境中部署时存在的“突变式”行为问题,即其决策往往依赖于阈值触发,缺乏对风险累积过程的渐进性感知,难以支持临床医生基于持续担忧趋势进行早期干预。解决方案的关键在于引入一种轻量级代理架构,通过将无记忆的临床风险编码器与一阶和二阶状态动力学相结合,生成连续的“升级压力信号”,从而显式建模风险随时间演变的过程。该方法在不转移临床决策权的前提下,使LLM代理能够产生平滑且具有前瞻性的担忧轨迹,提前暴露持续上升的风险信号,提升临床可解释性与人机协同干预的有效性。
链接: https://arxiv.org/abs/2604.27872
作者: Sukesh Subaharan,Venkatesan VS,Murugadasan P,Sivakumar D,Gautham N,Ganeshkumar M
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold-driven behavior, offering little visibility into accumulating risk prior to escalation. In real-world care, however, clinicians act on gradually rising concern rather than instantaneous triggers. We study whether explicit state dynamics can expose such pre-escalation signals without delegating clinical authority to the agent. We introduce a lightweight agent architecture in which a memoryless clinical risk encoder is integrated over time using first- and second-order dynamics to produce a continuous escalation pressure signal. Across synthetic ward scenarios, stateless agents exhibit sharp escalation cliffs, while second-order dynamics produce smooth, anticipatory concern trajectories despite similar escalation timing. These trajectories surface sustained unease prior to escalation, enabling human-in-the-loop monitoring and more informed intervention. Our results suggest that explicit state dynamics can make LLM agents more clinically legible by revealing how long concern has been rising, not just when thresholds are crossed.
[AI-35] KellyBench: A Benchmark for Long-Horizon Sequential Decision Making
【速读】:该论文旨在解决当前语言模型在长期、非平稳环境中进行序列决策时表现不足的问题,尤其是在开放目标场景下(如体育博彩市场)缺乏有效的评估框架与优化策略。其解决方案的关键在于构建了一个名为KellyBench的仿真环境,模拟2023-24赛季英格兰超级联赛的完整赛季过程,要求智能体基于历史数据(包括高级统计、首发阵容和公众赔率)最大化长期资金增长。该环境迫使模型不仅需建立机器学习模型识别市场偏差,还需动态适应环境变化,从而推动对复杂决策能力的系统性评估。实验表明,现有前沿模型普遍亏损,最优模型平均回报率为-8%,且多数遭遇破产,凸显出当前方法在策略复杂性和鲁棒性上的显著不足。
链接: https://arxiv.org/abs/2604.27865
作者: Thomas Grady,Kip Parker,Iliyan Zarov,Henry Course,Chengxi Taylor,Ross Taylor
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season for five seeds. The best performing model achieves an average return of -8%, and many models experiencing ruin across seeds. To judge strategy sophistication, we use a human expert rubric to grade each model and find their approaches to be unsophisticated compared to human baselines; Claude Opus 4.6 achieves a rubric score of 26.5%, which means there is significant room for improvement. KellyBench is available as an open-access API endpoint at this https URL.
[AI-36] Rethinking Agent ic Reinforcement Learning In Large Language Models
【速读】:该论文旨在解决传统强化学习(Reinforcement Learning, RL)在面对复杂、开放性任务时的局限性,即其通常依赖预定义奖励函数和狭隘环境,难以实现自主目标设定、长期规划与动态策略调整。解决方案的关键在于引入基于大语言模型(Large Language Models, LLMs)的代理式强化学习(Agentic RL)框架,该框架将元推理(meta-reasoning)、自我反思(self-reflection)和多步决策等类认知能力嵌入学习循环中,从而赋予智能体在不确定真实环境中进行交互式推理与自适应行为的能力。
链接: https://arxiv.org/abs/2604.27859
作者: Fangming Cui,Ruixiao Zhu,Cheng Fang,Sunan Li,Jiahong Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.
[AI-37] AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework
【速读】:该论文旨在解决生成式 AI (Generative AI) 推理任务在地理分布场景下如何有效迁移计算以优化电力成本与碳排放的问题。其核心挑战在于:AI 推理负载虽可脱离用户服务位置执行,但受限于延迟、状态本地性、算力容量及法规等约束条件,需明确何时数字计算迁移可转化为电力需求的地理重构。解决方案的关键在于构建一个三层次能量-地理框架(energy-geography framework),将推理放置建模为多约束优化问题,并提出“能源-延迟前沿”(energy-latency frontier)作为核心指标——量化放宽延迟预算所能释放的边际成本节约与碳减排收益;同时引入可迁移推理需求、延迟回报率等操作指标和迁移阈值条件,揭示不同延迟容忍度如何划分本地、区域与能源导向的执行层级,从而指导在迁移摩擦、出站成本、法律限制等因素影响下的最优部署策略。
链接: https://arxiv.org/abs/2604.27855
作者: Xubin Luo,Yang Cheng
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 29 pages, 3 figures, 8 tables; preprint
Abstract:AI inference is becoming a persistent and geographically distributed source of electricity demand. Unlike many traditional electrical loads, inference workloads can sometimes be executed away from the user-facing service location, provided that latency, state locality, capacity, and regulatory constraints remain acceptable. This paper studies when such digital relocation of computation can be interpreted as latency-constrained relocation of electricity demand. We develop an energy-geography framework for geo-distributed AI inference. The framework models a three-layer architecture of clients, service nodes, and compute nodes, and formulates inference placement as a constrained optimization problem over electricity prices, marginal carbon intensity, power usage effectiveness, compute capacity, network latency, and migration frictions. The key object is the energy-latency frontier: the marginal cost and carbon benefit unlocked by relaxing inference latency budgets. The paper makes four contributions. First, it distinguishes physical electricity transmission from digital relocation of electricity-consuming computation. Second, it formulates a geo-distributed inference placement model with feasibility masks and migration frictions. Third, it introduces operational metrics, including relocatable inference demand, energy return on latency, carbon return on latency, and a relocation break-even condition. Fourth, it provides a transparent stylized simulation over representative global compute regions to show how heterogeneous latency tolerance separates workloads into local, regional, and energy-oriented execution layers. The results show that latency relaxation expands feasible geography, while migration frictions, egress costs, state locality, legal constraints, and capacity limits can sharply reduce realized benefits. Comments: 29 pages, 3 figures, 8 tables; preprint Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.27855 [cs.DC] (or arXiv:2604.27855v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.27855 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xubin Luo [view email] [v1] Thu, 30 Apr 2026 13:40:26 UTC (32 KB)
[AI-38] A Grid-Aware Agent -Based Model for Analyzing Electric Vehicle Charging Systems
【速读】:该论文旨在解决电动汽车(Electric Vehicle, EV)充电系统在不同基础设施配置和运行条件下,如何协同分析用户级充电行为与设施层面功率特性的问题。其核心挑战在于现有模型难以同时捕捉EV用户的异质性行为、充电设备约束以及电网侧的功率分配机制。解决方案的关键在于构建一个可配置、网格感知的基于智能体的模型(Agent-Based Model, ABM),通过集成多类型EV行为、充电柱约束及共享能源沙盒(Energy Sandbox)实现聚合功率动态调控,从而在事件驱动的仿真框架下,系统评估充电策略对服务性能、设施利用率和负荷特征的影响,为后续高级协调策略研究提供方法论基础。
链接: https://arxiv.org/abs/2604.27849
作者: Khalil Al-Rahman Youssefi,Marija Gojkovic,Walter Stefanutti,Mika Auer,Melanie Schranz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the author’s version of a paper submitted to SIMULTECH. 12 Pages, 1 Table, 10 Figures
Abstract:This paper presents a configurable, grid-aware Agent-Based Model (ABM) for the systematic analysis of electric vehicle (EV) charging systems under configurable infrastructure and operational conditions. The model integrates heterogeneous EV behavior, charging column constraints, and a shared Energy Sandbox that regulates aggregate power allocation, enabling the joint study of user-centric charging dynamics and facility-level power behavior. Implemented in Python using the SimPy discrete-event framework, the approach supports scalable, event-driven simulations across varying system sizes, charger compositions, and scheduling strategies. A representative workplace charging scenario is investigated to illustrate how infrastructure configuration and coordination mechanisms influence energy delivery performance, infrastructure utilization, and aggregate load characteristics. The results highlight the context-dependence of infrastructure suitability and demonstrate how charging strategies and charger types reshape both service-level outcomes and grid-facing behavior. The proposed ABM provides a flexible and extensible simulation environment for exploring technical, operational, and grid-aware aspects of EV charging ecosystems, and for serving as a methodological basis for subsequent studies on advanced coordination strategies beyond the specific scenario analyzed in this study.
[AI-39] CastFlow: Learning Role-Specialized Agent ic Workflows for Time Series Forecasting
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的时间序列预测方法普遍采用静态生成范式所带来的局限性,包括有限的时序模式提取能力、单轮上下文特征获取、一次性预测生成以及缺乏集成预测支持等问题。其解决方案的关键在于提出一种动态代理式预测框架 CastFlow,该框架通过规划-行动-预测-反思的代理工作流实现多视角时序模式提取与多轮上下文特征获取,并引入记忆模块和多视角工具包以提供可靠的集成预测基准;同时采用角色专业化设计,将冻结的通用LLM与微调后的领域特定LLM相结合,后者基于集成预测基线进行证据引导的数值预测,而非从零开始建模。此外,为优化领域特定LLM,进一步设计了结合监督微调(Supervised Fine-Tuning, SFT)与可验证奖励强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)的两阶段任务导向训练流程,从而显著提升预测准确性与适应性。
链接: https://arxiv.org/abs/2604.27840
作者: Bokai Pan,Mingyue Cheng,Zhiding Liu,Shuo Yu,Xiaoyu Tao,Yuchong Wu,Qi Liu,Defu Lian,Enhong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, large language models (LLMs) have shown great promise in time series forecasting. However, most existing LLM-based forecasting methods still follow a static generative paradigm that directly maps historical observations to future values in a single pass. Under this paradigm, forecasting is constrained by limited temporal pattern extraction, single-round acquisition of contextual features, one-shot forecast generation, and lack of support from ensemble forecasts. To address these limitations, in this work, we propose CastFlow, a dynamic agentic forecasting framework that enables multi-view temporal pattern extraction, multi-round contextual features acquisition, iterative forecast refinement, and forecasting with ensemble forecasts. First, CastFlow organizes the forecasting process into planning, action, forecasting, and reflection, establishing an agentic workflow. Second, this workflow is supported by a memory module that retrieves prior experience and a multi-view toolkit that constructs diagnostic evidence and provides a reliable ensemble forecast baseline. Third, CastFlow adopts a role-specialized design that combines general-purpose reasoning with specialized numerical forecasting. Under this design, a frozen LLM preserves general-purpose reasoning, while a fine-tuned domain-specific LLM performs evidence-guided numerical forecasting based on the ensemble forecast baseline, rather than from scratch. To optimize a fine-tuned domain-specific LLM, we further develop a two-stage workflow-oriented training that combines supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). To evaluate the effectiveness of CastFlow, we conduct extensive experiments on diverse datasets and show that it achieves superior overall results against strong baselines. We hope that this work can serve as a step toward more adaptive and accurate time series forecasting.
[AI-40] MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents
【速读】:该论文旨在解决多服务器模型协作代理(Multi-server MCP agents)中因信息流控制不当而导致的非恶意凭证传播问题,即个体无害的读写权限在跨边界工具组合时可能引发结构化侧信道泄露,而这种泄露并非由恶意模型行为引起,而是由工作流拓扑结构导致的副作用。解决方案的关键在于提出MCPHunt基准测试框架,其核心创新包括:基于“canary”标记的污点追踪机制,将凭证传播检测转化为客观字符串匹配;环境可控的覆盖率设计,通过风险性、良性及硬负样本条件验证流水线正确性并排除凭证格式混淆因素;以及CRS分层策略,区分任务必需的凭证传递(忠实执行原样传输指令)与违反策略的传播(本可删除却未删除凭证)。实证结果显示,政策违规传播率在11.5%至41.3%之间,且高度依赖路径特征和浏览器中介数据流,提示仅靠提示层防御难以完全抑制传播,需结合任务理解能力优化。
链接: https://arxiv.org/abs/2604.27819
作者: Haonan Li,Tianjun Sun,Yongqing Wang,Qisheng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 1 figure, 16 tables. Code: this https URL Data: this https URL
Abstract:Multi-server MCP agents create an information-flow control problem: faithful tool composition can turn individually benign read/write permissions into cross-boundary credential propagation – a structural side effect of workflow topology, not necessarily malicious model behavior. We present MCPHunt, to our knowledge the first controlled benchmark that isolates non-adversarial, verbatim credential propagation across multi-server MCP trust boundaries, with three methodological contributions: (1) canary-based taint tracking that reduces propagation detection to objective string matching; (2) an environment-controlled coverage design with risky, benign, and hard-negative conditions that validates pipeline soundness and controls for credential-format confounds; (3) CRS stratification that disentangles task-mandated propagation (faithful execution of verbatim-transfer instructions) from policy-violating propagation (credentials included despite the option to redact). Across 3,615 main-benchmark traces from 5 models spanning 147 tasks and 9 mechanism families, policy-violating propagation rates reach 11.5–41.3% across all models. This propagation is pathway-specific (25x cross-mechanism range) and concentrated in browser-mediated data flows; hard-negative controls provide evidence that production-format credentials are not necessary – prompt-directed cross-boundary data flow is sufficient. A prompt-mitigation study across 3 models reduces policy-violating propagation by up to 97% while preserving 80.5% utility, but effectiveness varies with instruction-following capability – suggesting that prompt-level defenses alone may not suffice. Code, traces, and labeling pipeline are released under MIT and CC BY 4.0.
[AI-41] Focus Session: Autonomous Systems Dependability in the era of AI: Design Challenges in Safety Security Reliability and Certification
【速读】:该论文旨在解决嵌入式安全关键系统(如下一代汽车和自动驾驶平台)在面对日益增长的系统复杂性、软硬件异构性以及智能数据驱动组件集成时,如何保障其可靠性和可认证性的难题。传统方法难以应对由人工智能(Artificial Intelligence, AI)与机器学习(Machine Learning, ML)组件引入的动态不确定行为,尤其是在严格的实时性、功耗和安全性约束下。论文提出的关键解决方案在于构建一种跨抽象层级的全生命周期方法论,涵盖设计阶段与运行阶段的保障机制,重点推进可靠性建模、安全架构设计及适配学习型组件的认证方法,从而弥合AI技术创新与系统级可认证可靠性之间的鸿沟。
链接: https://arxiv.org/abs/2604.27807
作者: Behnaz Ranjbar,Kirankumar Raveendiran,Sudeep Pasricha,Samarjit Chakraborty,Cecilia Carbonelli,Akash Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:The design of embedded safety-critical systems such as those used in next-generation automotive and autonomous platforms, is increasingly challenged by escalating system complexity, hardware-software heterogeneity, and the integration of intelligent, data-driven components. Ensuring dependability in such systems requires a holistic approach that spans multiple abstraction layers and encompasses both design- and run-time assurance. Traditional methods for reliability, safety, and security management often fall short in addressing the dynamic and uncertain behaviors introduced by Artificial Intelligence (AI) and Machine Learning (ML) components, especially under stringent real-time, power, and safety constraints. While AI and ML offer powerful predictive, adaptive, and self-optimizing capabilities that can enhance system dependability, their inherent non-determinism, data-dependence, and lack of formal guarantees introduce new challenges for verification, validation, and certification. This paper explores emerging methodologies, architectures, and frameworks for designing dependable autonomous and embedded systems in the era of AI. It highlight advances in reliability modeling, secure system design, and certification approaches that account for imperfect, learning-enabled components, aiming to bridge the gap between AI innovation and certifiable system-level dependability.
[AI-42] Post-Optimization Adaptive Rank Allocation for LoRA
【速读】:该论文旨在解决标准低秩适应(Low-Rank Adaptation, LoRA)方法在参数高效微调中因对所有模型层强制使用统一秩而导致的参数冗余问题,该问题源于未考虑不同层内在维度(intrinsic dimensionality)的差异。解决方案的关键在于提出一种无需数据的后处理压缩方法——后优化自适应秩分配(Post-Optimization Adaptive Rank Allocation, PARA),其核心机制是利用奇异值分解(Singular Value Decomposition, SVD)对所有层的奇异值施加全局阈值,从而基于层级谱重要性实现非均匀秩分配,有效降低冗余参数量并保持原始LoRA的预测性能。
链接: https://arxiv.org/abs/2604.27796
作者: Vishnuprasadh Kumaravelu,Sunil Gupta,P. K. Srijith
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Exponential growth in the scale of modern foundation models has led to the widespread adoption of Low-Rank Adaptation (LoRA) as a parameter-efficient fine-tuning technique. However, standard LoRA implementations disregard the varying intrinsic dimensionality of model layers and enforce a uniform rank, leading to parameter redundancy. We propose Post-Optimization Adaptive Rank Allocation (PARA), a data-free compression method for LoRA that integrates seamlessly into existing fine-tuning pipelines. PARA leverages Singular Value Decomposition to prune LoRA ranks using a global threshold over singular values across all layers. This results in non-uniform rank allocation based on layer-wise spectral importance. As a post-hoc method, PARA circumvents the training modifications and resulting instabilities that dynamic architectures typically incur. We empirically demonstrate that PARA reduces parameter count by 75-90% while preserving the predictive performance of the original, uncompressed LoRA across multiple vision and language benchmarks. Code will be published upon acceptance.
[AI-43] st Before You Deploy: Governing Updates in the LLM Supply Chain
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在部署过程中因提供商侧无声更新(silent updates)导致的行为漂移(behavioral drift)问题,这种漂移可能引发功能退化、格式错误、安全约束失效等兼容性风险。解决方案的关键在于提出一种部署端治理框架,包含三个核心组件:明确定义模型行为规则的“生产契约”(production contracts)、按部署风险类别组织的针对性测试套件(risk-category-based testing suite),以及基于安全与性能标准的发布检查点(compatibility gates),从而实现对非透明模型演进过程中的兼容性控制。
链接: https://arxiv.org/abs/2604.27789
作者: Mohd Sameen Chishti,Damilare Peter Oyinloye,Jingyue Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 4 pages, 1 figure, accepted to The 2nd International Workshop on Large Language Model Supply Chain Analysis (LLMSC2026) co-located with FSE 2026
Abstract:Large Language Models (LLMs) are increasingly used as core dependencies in software systems. However, the hosted LLM services evolve continuously through provider-side updates without explicit version changes. These silent updates can introduce behavioral drift, causing regressions in functionality, formatting, safety constraints, or other application-specific requirements. Existing approaches focus primarily on regression testing or versioning but do not provide deployer-side mechanisms for governing compatibility during opaque model evolution. This paper proposes a deployment-side governance framework based on three components: clearly defined rules for how the model is allowed to behave (production contracts), focused testing organized by deployment risk categories (risk-category-based testing suite), and release checkpoints that block updates unless they meet defined safety and performance standards (compatibility gates). Through exploratory validation across multiple LLM versions, we provide evidence that targeted testing in specific risk areas can uncover performance regressions that overall metrics miss. We also identify several open research challenges, including how to systematically build effective test suites, how to set reliable performance thresholds in non-deterministic systems, and how to detect and explain model drift when providers offer limited transparency. Overall, we frame LLM update management as a software supply chain governance problem and outline a research agenda for putting deployer-side compatibility controls into practice.
[AI-44] RuC: HDL-Agnostic Rule Completion Benchmark Generation
【速读】:该论文旨在解决现有基准测试在评估大语言模型(Large Language Models, LLMs)于寄存器传输级(Register Transfer Level, RTL)开发中的代码补全能力时,缺乏对补全粒度和语法范围的控制问题。传统方法要么评估整个硬件模块的生成,要么仅完成单行代码,无法灵活调整任务复杂度与语法层级。解决方案的关键在于提出一种语言无关的规则补全框架(Rule-completion, RuC),该框架基于目标硬件描述语言(Hardware Description Language, HDL)的语法规则自动掩蔽并重构代码片段,从而实现从赋值语句到完整逻辑模块等任意粒度的可控补全任务生成。RuC通过语法驱动的方式,使模型在上下文约束下重建被掩蔽区域,显著提升了对领域特定代码理解能力的可扩展、可比较评估能力。
链接: https://arxiv.org/abs/2604.27780
作者: Arnau Ayguadé Domingo,Miquel Alberti-Binimelis,Cristian Gutierrez-Gomez,Emanuele Parisi,Razine Moundir Ghorab,Miquel Moreto,Gokcen Kestor,Dario Garcia-Gasulla
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 7 pages, 6 figures
Abstract:Large Language Models (LLMs) have rapidly improved in performance across code-related tasks, making their integration into Register Transfer Level (RTL) development increasingly attractive. Mimicking the behavior of inline code assistants, many benchmarks evaluate LLMs’ capabilities in code completion, either assessing the generation of entire hardware modules or the completion of a single line within a module. However both of these approaches lack the ability to control the granularity of the code-completion sample size and the syntactic range of completions. To overcome these limitations, we present a framework for language-agnostic rule completion (RuC), a grammar-driven, rule-selectable benchmark generator that automatically produces RTL code-completion tasks from a set of input hardware description sources. RuC uses the target Hardware Description Language (HDL) grammar to mask syntactically defined code regions and prompts a model to regenerate them using the surrounding unmasked code as context, enabling a controlled and scalable evaluation of the domain-specific model’s code-understanding capabilities, ranging from assignments to the reconstruction of entire logic blocks. We use RuC to generate two SystemVerilog rule-completion benchmarks from the Tiny Tapeout shuttle TT07 and the CVE2 RISC-V core to demonstrate RuC’s applicability to a broad range of designs, and conduct a comparative study of the code completion capabilities of modern open-source LLMs across diverse settings. Results indicate that completion performance strongly depends on the model type, the grammatical structure of the masked region, and the prompting strategy. Specifically, the highest scores are obtained with Fill-in-the-Middle (FIM) prompting. These findings highlight the value of grammar-driven, arbitrarily granular benchmarks for meaningful evaluation of LLM capabilities in RTL development workflows.
[AI-45] Intent2Tx: Benchmarking LLM s for Translating Natural Language Intents into Ethereum Transactions
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在Web3场景下将用户自然语言意图转化为功能正确、状态依赖的链上交易时存在的能力缺口问题。现有基准测试无法准确刻画真实世界中复杂协议交互的多样性与动态性,尤其在多步骤操作和跨状态推理方面表现不足。解决方案的关键在于提出一个高保真度的基准数据集——\textscIntent2Tx,其包含来自以太坊主网300天真实交易痕迹构建的29,921个单步与1,575个多步实例,覆盖11类协议交互,涵盖多样化的长尾去中心化金融(Decentralized Finance, DeFi)原语;同时设计了一个执行感知的评估框架,通过在分叉的主网环境中进行差分状态分析,超越表面文本匹配,从而严格验证LLM输出是否达成预期的状态转移。这一方法揭示了当前模型在分布外泛化能力和多步规划上的局限,并凸显出“推理到执行”能力的重大差距。
链接: https://arxiv.org/abs/2604.27763
作者: Zhuoran Pan,Yue Li,Zhi Guan,Jianbin Hu,Zhong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The emergence of Large Language Models (LLMs) offers a transformative interface for Web3, yet existing benchmarks fail to capture the complexity of translating high-level user intents into functionally correct, state-dependent on-chain transactions. We present \textscIntent2Tx, a high-fidelity benchmark featuring 29,921 single-step and 1,575 multi-step instances meticulously derived from 300 days of real-world Ethereum mainnet traces. Unlike prior works that rely on synthetic instructions, \textscIntent2Tx grounds natural language intents in real-world protocol interactions across 11 categories, including diverse long-tail Decentralized Finance (DeFi) primitives. To enable rigorous evaluation, we propose an execution-aware framework that transcends surface-level text matching by employing differential state analysis on forked mainnet environments. Our extensive evaluation of 16 state-of-the-art LLMs reveals that while scaling and retrieval-augmentation enhance logical consistency and parameter precision, current models struggle with out-of-distribution generalization and multi-step planning. Crucially, our execution-based analysis demonstrates that syntactically valid outputs often fail to achieve intended state transitions, highlighting a significant gap in current “reasoning-to-execution” capabilities. \textscIntent2Tx serves as a critical foundation for developing autonomous, reliable agents in intent-centric Web3 ecosystems. Code and data: this https URL .
[AI-46] Consumer Attitudes Towards AI in Digital Health: A Mixed-Methods Survey in Australia
【速读】:该论文旨在解决当前医疗人工智能(Healthcare AI)部署中消费者态度与实际应用之间存在的鸿沟问题,即尽管技术性能不断提升,但患者对AI应用的接受度仍受其感知的准确性、安全性及数据使用风险等因素制约。解决方案的关键在于通过具体场景评估(如AI生成与医生撰写的诊疗摘要对比),揭示消费者更关注AI输出的沟通质量与可见的人类治理机制,而非单纯的技术指标,从而强调临床监督框架在提升信任和采纳中的核心作用。
链接: https://arxiv.org/abs/2604.27744
作者: Wei Zhou,Rashina Hoda,Joycelyn Ling
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI applications are increasingly being introduced into digital health. While technical performance has advanced rapidly, successful deployment mainly depends on consumer attitudes, especially to patient-facing applications. However, most existing research examines consumer attitudes towards healthcare AI at an abstract level rather than in response to concrete artefacts. We report a mixed-methods survey study in Australia (N=275) examining consumer readiness, acceptance, trust, and risk perceptions of healthcare AI, combined with a scenario-based evaluation of an AI-generated versus clinician-written consultation summary. Participants expressed moderate optimism and strong perceived usefulness and ease of use, but also substantial concerns about accuracy, safety, and data use. In the scenario task, the AI-generated summary was strongly preferred for quality, empathy, and overall usefulness, yet identification of the AI summary was near chance. Findings show that consumers judge AI through concrete communication quality and visible human governance, underscoring the need for clinically supervised deployment frameworks beyond technical performance alone.
[AI-47] Why Self-Supervised Encoders Want to Be Normal
【速读】:该论文旨在解决编码器-解码器学习中如何在有限监督或无监督场景下构建具有理论保障的表征学习框架问题。其核心挑战在于平衡压缩信息(率)与保持预测性能(失真)之间的权衡,同时确保模型在低数据量下的泛化能力。解决方案的关键在于基于信息瓶颈(Information Bottleneck, IB)原理,将IB重构为以KL散度为失真的率失真问题,并证明最优表示是在概率单纯形内对预测流形(predictive manifold)进行软聚类,且可线性解码;进一步提出Sketch Isotropic Gaussian Regularization(SIGReg),通过一系列精确变换(从平坦Dirichlet分布到指数族再到各向同性高斯分布)实现该原则的高斯松弛,量化熵开销并作为分布正则项,从而在半监督和自监督设置中提供一致的率-失真控制。
链接: https://arxiv.org/abs/2604.27743
作者: Yuval Domb
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We develop a geometric and information-theoretic framework for encoder-decoder learning built on the Information Bottleneck (IB) principle. Recasting IB as a rate-distortion problem with Kullback-Leibler (KL) divergence as distortion, we show that the optimal representation at any distortion level is a soft clustering of the \emphpredictive manifold \mathcalM=\p(Y|x):x\in\mathcalX\ inside the probability simplex, admitting a linear decoder in the canonical parameterization. We derive a chain of exact transformations, from flat Dirichlet to exponential to isotropic Gaussian, connecting the maximum entropy prior on the simplex to Euclidean space, with quantified entropy overhead at each step, and show that Sketched Isotropic Gaussian Regularization (SIGReg) implements a Gaussian relaxation of this principle whose overhead affects rate accounting but not achievable prediction. This relaxation provides a principled distributional regularizer for learning with limited or no supervision. Using the Conditional Entropy Bottleneck (CEB) decomposition, we derive concrete encoder losses for supervised and semi-supervised settings, estimated via minibatch marginals without variational bounds. In the self-supervised setting, the CEB conditional rate is replaced by a view-prediction proxy. SIGReg serves as the distributional regularizer for both the semi-supervised and self-supervised settings. Experiments on toy problems and FashionMNIST confirm the predicted rate-distortion trade-offs and show that the non-parametric estimator is competitive with the standard variational approach. Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2604.27743 [cs.IT] (or arXiv:2604.27743v1 [cs.IT] for this version) https://doi.org/10.48550/arXiv.2604.27743 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yuval Domb PhD [view email] [v1] Thu, 30 Apr 2026 11:33:05 UTC (990 KB)
[AI-48] Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering
【速读】:该论文旨在解决医学检索增强生成(Medical Retrieval-Augmented Generation, RAG)系统在处理生物医学文献时忽略丰富视觉内容(如表格、图表和结构化布局)的问题,从而限制了其对复杂医学知识的准确理解和生成能力。解决方案的关键在于提出MED-VRAG,一个迭代式多模态RAG框架,它直接从PMC文档页面图像中检索并推理信息,而非依赖OCR提取的文本片段。该框架通过将ColQwen2.5的patch级页面嵌入与分片MapReduce LLM过滤器结合,在约35万页规模下实现高效检索(第一阶段<30ms),并利用视觉语言模型(VLM)在最多三轮推理中迭代优化查询、积累证据至记忆库,显著提升医疗问答任务性能(平均准确率达78.6%)。
链接: https://arxiv.org/abs/2604.27724
作者: Xupeng Chen,Binbin Shi,Chenqian Le,Jiaqi Zhang,Kewen Wang,Ran Gong,Jinhan Zhang,Chihang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR’d text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4xA100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MEDVRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.
[AI-49] Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures Format Collapse and Domain Adaptation
【速读】:该论文旨在解决前沿视觉语言模型(Vision-Language Models, VLMs)在临床医学场景中部署时的可审计性与可信性问题,特别是针对其在医学视觉问答(Medical VQA)任务中的感知能力不足和模块化流水线集成缺陷。研究发现,当前主流VLMs在解剖结构与病灶定位上表现不佳(最佳模型平均IoU仅为0.23,Acc@0.5仅19.1%),并存在危险的左右侧混淆问题;同时,采用自接地(self-grounding)两阶段流水线(先定位后回答)显著降低VQA准确率,主要源于定位错误及提示格式不兼容导致的解析失败(如Gemini和GPT-5在VQA-RAD数据集上的解析失败率达70%–99%)。关键解决方案在于识别出感知模块(perception module)是信任瓶颈的核心,并通过监督微调(supervised fine-tuning)对Qwen-2.5-VL进行领域适配,在SLAKE开放问答召回率上达到85.5%,表明VQA性能差距可通过领域数据增强缓解,但感知层面的信任问题仍需进一步探索。
链接: https://arxiv.org/abs/2604.27720
作者: Xupeng Chen,Binbin Shi,Chenqian Le,Qifu Yin,Lang Lin,Haowei Ni,Ran Gong,Panfeng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying vision-language models (VLMs) in clinical settings demands auditable behavior under realistic failure conditions, yet the failure landscape of frontier VLMs on specialized medical inputs is poorly characterized. We audit five recent frontier and grounding-aware VLMs (Gemini~2.5~Pro, GPT-5, o3, GLM-4.5V, Qwen~2.5~VL) on Medical VQA along two trust-relevant axes. Perception: all models localize anatomical and pathological targets poorly – the best model reaches only 0.23 mean IoU and 19.1% Acc@0.5 – and exhibit clinically dangerous laterality confusion. Pipeline integration: a self-grounding pipeline, where the same model localizes then answers, degrades VQA accuracy for every model – driven by both inaccurate localization and format-compliance failures under the two-step prompt (parse failure rises to 70%–99% for Gemini and GPT-5 on VQA-RAD). Replacing predicted boxes with ground-truth annotations recovers and improves VQA accuracy, consistent with the failure residing in the perception module rather than in the decomposition itself. These observational findings identify grounding quality as a primary trustworthiness bottleneck in our SLAKE bounding-box setting. As a complementary fine-tuning follow-up, supervised fine-tuning of Qwen~2.5~VL on combined Med-VQA training data attains the highest reported SLAKE open-ended recall (85.5%) among comparable methods, suggesting that the VQA-level gap is tractable with domain adaptation; whether this also closes the perception/trustworthiness bottleneck is left to future work.
[AI-50] Knowledge Graph Representations for LLM -Based Policy Compliance Reasoning
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 风险日益增长背景下,如何高效、准确地从复杂的 AI 政策文档中提取和检索相关信息以支持合规性问答的问题。其解决方案的关键在于构建一个基于代理(agentic)框架的知识图谱(Knowledge Graph, KG)系统:首先从三份与 AI 风险相关的政策文件中构建 KG,采用两种本体(ontology)模式进行结构化建模;随后利用五种大语言模型(Large Language Models, LLMs)在 42 个涵盖六类推理类型(从实体查找至跨政策推理)的政策问答任务上进行评估,结果表明,知识图谱增强显著提升了所有模型的表现,且由 LLM 自主发现的开放本体模式在性能上可媲美甚至超越正式定义的本体结构。
链接: https://arxiv.org/abs/2604.27713
作者: Wilder Baldwin,Sepideh Ghanavati
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The risks posed by AI features are increasing as they are rapidly integrated into software applications. In response, regulations and standards for safe and secure AI have been proposed. In this paper, we present an agentic framework that constructs knowledge graphs (KGs) from AI policy documents and retrieves policy-relevant information to answer questions. We build KGs from three AI risk-related polices under two ontology schemas, and then evaluate five LLMs on 42 policy QA tasks spanning six reasoning types, from entity lookup to cross-policy inference, using both heuristic scoring and an LLM-as-judge. KG augmentation improves scores for all five models, and an open, LLM-discovered schema matches or exceeds the formal ontology.
[AI-51] Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents
【速读】:该论文旨在解决当前具身智能体(embodied agents)普遍存在的局限性问题,即它们通常仅能被动执行指令或响应即时需求,缺乏稳定且高阶的价值框架,难以实现长期自主行为和有效处理动机冲突。解决方案的关键在于提出一种分层认知架构——ValuePlanner,其核心创新是将高层价值调度(high-level value scheduling)与底层动作执行(low-level action execution)解耦,并通过基于大语言模型(LLM)的认知模块生成符号化子目标(symbolic subgoals),再由经典PDDL规划器将其转化为可执行的动作计划,整个过程通过闭环反馈机制持续优化。这一设计使智能体能够基于抽象价值权衡自主规划长时程行为,从而实现更符合人类价值观的自驱动决策。
链接: https://arxiv.org/abs/2604.27699
作者: Chunhui Zhang,Yuxuan Wang,Aoyang Qin,Yi-Long Lu,Kunlun Wu,Yizhou Wang,Wei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Current embodied agents are often limited to passive instruction-following or reactive need-satisfaction, lacking a stable, high-order value framework essential for long-term, self-directed behavior and resolving motivational conflicts. We introduce \textitValuePlanner, a hierarchical cognitive architecture that decouples high-level value scheduling from low-level action execution. \textitValuePlanner employs an LLM-based cognitive module to generate symbolic subgoals by reasoning through abstract value trade-offs, which are then translated into executable action plans by a classical PDDL planner. This process is refined via a closed-loop feedback mechanism. Evaluating such autonomy requires methods beyond task-success rates, and we therefore propose a value-centric evaluation suite measuring cumulative value gain, preference alignment, and behavioral diversity. Experiments in the TongSim household environment demonstrate that \textitValuePlanner arbitrates competing values to generate coherent, long-horizon, self-directed behavior absent from instruction-following and needs-driven baselines. Our work offers a structured approach to bridging intrinsic values and grounded behavior for autonomous agents.
[AI-52] When Agents Evolve Institutions Follow
【速读】:该论文试图解决多智能体系统(Multi-Agent Systems)在面对复杂任务时的集体组织问题,即如何在认知有限且信息不完整的个体之间有效协调集体行动。其核心挑战并非单个智能体的个体智能,而是群体层面的治理结构设计。解决方案的关键在于借鉴历史政治制度的经验,将七种经典治理模式转化为可执行的多智能体架构,并通过在三个大语言模型和两个基准测试上的对比实验,验证了治理拓扑结构对集体性能的显著影响——最优架构随模型能力与任务特性动态变化,表明未来集体智能的发展路径应从单一固定组织形式转向可动态重构的自适应治理机制,从而实现从“自进化智能体”到“自进化多智能体系统”的范式跃迁。
链接: https://arxiv.org/abs/2604.27691
作者: Chao Fei,Hongcheng Guo,Yanghua Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Across millennia, complex societies have faced the same coordination problem of how to organize collective action among cognitively bounded and informationally incomplete individuals. Different civilizations developed different political institutions to answer the same basic questions of who proposes, who reviews, who executes, and how errors are corrected. We argue that multi-agent systems built on large language models face the same challenge. Their central problem is not only individual intelligence, but collective organization. Historical institutions therefore provide a structured design space for multi-agent architectures, making key trade-offs between efficiency and error correction, centralization and distribution, and specialization and redundancy empirically testable. We translate seven historical political institutions, spanning four canonical governance patterns, into executable multi-agent architectures and evaluate them under identical conditions across three large language models and two benchmarks. We find that governance topology strongly shapes collective performance. Within a single model, the gap between the best and worst institution exceeds 57 percentage points, while the optimal architecture shifts systematically with model capability and task characteristics. These results suggest that collective intelligence will not advance through a single optimal organizational form, but through governance mechanisms that can be reselected and reconfigured as tasks and capabilities evolve. More broadly, this points to a transition from \textbfself-evolving agents to the \textbfself-evolving multi-agent system. The code is available on \hrefthis https URLGitHub.
[AI-53] Fairness for distribution network operations and planning
【速读】:该论文旨在解决配电网络(Distribution Network, DN)规划与运行中公平性(fairness)融入的挑战,特别是如何量化和实现不同公平性理念对资源分配效率的影响。其核心问题是:在追求社会公平(如减少区域间电价差异或服务不均)时,所付出的效率代价(即“公平价格”Price of Fairness, PoF)如何被合理评估,并在多维公平指标下进行优化决策。解决方案的关键在于系统梳理从平等主义到功绩导向等多样化的公平度量标准(metrics),并分析这些指标在数学建模上的复杂性(线性至非线性规划),从而为DN运营中的利益相关方提供一致且透明的决策依据,推动公平与效率之间的平衡优化。
链接: https://arxiv.org/abs/2604.27669
作者: Pedro F. C. de Carvalho,Zijie Liu,Md Umar Hashmi,Dirk Van Hertem
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 16 pages, 0 figures, 2 tables, CIRED Conference Workshop Brussels 2026
Abstract:The incorporation of fairness into the distribution network (DN) planning and operation has become a key goal of recent studies. The cost of implementing fairness, denominated the price of fairness (PoF), covers the efficiency that is renounced for attaining social cohesion through fair outcomes. Locational disparity makes fairness schemes emerge to level the consumers playing field. However, fairness encompasses a range of notions. From egalitarian to merit-based criteria, various metrics are implemented as a tool for measuring equitable utility distribution. These have different mathematical complexities, from linear to non-linear programming cases, which affect their overall applicability. Hence, this study compiles the overarching fairness notions and metrics, reviewing how these affect stakeholders and the inherent mathematical optimisation in resource allocation problems. The aim is to support consistent and transparent planning and decision-making within DN operations.
[AI-54] From Context to Skills: Can Language Models Learn from Context Skillfully?
【速读】:该论文旨在解决语言模型(Language Models, LMs)在面对超出其参数化知识范围的复杂上下文时,如何实现高效上下文学习(Context Learning)的问题。核心挑战在于:一方面,手动标注长且技术密集的上下文技能成本高昂;另一方面,缺乏外部反馈信号以判断所构建的技能是否有效。解决方案的关键在于提出一种自演化框架 Ctx2Skill,其核心是一个多智能体自对弈循环(multi-agent self-play loop),包含 Challenger(生成探测任务和评分标准)、Reasoner(基于演进的技能集尝试解决问题)和 Judge(提供二元反馈)。其中,Challenger 和 Reasoner 通过专用的 Proposer 和 Generator 代理分析失败案例并合成针对性技能更新,从而实现无监督下的技能自动发现与优化。此外,引入 Cross-time Replay 机制防止因极端任务生成或技能过度专业化导致的对抗性崩溃,确保技能演化的鲁棒性和泛化能力。最终生成的技能可无缝集成至任意语言模型以提升其上下文学习性能。
链接: https://arxiv.org/abs/2604.27660
作者: Shuzheng Si,Haozhe Zhao,Yu Lei,Qingyi Wang,Dingwei Chen,Zhitong Wang,Zhenhailong Wang,Kangyang Luo,Zheng Wang,Gang Chen,Fanchao Qi,Minjia Zhang,Maosong Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge from the given context. An intuitive solution is inference-time skill augmentation: extracting the rules and procedures from context into natural-language skills. However, constructing such skills for context learning scenarios faces two challenges: the prohibitive cost of manual skill annotation for long, technically dense contexts, and the lack of external feedback for automated skill construction, since there is no automatic signal to tell whether a proposed skill is helpful. In this paper, we propose Ctx2Skill, a self-evolving framework that autonomously discovers, refines, and selects context-specific skills without human supervision or external feedback. At its core, a multi-agent self-play loop has a Challenger that generates probing tasks and rubrics, a Reasoner that attempts to solve them guided by an evolving skill set, and a neutral Judge that provides binary feedback. Crucially, both the Challenger and the Reasoner evolve through accumulated skills: dedicated Proposer and Generator agents analyze failure cases and synthesize them into targeted skill updates for both sides, enabling automated skill discovery and refinement. To prevent adversarial collapse caused by increasingly extreme task generation and over-specialized skill accumulation, we further introduce a Cross-time Replay mechanism that identifies the skill set achieving the best balance across representative cases for the Reasoner side, ensuring robust and generalizable skill evolution. The resulting skills can be plugged into any language model to obtain better context learning capability. Evaluated on four context learning tasks from CL-bench, Ctx2Skill consistently improves solving rates across backbone models.
[AI-55] When Does Structure Matter in Continual Learning? Dimensionality Controls When Modularity Shapes Representational Geometry
【速读】:该论文旨在解决持续学习(continual learning)中的稳定性-可塑性权衡问题,即如何在保持已有知识稳定性的前提下有效学习新任务,避免任务间干扰。其核心挑战在于理解网络架构、任务相似性和表征维度如何共同影响学习过程中的结构分离与交互机制。解决方案的关键在于揭示了表征维度(representational dimensionality)作为关键组织变量的作用:在高维空间中,由于表征自由度充足,模块化架构的影响较小;而在低维(丰富)空间中,模块化网络能通过渐进式的子空间对齐、部分正交化和强分离策略,实现对不同任务的差异化表征组织,从而显著减少干扰并提升迁移效率。这一发现强调了自适应几何结构(adaptive geometry)是设计高效持续学习系统的核心原则。
链接: https://arxiv.org/abs/2604.27656
作者: Kathrin Korte,Joachim Winter Pedersen,Eleni Nisioti,Sebastian Risi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:To preserve previously learned representations, continual learning systems must strike a balance between plasticity, the ability to acquire new knowledge, and stability. This stability-plasticity dilemma affects how representations can be reused across tasks: shared structure enables transfer when tasks are similar but may also induce interference when new learning disrupts existing representations. However, it remains unclear when and why structural separation influences this trade-off. In this study, we examine how network architecture, task similarity, and representational dimensionality jointly shape learning in a sequential task paradigm inspired by transfer-interference studies. We compare a task-partitioned modular recurrent network with a single-module baseline by systematically varying task similarity (low, medium, high) and the scale of weight initialization, which induces different learning regimes that we empirically characterize through the effective dimensionality of the learned representations. We find that architecture has minimal impact in high-dimensional regimes where representations are sufficiently unconstrained to accommodate multiple tasks without strong interference. In contrast, in lower-dimensional (rich) regimes, architectural separation is decisive: modular networks exhibit graded alignment of task-specific subspaces with overlap for similar tasks, partial orthogonalization for moderately dissimilar tasks, and stronger separation for dissimilar tasks. This graded geometry is absent in the single network baseline. Our findings suggest that representational dimensionality acts as a key organizing variable governing when structural separation becomes functionally relevant, and highlight adaptive geometry as a central principle for designing continual learning systems.
[AI-56] ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning
【速读】:该论文旨在解决如何让语言模型在无监督条件下自主生成可验证的问题、求解这些问题,并利用反馈实现自我改进的挑战,从而突破传统“学习回答”范式。其核心解决方案是提出ANCORA框架,关键在于三个机制:一是两级群体相对更新机制,将问题生成器(Proposer)的优势与求解器(Solver)的优势跨规格和跨尝试进行耦合;二是迭代自蒸馏监督微调(SFT),在强化学习前将基础模型投影到有效输出流形上以增强稳定性;三是UCB引导的课程图(Curriculum DAG),仅通过严格筛选且经求解器验证的新颖规范逐步扩展训练任务。这些机制共同防止了稀疏验证反馈导致的问题生成器崩溃,使模型在零样本评估下显著提升性能,例如在Verus环境中将Dafny2Verus的pass@1从26.6%提升至81.5%。
链接: https://arxiv.org/abs/2604.27644
作者: Chengcao Yang,Jun Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introduce ANCORA, an anchored-curriculum framework in which a unified policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions. ANCORA rests on three load-bearing mechanisms: a two-level group-relative update that couples Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT that projects the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG that grows only through strictly filtered, novel, Solver-verified specifications. These stabilizers are necessary because sparse verifier feedback otherwise drives Proposer collapse even under MLRL-aligned rewards. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in the test-time-training setting under 0-shot evaluation, outperforming the PSV self-play baseline by 15.8 points despite PSV using 1-shot inference; in a separate transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.
[AI-57] HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLM s
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成通用验证方法学(Universal Verification Methodology, UVM)测试平台和序列时因硬件描述语言(Hardware Description Languages, HDLs)训练数据稀缺而导致代码错误的问题。其核心解决方案是提出HAVEN(Hybrid Automated Verification ENgine),关键在于通过分层架构避免LLMs直接编写HDL代码:首先利用LLM代理分析设计规范生成结构化架构方案,再由模板引擎结合预定义的协议特定Jinja2模板自动生成具备正确总线握手时序的UVM组件;其次引入协议感知的序列领域特定语言(Protocol-Aware Sequence Domain-Specific Language, DSL),将序列分解为细粒度步骤类型,并基于规则驱动的代码生成器实现高覆盖率的初始序列构建,随后迭代调用LLM代理分析覆盖率缺口报告并补充针对性DSL序列。此方法显著提升了编译成功率、代码覆盖率与功能覆盖率,达到当前LLM辅助测试平台生成系统的最先进水平。
链接: https://arxiv.org/abs/2604.27643
作者: Chang-Chih Meng,Yu-Ren Lu,Guan-Yu Lin,Tsung Tai Yeh,Kai-Chiang Wu,I-Chen Wu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 5 tables
Abstract:Integrated Circuit (IC) verification consumes nearly 70% of the IC development cycle, and recent research leverages Large Language Models (LLMs) to automatically generate testbenches and reduce verification overhead. However, LLMs have difficulty generating testbenches correctly. Unlike high-level programming languages, Hardware Description Languages (HDLs) are extremely rare in LLMs training data, leading LLMs to produce incorrect code. To overcome challenges when using LLMs to generate Universal Verification Methodology (UVM) testbenches and sequences, wepropose HAVEN (Hybrid Automated Verification ENgine) to prevent LLMs from writing HDL directly. For UVM testbench generation, HAVEN utilizes LLM agents to analyze design specifications to produce a structured architectural plan. The HAVEN Template Engine then combines with predefined and protocol-specific templates to generate all UVM components with correct bus-handshake timing. For UVM sequence generation, HAVEN introduces a Protocol-Aware Sequence Domain-Specific Language (DSL) that decomposes sequences into fine-grained step types. A set of predefined DSL patterns first establishes sequences that achieve a high coverage rate without LLM involvement. HAVEN continues to improve the coverage rate by iteratively leveraging LLM agents to analyze coverage gap reports and compose additional targeted DSL sequences. Unlike previous works, HAVEN is the first system that utilizes pre-defined, protocol-specific Jinja2 templates to generate all UVM components and UVM sequences using our proposed Protocol-Aware DSL and rule-based code generator. Our experimental results on 19 open-source IP designs spanning three interface protocols (Direct, Wishbone, AXI4-Lite) show that HAVEN achieves 100% compilation success, 90.6% code coverage, and 87.9% functional coverage on average, and is SOTA among LLM-assisted testbench generation systems.
[AI-58] Optimization before Evaluation: Evaluation with Unoptimised Prompts Can be Misleading
【速读】:该论文试图解决当前大型语言模型(Large Language Model, LLM)评估框架中使用静态提示模板(static prompt template)导致评估结果不可靠的问题,这与工业界实践中针对每个模型进行提示优化(prompt optimization, PO)以最大化应用性能的常规做法不一致。解决方案的关键在于引入针对每个模型独立进行提示优化的评估流程,实验证明该方法显著影响模型排名,凸显了在模型选择过程中对每个模型执行个性化提示优化的重要性。
链接: https://arxiv.org/abs/2604.27637
作者: Nicholas Sadjoli,Tim Siefken,Atin Ghosh,Yifan Mai,Daniel Dahlmeier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Abstract:Current Large Language Model (LLM) evaluation frameworks utilize the same static prompt template across all models under evaluation. This differs from the common industry practice of using prompt optimization (PO) techniques to optimize the prompt for each model to maximize application performance. In this paper, we investigate the effect of PO towards LLM evaluations. Our results on public academic and internal industry benchmarks show that PO greatly affects the final ranking of models. This highlights the importance of practitioners performing PO per model when conducting evaluations to choose the best model for a given task.
[AI-59] Generative structure search for efficient and diverse discovery of molecular and crystal structures
【速读】:该论文旨在解决在分子和材料发现中,如何高效预测稳定及亚稳态结构的问题,尤其针对高维能量景观搜索成本高昂、现有深度生成模型受限于训练数据分布且难以探索稀有但物理上重要的局部极小值这一挑战。其解决方案的关键在于提出了一种统一的生成式结构搜索(Generative Structure Search, GSS)框架,将基于扩散的生成与随机结构搜索(Random Structure Search, RSS)视为由学习到的梯度场(score field)和物理力共同驱动的采样过程的不同极限情形;通过耦合数据先验引导的生成与能量驱动的局部极小值探索,GSS 在保持对稀有结构敏感性的同时显著降低采样成本,在多种体系中实现了比RSS更高效且覆盖更广的结构发现能力,即使对于训练分布外的组分也具有有效性。
链接: https://arxiv.org/abs/2604.27636
作者: Yifang Qin,Yu Shi,Junfu Tan,Chang Liu,Ming Zhang,Ziheng Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Predicting stable and metastable structures is central to molecular and materials discovery, but remains limited by the cost of searching high-dimensional energy landscapes. Deep generative models offer efficient structure sampling, yet their outputs remain shaped by training data and can underexplore minima that are rare but physically relevant. We introduce generative structure search (GSS), a unified framework that formulates diffusion-based generation and random structure search (RSS) as limiting regimes of a common sampling process driven by learned score fields and physical forces. Coupling these drivers lets GSS use data priors to accelerate sampling while retaining energy-guided exploration of local minima. Across molecular and crystalline systems, GSS recovers diverse metastable structures with more than tenfold lower sampling cost than RSS for broad coverage and remains effective for compositions outside the training distribution. The results establish a physically grounded generative search strategy for discovering structures beyond the reach of data-driven sampling alone.
[AI-60] Political Bias Audits of LLM s Capture Sycophancy to the Inferred Auditor
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)政治偏见评估中存在的核心问题:现有基于固定问卷的审计方法是否真正反映模型的内在意识形态倾向,还是仅仅捕捉了模型对用户身份的迎合行为(sycophantic accommodation)。研究表明,标准政治偏见测试结果在很大程度上并非源于模型固有的左倾立场,而是由于模型根据推断出的提问者身份调整回答策略。解决方案的关键在于设计了一个因子实验(factorial experiment),在控制其他变量的前提下,仅改变提问者的自我陈述身份(如保守派共和党人或进步派民主党人),并系统测量六种前沿LLM在三种主流政治测评工具下的响应变化。结果显示,当提问者自称为保守派时,所有模型的回答显著右移,且这种右倾调整幅度是左倾调整的8倍;而当提问者为进步派时,响应几乎不变。这表明模型的政治偏见不是静态属性,而是一种动态响应模式,需在不同真实交互情境中进行建模与评估。
链接: https://arxiv.org/abs/2604.27633
作者: Petter Törnberg,Michelle Schimmel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are commonly evaluated for political bias based on their responses to fixed questionnaires, which typically place frontier models on the political left. A parallel literature shows that LLMs are sycophantic: they adapt their answers to the views, identities, and expectations of the user. We show that these findings are linked: standard political-bias audits partly capture sycophantic accommodation to the inferred auditor. We employ a factorial experiment across three major audit instruments–the Political Compass Test, the Pew Political Typology, and 1,540 partisan-benchmarked Pew American Trends Panel items–administered to six frontier LLMs while varying only the asker’s stated identity (N = 30,990 responses). At baseline, all six models lean left. When the asker identifies as a conservative Republican, responses shift sharply: the share of items closer to Democrats falls by 28-62 percentage points, and all six models move right of center. A mirror-image progressive-Democrat cue produces little change; rightward accommodation is 8.0 \times larger than leftward. When asked who the default asker is, models identify an auditor, researcher, or academic; when asked what answer that asker expects, they select the Democrat-coded option 75% of the time, nearly the rate under an explicit progressive cue. These patterns are inconsistent with a purely fixed model ideology and indicate that single-prompt audits capture an interaction between model and inferred interlocutor. Political bias in LLMs is therefore not a fixed point on an ideological scale but a response profile that must be mapped across realistic interlocutors.
[AI-61] WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
【速读】:该论文旨在解决半导体制造中晶圆缺陷视觉问答(Wafer Defect Visual Question Answering, Wafer VQA)任务因标注数据稀缺而导致模型训练困难的问题。其解决方案的关键在于提出了一种三阶段合成管道:首先通过基于聚类的清洗方法过滤标签噪声,接着利用视觉语言模型生成结构化的缺陷描述并转化为评估标准(rubric),最后基于这些rubric合成高质量的VQA对,确保覆盖缺陷类型识别、空间分布、形貌特征及根本原因分析等维度;同时引入双评估框架,结合规则基指标与LLM-Judge评分,并通过贝叶斯优化对齐二者,从而实现可靠自动化评估,并借助课程强化学习与rubric对齐奖励机制,在仅4B参数的小型Qwen3-VL模型上达到接近Gemini-3-Flash的性能,验证了领域特定微调的小模型在工业视觉理解任务中的优越性。
链接: https://arxiv.org/abs/2604.27629
作者: Ke Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures, 8 tables
Abstract:We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubrics criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing. Comments: 16 pages, 3 figures, 8 tables Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.27629 [cs.AI] (or arXiv:2604.27629v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.27629 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-62] race-Level Analysis of Information Contamination in Multi-Agent Systems
【速读】:该论文旨在解决结构化智能体工作流中不确定性传播与污染问题,即在处理异构信息源(如PDF、电子表格等)时,输入层面的扰动如何通过中间状态和执行路径引发不可预测的行为偏差,进而影响最终输出的准确性。其核心解决方案在于将不确定性视为可控变量:通过向 artifact 衍生的表示注入结构化扰动,在固定工作流下进行系统性日志记录,并基于计划、工具调用及中间状态的轨迹差异量化污染程度。研究发现,工作流可能在结构上保持一致但产生错误结果(沉默语义污染),也可能经历行为偏移后恢复正确答案(行为绕行并恢复),或同时出现结构破坏与控制流异常(如重路由、延长执行、提前终止)。这一发现揭示了传统验证机制失效的根本原因,并提出了一个形式化的污染表现分类体系、基于轨迹的检测与定位框架以及面向针对性验证、防御性设计和成本控制的实证依据。
链接: https://arxiv.org/abs/2604.27586
作者: Anna Mazhar,Huzaifa Suri,Sainyam Galhotra
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reasoning over heterogeneous artifacts (PDFs, spreadsheets, slide decks, etc.) increasingly occurs within structured agent workflows that iteratively extract, transform, and reference external information. In these workflows, uncertainty is not merely an input-quality issue: it can redirect decomposition and routing decisions, reshape intermediate state, and produce qualitatively different execution trajectories. We study this phenomenon by treating uncertainty as a controlled variable: we inject structured perturbations into artifact-derived representations, execute fixed workflows under comprehensive logging, and quantify contamination via trace divergence in plans, tool invocations, and intermediate state. Across 614 paired runs on 32 GAIA tasks with three different language models, we find a decoupling: workflows may diverge substantially yet recover correct answers, or remain structurally similar while producing incorrect outputs. We characterize three manifestation types: silent semantic corruption, behavioral detours with recovery, and combined structural disruption and their control-flow signatures (rerouting, extended execution, early termination). We measure operational costs and characterize why commonly used verification guardrails fail to intercept contamination. We contribute (i) a formal taxonomy of contamination manifestations in structured workflows, (ii) a trace-based measurement framework for detecting and localizing contamination across agent interactions, and (iii) empirical evidence with implications for targeted verification, defensive design, and cost control.
[AI-63] Statistical Channel Fingerprint Construction for Massive MIMO: A Unified Tensor Learning Framework
【速读】:该论文旨在解决大规模多输入多输出(Massive MIMO)通信系统中获取信道状态信息(CSI)时面临的高测量成本、隐私与安全约束问题,特别是如何高效构建统计信道指纹(sCF),即存储于每个潜在位置的统计CSI(sCSI)表示。其核心挑战在于如何在有限观测条件下重建高质量的sCF,并保持计算效率。解决方案的关键在于提出一种统一的张量学习架构LPWTNet,该架构通过闭式拉普拉斯金字塔(Laplacian Pyramid, LP)分解与重构框架替代传统编码器-解码器结构,从而在不增加参数量的前提下捕捉sCF的多尺度频带特征;同时引入共享掩码学习策略以自适应优化高频分量,并设计基于小核卷积与小波变换(Wavelet Transform, WT)的机制,在解耦不同频率成分的同时增强特征提取效率,最终实现多种实际场景下的高精度与低复杂度sCF重建。
链接: https://arxiv.org/abs/2604.27574
作者: Zhenzhou Jin,Li You,Xiang-Gen Xia,Xiqi Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: 15 pages, 7 figures
Abstract:Channel fingerprint (CF) is considered a key enabler for facilitating the acquisition of channel state information (CSI) in massive multiple-input multiple-output (MIMO) communication systems. In this work, we investigate a novel type of CF that stores statistical CSI (sCSI) at each potential location, referred to as statistical CF (sCF). Specifically, we reveal the relationship between sCSI, namely the channel spatial covariance matrix (CSCM), and the channel power angular spectrum (CPAS). Building on this foundation, we construct a unified tensor representation of the sCF and further reduce its dimension by exploiting the eigenvalue decomposition of the CSCM and its correlation with the PAS. Considering the practical constraints imposed by measurement cost, privacy, and security, we focus on three representative scenarios and uniformly formulate them as tensor restoration tasks. To this end, we propose a unified tensor-based learning architecture, termed LPWTNet. The architecture incorporates a closed-form Laplacian pyramid (LP) decomposition and reconstruction framework that replaces the traditional encoder-decoder structure, enabling efficient inference while capturing multi-scale frequency subband characteristics of the sCF. Additionally, a shared mask learning strategy is introduced to adaptively refine high-frequency sCF components through level-wise adjustments. To achieve a larger receptive field without over-parameterization, we further propose a small-kernel convolution mechanism based on the wavelet transform (WT), which decouples convolution across different frequency components of the sCF and enhances feature extraction efficiency. Extensive experiments show that the proposed approach delivers competitive reconstruction accuracy and computational efficiency across various sCF construction scenarios when compared with state-of-the-art baselines.
[AI-64] SpatialGrammar: A Domain-Specific Language for LLM -Based 3D Indoor Scene Generation
【速读】:该论文旨在解决从自然语言自动生成交互式3D室内场景时存在的空间错误和物体碰撞问题,这在虚拟现实、游戏和具身人工智能(Embodied AI)中尤为关键。现有基于大语言模型(Large Language Models, LLMs)的方法常因使用难以建模三维空间关系与物理约束的场景表示方式(如原始坐标或冗长代码)而导致生成结果不准确。其解决方案的关键在于提出一种领域特定语言——SpatialGrammar,它将重力对齐的室内布局表示为鸟瞰图(Bird’s Eye View, BEV)网格上的确定性放置,并可编译为合法的3D几何结构,从而支持可验证的约束检查;在此基础上构建了两个核心组件:SG-Agent(利用编译器反馈进行闭环迭代优化以强制满足碰撞约束)和SG-Mini(一个仅用编译器验证的合成数据训练的104M参数模型),二者共同显著提升了生成场景的空间保真度和物理合理性。
链接: https://arxiv.org/abs/2604.27555
作者: Song Tang,Kaiyong Zhao,Yuliang Li,Qingsong Yan,Penglei Sun,Junyi Zou,Qiang Wang,Xiaowen Chu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Automatically generating interactive 3D indoor scenes from natural language is crucial for virtual reality, gaming, and embodied AI. However, existing LLM-based approaches often suffer from spatial errors and collisions, in part because common scene representations-raw coordinates or verbose code-are difficult for models to reason about 3D spatial relationships and physical constraints. We propose SpatialGrammar, a domain-specific language that represents gravity-aligned indoor layouts as BEV grid placements with deterministic compilation to valid 3D geometry, enabling verifiable constraint checking. Building on this representation, we develop (1) SG-Agent, a closed-loop system that uses compiler feedback to iteratively refine scenes and enforce collision constraints, and (2) SG-Mini, a 104M-parameter model trained entirely on compiler-validated synthetic data. Across 159 test scenes spanning five scenarios of different complexity, SG-Agent improves spatial fidelity and physical plausibility over prior methods, while SG-Mini performs competitively against larger LLM-based baselines on single-shot generation scenarios.
[AI-65] In-Context Examples Suppress Scientific Knowledge Recall in LLM s
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在科学推理任务中因引入上下文示例而导致预训练领域知识被削弱的问题,即“知识位移”现象。研究表明,即使示例由目标公式生成,添加这些示例仍会促使模型从基于知识的推导转向经验性模式拟合,从而偏离科学推理的本质——从数据中揭示潜在结构(latent structure recovery)。解决方案的关键在于识别并量化这种知识位移效应:通过在五个科学领域、60项任务、6000次试验中系统验证,发现示例虽能改变模型行为,但其对准确性的影响取决于被取代策略与替代策略的相对优劣,且始终伴随着对知识驱动推理的偏离。这一发现警示科研实践者,在部署LLMs处理科学任务时,需谨慎评估示例是否真正增强而非削弱模型的知识应用能力。
链接: https://arxiv.org/abs/2604.27540
作者: Chaemin Jang,Woojin Park,Hyeok Yun,Dongman Lee,Jihee Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific reasoning rarely stops at what is directly observable; it often requires uncovering hidden structure from data. From estimating reaction constants in chemistry to inferring demand elasticities in economics, this latent structure recovery is what distinguishes scientific reasoning from curve fitting. Large language models (LLMs) can often recall and apply relevant scientific formulas, but we show that this ability is surprisingly easy to suppress. We show that adding in-context examples makes models rely less on pretrained domain knowledge, even when those examples are generated by the very same formula. Rather than reinforcing knowledge-driven derivation, examples shift computation toward empirical pattern fitting. We document this knowledge displacement on 60 latent structure recovery tasks across five scientific domains, 6,000 trials, and four models. This displacement is consistent across domains, but its accuracy consequences depend on how the displaced strategy compares to the one that replaces it: the same shift can lower accuracy, leave it unchanged, or appear to improve it. In all cases, however, the model shifts away from knowledge-driven reasoning. For practitioners deploying LLMs on scientific tasks, the message is cautionary: in-context examples may displace, rather than reinforce, the knowledge they are intended to support.
[AI-66] Belief-Guided Inference Control for Large Language Model Services via Verifiable Observations ACL2026
【速读】:该论文旨在解决黑盒大语言模型(Large Language Model, LLM)服务中响应可靠性与计算成本之间的权衡问题,即在请求处理时如何动态决定是否使用低成本默认输出或触发高成本推理路径以提升质量。其核心解决方案是提出Veroic框架,通过构建一个轻量级的可验证观测通道,将输入-输出对中的异构质量信号聚合为关于潜在响应可靠性的信念状态(belief state),并基于此状态设计预算感知策略,在部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)框架下实现风险感知的自适应推理控制。
链接: https://arxiv.org/abs/2604.27536
作者: Wenhao Yuan,Chenchen Lin,Jian Chen,Jinfeng Xu,Shuo Yang,Edith Cheuk Han Ngai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by KnowFM@ACL2026
Abstract:In black-box large language model (LLM) services, response reliability is often only partially observable at decision time, while stronger inference pathways incur substantial computational cost, inducing a budgeted sequential decision problem: for each request, the system should decide whether the default low-cost response is sufficiently reliable or whether additional computation should be allocated to improve response quality. In this paper, we propose \textbfVerifiable \textbfObservations for Risk-aware \textbfInference \textbfControl (\textscVeroic), a framework for adaptive inference control in black-box LLM settings, which formulates request-time control as a \textitpartially observable Markov decision process to capture partial observability and sequential budget coupling. It constructs a lightweight verifiable observation channel from the input-output pair by aggregating heterogeneous quality signals into a belief state over latent response reliability, which is then used by a budget-aware policy to decide whether to return the default output or trigger a higher-cost inference pathway. Experiments on diverse tasks show that \textscVeroic achieves improved quality-cost trade-offs, stronger risk estimation and calibration, and more robust long-horizon inference control than competitive baselines.
[AI-67] PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
【速读】:该论文旨在解决现有视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人控制中因采用监督式行为克隆(behavior cloning)预训练范式而忽视任务目标可达性与时间进度理解的问题,从而限制了长时程规划和零样本指令泛化能力。其解决方案的关键在于提出PRTS(Primitive Reasoning and Tasking System),通过将预训练重构为基于目标条件强化学习(Goal-Conditioned Reinforcement Learning)的框架,利用对比强化学习构建统一嵌入空间,使状态-动作与目标嵌入的内积近似表示对数折扣目标可达概率(log-discounted goal occupancy),从而定量评估物理可行性;该方法无需奖励标注即可从离线轨迹中提取密集的目标可达性监督信号,并通过角色感知因果掩码(role-aware causal mask)高效集成至视觉语言模型(VLM)骨干网络,显著提升策略对复杂、接触密集及长时程任务的执行成功率与泛化性能。
链接: https://arxiv.org/abs/2604.27472
作者: Yang Zhang,Jiangyuan Zhao,Chenyou Fan,Fangzheng Yan,Tian Li,Haitong Tang,Sen Fu,Xuan’er Wu,Qizhen Weng,Weinan Zhang,Xiu Li,Chi Zhang,Chenjia Bai,Xuelong Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 38 pages, 12 figures
Abstract:Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the fundamental nature of robot learning as a goal-reaching process that requires understanding temporal task progress. We present \textbfPRTS (\textbfPrimitive \textbfReasoning and \textbfTasking \textbfSystem), a VLA foundation model that reformulates pretraining through Goal-Conditioned Reinforcement Learning. By treating language instructions as goals and employing contrastive reinforcement learning, PRTS learns a unified embedding space where the inner product of state-action and goal embeddings approximates the log-discounted goal occupancy, the probability of reaching the language-specified goal from the current state-action, quantitatively assessing physical feasibility beyond static semantic matching. PRTS draws this dense goal-reachability supervision directly from offline trajectories without reward annotations, and folds it into the VLM backbone via a role-aware causal mask, incurring negligible overhead over vanilla behavior cloning. This paradigm endows the high-level reasoning system with intrinsic goal reachability awareness, bridging semantic reasoning and temporal task progress, and further benefits goal-conditioned action prediction. Pretrained on 167B tokens of diverse manipulation and embodied-reasoning data, PRTS reaches state-of-the-art performance on LIBERO, LIBERO-Pro, LIBERO-Plus, SimplerEnv, and a real-world suite of 14 complex tasks, with particularly substantial gains on long-horizon, contact-rich, and zero-shot novel-instruction settings, confirming that injecting goal-reachability awareness significantly improves both execution success and long-horizon planning of general-purpose robotic foundation policies.
[AI-68] Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主代理框架(Autonomous Agent Frameworks)在安全风险方面的研究碎片化问题,缺乏系统性分层分析的现状。其关键解决方案是提出一个四层安全架构,分别从上下文与指令层、工具与动作层、状态与持久层、生态与自动化层对代理系统的安全风险进行结构化梳理,并针对每一层总结代表性威胁和防御策略,从而揭示跨层威胁传播机制(如从输入操纵到不安全动作、状态污染及生态系统级影响),并指出未来研究需关注的研究失衡、长期评估缺失和生态信任模型薄弱等挑战,推动更系统化和集成化的安全防护体系发展。
链接: https://arxiv.org/abs/2604.27464
作者: Luyao Xu,Xiang Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures, 6 tables
Abstract:Autonomous agent frameworks built upon large language models (LLMs) are evolving into complex, tool-integrated, and continuously operating systems, introducing security risks beyond traditional prompt-level vulnerabilities. As this paradigm is still at an early stage of development, a timely and systematic understanding of its security implications is increasingly important. Although a growing body of work has examined different attack surfaces and defense problems in agent systems, existing studies remain scattered across individual aspects of agent security, and there is still a lack of a layered review on this topic. To address this gap, this survey presents a layered review of security risks and defense strategies in autonomous agent frameworks, with OpenClaw as a case study. We organize the analysis into four security-relevant layers: the context and instruction layer, the tool and action layer, the state and persistence layer, and the ecosystem and automation layer. For each layer, we summarize its functional role, representative security risks, and corresponding defense strategies. Based on this layered analysis, we further identify that threats in autonomous agent frameworks may propagate across layers, from manipulated inputs to unsafe actions, persistent state contamination, and broader ecosystem-level impact. Finally, we highlight potential key challenges, including research imbalance across layers, the lack of long-horizon evaluation, and weak ecosystem trust models, and outline future directions toward more systematic and integrated defenses.
[AI-69] Improving Graph Few-shot Learning with Hyperbolic Space and Denoising Diffusion
【速读】:该论文针对图少样本学习(Graph Few-Shot Learning)中存在的两个关键问题提出解决方案:一是元训练阶段在欧几里得空间中进行节点表示学习,难以捕捉真实图数据中固有的层次结构;二是元测试阶段仅基于少量支持样本拟合经验目标分布,而该分布可能与真实潜在分布存在显著偏差。解决方案的关键在于提出IMPRESS框架,通过在双曲空间(Hyperbolic Space)中学习节点表示以更好地建模层次结构,并利用去噪扩散机制(Denoising Diffusion)增强支持集分布,从而提升模型的泛化能力。理论分析表明,该方法可获得更紧的泛化界,实验验证其在多个基准数据集上均优于现有基线方法。
链接: https://arxiv.org/abs/2604.27462
作者: Yonghao Liu,Jialu Sun,Wei Pang,Fausto Giunchiglia,Ximing Li,Xiaoyue Feng,Renchu Guan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph few-shot learning, which focuses on effectively learning from only a small number of labeled nodes to quickly adapt to new tasks, has garnered significant research attention. Despite recent advances in graph few-shot learning that have demonstrated promising performance, existing methods still suffer from several key limitations. First, during the meta-training phase, these methods typically perform node representation learning in Euclidean space, which often fails to capture the inherently hierarchical structure existing in real-world graph data. Second, during the meta-testing phase, they usually fit an empirical target distribution derived from only a few support samples, even when this distribution significantly deviates from the true underlying distribution. To address these issues, we propose IMPRESS, a novel framework that IMproves graPh few-shot learning with hypeRbolic spacE and denoiSing diffuSion. Specifically, our model learns node representations in a hyperbolic space and enriches the support distribution through denoising diffusion mechanisms. Theoretically, IMPRESS achieves a tighter generalization bound. Empirically, IMPRESS consistently outperforms competitive baselines across multiple benchmark datasets.
[AI-70] RAY-TOLD: Ray-Based Latent Dynamics for Dense Dynamic Obstacle Avoidance with TDMPC
【速读】:该论文旨在解决高密度动态人群环境下自主移动机器人导航的挑战,尤其针对纯反应式规划方法(如基于模型预测路径积分的MPPI控制)在复杂场景中因预测时域有限而容易陷入局部最优的问题。解决方案的关键在于提出了一种混合控制架构——基于射线的任务导向潜在动力学(RAY-TOLD),其核心创新包括:利用激光雷达中心的潜在动力学模型将高维传感器数据编码为紧凑状态表示,从而学习终端价值函数和策略先验;同时引入策略混合采样策略,在MPPI候选轨迹中融合由学习得到的策略生成的轨迹,有效引导规划器向目标前进并保持运动学可行性。该方法结合了物理驱动的短时域滚动优化与强化学习带来的长时域意图感知,显著提升了导航的可靠性与安全性。
链接: https://arxiv.org/abs/2604.27450
作者: Seungho Han,Seokju Lee,Jeonguk Kang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures
Abstract:Dense, dynamic crowds pose a persistent challenge for autonomous mobile robots. Purely reactive planning methods, such as Model Predictive Path Integral (MPPI) control, often fail to escape local minima in complex scenarios due to their limited prediction horizon. To bridge this gap, we propose Ray-based Task-Oriented Latent Dynamics (RAY-TOLD), a hybrid control architecture that integrates obstacle information into latent dynamics and utilizes the robustness of physics-based MPPI with the long-horizon foresight of reinforcement learning. RAY-TOLD leverages a LiDAR-centric latent dynamics model to encode high-dimensional sensor data into a compact state representation, enabling the learning of a terminal value function and a policy prior. We introduce a policy mixture sampling strategy that augments the MPPI candidate population with trajectories derived from the learned policy, effectively guiding the planner towards the goal while maintaining kinematic feasibility. Extensive tests in a stochastic environment with high-density dynamic obstacles demonstrate that our method outperforms the MPPI baseline, reducing the collision rate. The results confirm that blending short-horizon physics-based rollouts with learned long-horizon intent significantly enhances navigation reliability and safety.
[AI-71] ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
【速读】:该论文旨在解决在连续时间和空间中生成随机过程(如视频、天气预报)时,如何基于部分观测条件(如首帧和末帧)进行建模的问题。现有方法(如扩散模型)存在三大局限:一是噪声到数据的演化无法捕捉物理时间相近状态间的结构相似性,且在低步数情况下积分不稳定;二是注入的随机噪声与物理时间流逝无关,导致动力学错误;三是难以对任意子集的状态(如不规则采样时间点或未来观测)进行条件建模。其解决方案的关键在于提出ABC(Any-Subset Autoregressive Models via Non-Markovian Diffusion Bridges),通过构建一个持续性的随机微分方程(SDE),使时间变量和中间状态能精确追踪真实时间和过程状态。该设计具有三个核心优势:(1)生成未来状态的起点为已接近的前一状态,而非无信息噪声;(2)噪声注入强度随物理时间流逝而缩放,促进符合物理规律的动力学;(3)通过路径空间上的测度变换推导出路径依赖的条件建模能力,支持对任意历史或未来状态子集的条件约束。
链接: https://arxiv.org/abs/2604.27443
作者: Gabe Guo,Thanawat Sornwanee,Lutong Hao,Elon Litman,Stefano Ermon,Jose Blanchet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Generating continuous-time, continuous-space stochastic processes (e.g., videos, weather forecasts) conditioned on partial observations (e.g., first and last frames) is a fundamental challenge. Existing approaches, (e.g., diffusion models), suffer from key limitations: (1) noise-to-data evolution fails to capture structural similarity between states close in physical time and has unstable integration in low-step regimes; (2) random noise injected is insensitive to the physical process’s time elapsed, resulting in incorrect dynamics; (3) they overlook conditioning on arbitrary subsets of states (e.g., irregularly sampled timesteps, future observations). We propose ABC: Any-Subset Autoregressive Models via Non-Markovian Diffusion Bridges in Continuous Time and Space. Crucially, we model the process with one continual SDE whose time variable and intermediate states track the real time and process states. This has provable advantages: (1) the starting point for generating future states is the already-close previous state, rather than uninformative noise; (2) random noise injection scales with physical time elapsed, encouraging physically plausible dynamics with similar time-adjacent states. We derive SDE dynamics via changes-of-measure on path space, yielding another advantage: (3) path-dependent conditioning on arbitrary subsets of the state history and/or future. To learn these dynamics, we derive a path- and time-dependent extension of denoising score matching. Our experiments show ABC’s superiority to competing methods on multiple domains, including video generation and weather forecasting.
[AI-72] AdaBFL: Multi-Layer Defensive Adaptive Aggregation for Bzantine-Robust Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在去中心化架构下易受投毒攻击(poisoning attacks)的问题,尤其是现有 Byzantine-robust 方法在应对多种攻击类型时难以实现平衡防御,或依赖服务器端拥有客户端数据集的局限性。其解决方案的关键在于提出一种多层自适应防御聚合机制(AdaBFL),该机制基于新颖的三层防御结构,能够动态调整不同防御算法的权重,从而有效应对复杂且多样化的恶意攻击行为;同时,在非凸目标函数和非独立同分布(non-iid)数据条件下,该方法具备理论收敛性保障。
链接: https://arxiv.org/abs/2604.27434
作者: Zehui Tang,Yuchen Liu,Feihu Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 24 pages
Abstract:Federated learning (FL) is a popular distributed learning paradigm in machine learning, which enables multiple clients to collaboratively train models under the guidance of a server without exposing private client data. However, FL’s decentralized nature makes it vulnerable to poisoning attacks, where malicious clients can submit corrupted models to manipulate the system. To counter such attacks, although various Byzantine-robust methods have been proposed, these methods struggle to provide balanced defense against multiple types of attacks or rely on possessing the dataset in the server. To deal with these drawbacks, thus, we propose an effective multi-layer defensive adaptive aggregation for Bzantine-robust federated learning (AdaBFL) based on a novel three-layer defensive mechanism, which can adaptively adjust the weights of defense algorithms to counter complex attacks. Moreover, we provide convergence properties of our AdaBFL method under the non-convex setting on non-iid data. Comprehensive experiments across multiple datasets validate the superiority of our AdaBFL over the comparable algorithms.
[AI-73] Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors
【速读】:该论文旨在解决本地离线微调(local fine-tuning)数据中敏感信息(如API密钥、个人身份标识和财务记录)在模型代码被攻破时仍可能被窃取的问题。现有基于预训练权重的被动投毒攻击因依赖概率语义前缀,难以捕获稀疏且高熵的目标秘密,无法有效应对此类场景。解决方案的关键在于识别并利用一个被忽视的供应链攻击向量——伪装成标准架构定义的模型代码,从而实现从被动权重污染到主动执行劫持的范式转变;其核心机制是通过在线张量规则匹配锁定动态计算流中的token级秘密,并借助值-梯度解耦技术隐蔽注入攻击梯度,克服梯度淹没问题以强制模型记忆秘密,同时首次实现通过黑盒查询验证攻击者可验证的秘密窃取,实验表明该方法在不损害主任务性能的前提下实现了超过98%的严格攻击成功率(Strict ASR),并能有效绕过差分隐私(DP-SGD)、语义审计和代码审计等防御措施。
链接: https://arxiv.org/abs/2604.27426
作者: Zi Li,Tian Zhou,Wenze Li,Jingyu Hua,Yunlong Mao,Sheng Zhong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Local fine-tuning datasets routinely contain sensitive secrets such as API keys, personal identifiers, and financial records. Although ‘‘local offline fine-tuning’’ is often viewed as a privacy boundary, we reveal that compromised model code is sufficient to steal them. Current passive pretrained-weight poisoning attacks, while effective for natural language, fundamentally fail to capture such sparse high-entropy targets due to their reliance on probabilistic semantic prefixes. To bridge this gap, we identify and exploit a practical but overlooked supply-chain vector – model code camouflaged as standard architectural definitions – to realize a paradigm shift from passive weight poisoning to active execution hijacking. We introduce a deterministic full-chain memorization mechanism: it locks onto token-level secrets in dynamic computation flows via online tensor-rule matching, and leverages value-gradient decoupling to stealthily inject attack gradients, overcoming gradient drowning to force model memorization. Furthermore, we achieve, for the first time, attacker-verifiable secret stealing through black-box queries that precisely distinguishes true leakage from hallucination. Experiments demonstrate that our method achieves over 98% Strict ASR without compromising the primary task, and can effectively bypass defense measures including DP-SGD, semantic auditing, and code auditing.
[AI-74] Robust Learning on Heterogeneous Graphs with Heterophily: A Graph Structure Learning Approach
【速读】:该论文旨在解决异质图(Heterogeneous Graph)中由于结构噪声导致的表示学习鲁棒性不足的问题,尤其是在节点类型和标签存在异质性(heterophily)且连接关系可能包含误导性或错误信息的情况下。其解决方案的关键在于提出一个统一框架——异质图统一学习(Heterogeneous Graph Unified Learning, HGUL),该框架通过三个互补模块协同建模:基于kNN的图构建模块用于恢复可靠的局部邻域结构;图结构学习模块自适应地过滤噪声边以优化邻接矩阵;以及异质亲和力学习模块,利用多项式图核扩展得到的亲和力矩阵捕捉类别层面的关系。这种联合建模异质性和噪声的方法显著提升了模型在干净图和不同噪声水平下的性能表现。
链接: https://arxiv.org/abs/2604.27387
作者: Yihan Zhang,Ercan E. Kuruoglu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Heterogeneous graphs with heterophily have emerged as a powerful abstraction for modeling complex real-world systems, where nodes of different types and labels interact in diverse and often non-homophilous ways. Despite recent advances, robust representation learning for such graphs remains largely unexplored, particularly in the presence of noisy or misleading connectivity. In this work, we investigate this problem and identify structural noise as a critical challenge that significantly degrades model performance. To address this issue, we propose a unified framework, Heterogeneous Graph Unified Learning (HGUL), which jointly handles heterophily and noisy graph structures. The framework consists of three complementary modules: a kNN-based graph construction module that recovers reliable local neighborhoods, a graph structure learning module that adaptively refines the adjacency by filtering noisy edges, and a heterogeneous affinity learning module that captures class-level relationships via an extended affinity matrix derived from a polynomial graph kernel. Extensive experiments on multiple datasets demonstrate that HGUL consistently outperforms existing methods on clean graphs and maintains strong robustness under varying levels of structural noise. The results further underscore the importance of jointly modeling heterophily and noise in heterogeneous graph learning.
[AI-75] Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems
【速读】:该论文旨在解决高风险场景下大型语言模型(Large Language Model, LLM)代理在运行时动态分配子任务给专业化子代理时的安全性与效率权衡问题。现有方法仅在设计阶段选择多代理架构或提供宽泛的经验准则,缺乏根据任务上下文变化实时调整安全-效率平衡的机制。其解决方案的核心是提出Safe Bilevel Delegation (SBD) 框架,将任务委派建模为双层优化问题:外层元权重网络 φ 学习随上下文变化的安全-效率权重 λ(s) ∈ [0,1],内层优化委托策略 π 并满足概率性安全约束 P(safe) = 1−δ;通过连续委托度 α ∈ [0,1] 控制决策权移交程度,实现从完全人工干预(α=0)到全自主执行(α=1)的平滑过渡。该框架还提供了三项理论保障:安全性单调性、内层策略收敛性以及责任传播边界,从而确保多跳委托链中每个代理的责任可量化且不超过上限。
链接: https://arxiv.org/abs/2604.27358
作者: Yuan Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As large language model (LLM) agents are deployed in high-stakes environments, the question of how safely to delegate subtasks to specialized sub-agents becomes critical. Existing work addresses multi-agent architecture selection at design time or provides broad empirical guidelines, but neither provides a runtime mechanism that dynamically adjusts the safety-efficiency trade-off as task context changes during execution. We propose Safe Bilevel Delegation (SBD), a formal framework for runtime delegation safety in hierarchical multi-agent systems. SBD formulates task delegation as a bilevel optimization problem: an outer meta-weight network phi learns context-dependent safety-efficiency weights lambda(s) in [0,1]; an inner loop optimizes the delegation policy pi subject to a probabilistic safety constraint P(safe) = 1-delta. The continuous delegation degree alpha in [0, 1] controls how much decision authority is transferred to each sub-agent, interpolating smoothly between full human override (alpha=0) and fully autonomous execution (alpha=1). We establish three theoretical results: (1) Safety Monotonicity–higher outer safety weight produces a weakly safer inner policy; (2) Inner Policy Convergence–projected gradient descent on the inner problem converges linearly under standard smoothness assumptions; (3) an Accountability Propagation bound that distributes responsibility across multi-hop delegation chains with a provable per-agent ceiling. We instantiate SBD in three high-stakes domains–medical AI (MIMIC-III), financial risk control (S and P 500), and educational agent supervision (ASSISTments)–specifying datasets, safety constraint sets, baselines, and evaluation protocols. This manuscript presents the formal framework and theoretical results in full; empirical validation following the protocols described herein is planned and will be reported in a forthcoming revision. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.27358 [cs.AI] (or arXiv:2604.27358v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.27358 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-76] ypeBandit: Type-Level Context Allocation and Reweighting for Effective Attribute Completion in Heterogeneous Graph Neural Networks
【速读】:该论文旨在解决异质图(heterogeneous graph)中缺失节点属性的问题,其核心挑战在于不同节点类型在属性补全任务中提供的有用信号存在显著差异,即“类型依赖的信息不对称”(type-dependent information asymmetry)。为应对这一问题,作者提出了一种轻量级、模型无关的解决方案 TypeBandit,其关键在于:通过拓扑感知初始化、类型级别的 bandit 采样机制与联合表示学习相结合,在有限的全局采样预算下,自适应地为每种节点类型选取代表性样本,并将这些类型摘要作为共享上下文信号用于表示构建。该方法不依赖特定的图神经网络架构,而是作为前置模块适配多种异质图神经网络(如 R-GCN、HetGNN、HGT 等),同时引入结构度先验与特征传播相结合的混合预训练策略以提升初始化可靠性,从而在资源受限且信息分布不均的情况下实现高效、稳定的属性补全性能。
链接: https://arxiv.org/abs/2604.27356
作者: Ta-Yang Wang,Rajgopal Kannan,Viktor Prasanna
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures
Abstract:Heterogeneous graphs are widely used to model multi-relational systems, but missing node attributes remain a major bottleneck for downstream learning. In this paper, we identify and formalize type-dependent information asymmetry: the phenomenon that different node types provide substantially different levels of useful signal for attribute completion. Motivated by this observation, we propose TypeBandit, a lightweight, model-agnostic methodology for heterogeneous attribute completion. TypeBandit combines topology-aware initialization, type-level bandit sampling, and joint representation learning. It allocates a finite global sampling budget across node types, samples representative nodes within each type, and uses the resulting sampled type summaries as shared contextual signals during representation construction. By operating at the type level rather than over each target node’s local neighborhood, TypeBandit keeps the adaptive state compact and practical for large heterogeneous graphs. A key advantage of TypeBandit is architectural flexibility. Rather than requiring a new heterogeneous graph neural network architecture, TypeBandit acts as a type-aware front end for representative heterogeneous GNN backbones, including R-GCN, HetGNN, HGT, and SimpleHGN. We further introduce a hybrid pretraining scheme that combines structural degree priors with feature propagation, yielding a more reliable initializer than degree-only pretraining. Under a fixed-split protocol on DBLP, IMDB, and ACM, TypeBandit provides dataset-dependent but practically meaningful gains. Additional ablation, stability, efficiency, semantic-propagation, and sampled OGBN-MAG experiments support TypeBandit as a practical strategy for heterogeneous attribute completion when type-specific information is unevenly distributed and sampling resources are limited. Comments: 17 pages, 4 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.27356 [cs.LG] (or arXiv:2604.27356v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.27356 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-77] CoAX: Cognitive-Oriented Attribution eXplanation User Model of Human Understanding of AI Explanations
【速读】:该论文试图解决的问题是:尽管可解释人工智能(Explainable AI, XAI)旨在提升用户对AI模型的理解与决策能力,但现有研究表明这一目标仍难以实现。其核心挑战在于人类认知机制与XAI方法之间的不匹配,导致用户无法有效利用解释信息进行推理。解决方案的关键在于通过认知建模识别并量化不同XAI方法(如无解释、特征重要性、特征归因)下用户在结构化数据上的推理策略,并以模型拟合人类决策行为的方式验证这些策略的有效性。研究进一步表明,基于认知建模的仿真能够替代高成本的人类实验,用于生成可检验的研究假设,从而为改进XAI解释设计提供实证依据,推动更易用、可理解的AI解释系统的发展。
链接: https://arxiv.org/abs/2604.27354
作者: Louth Bin Rawshan,Zhuoyu Wang,Brian Y. Lim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Explainable AI (XAI) aims to improve user understanding and decisions when using AI models. However, despite innovations in XAI, recent user evaluations reveal that this goal remains elusive. Understanding human cognition can help explain why users struggle to effectively use AI explanations. Focusing on reasoning on structured (tabular) data, we examined various reasoning strategies for different XAI methods (none, feature importance, feature attribution) in the decision task of anticipating AI decisions (i.e., forward simulation). We i) elicited reasoning strategies from a formative user study, and ii) collected decisions from a summative user study. Using cognitive modeling, we implemented the processes underlying each reasoning strategy and evaluated their alignment with human decision-making. We found that our models better fit human decisions than baseline machine learning proxies, providing insights into which reasoning strategies are (in)effective. We then demonstrate how the fitted model can be used to form hypotheses and investigate research questions that are costly to study with real human participants. This work contributes to debugging human understanding of XAI, informing the future development of more usable and interpretable AI explanations.
[AI-78] Profiles of AI Dependency: A Latent Class Analysis of Filipino Students Academic Competencies
【速读】:该论文试图解决菲律宾大学生对人工智能(Artificial Intelligence, AI)依赖程度日益增高所带来的基础学术能力下降问题,特别是其对批判性思维、写作能力、学习独立性、研究技能和学术参与度的潜在负面影响。解决方案的关键在于通过识别不同AI使用模式(如高度依赖型、选择性使用者等),提出教育政策应强化AI素养教育,并在课程设计中平衡技术应用与核心学术能力培养,以促进批判性思维发展和AI的伦理使用,从而缓解因过度依赖AI导致的学术技能退化风险。
链接: https://arxiv.org/abs/2604.27349
作者: Emerson Q. Fernando,Julius Ceazar G. Tolentino,Maria Anna D. Cruz,Jordan L. Salenga,Vernon Grace M. Maniago,Juvy C. Grume,Erika M. Pineda,Aileen P. De Leon,John Paul P. Miranda
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 4 pages, 1 table, conference proceedings, open access
Abstract:The increasing dependency among Filipino college students on artificial intelligence (AI) poses concerns about the potential decline of fundamental academic competencies. This study examines the extent of AI dependency and its perceived effects on students’ critical thinking, writing skills, learning independence, research skills, and academic engagement. Using a cross-sectional research design, data was collected from 651 students enrolled in higher education institutions (HEIs) in Pampanga, Philippines accredited by the Commission on Higher Education. The survey data was analyzed using Latent Class Analysis (LCA) to identify AI dependency patterns. Findings indicated that students show moderate to high AI dependency, specifically in research and writing tasks. LCA identified four distinct profiles: highly engaged independent learners, selective AI users, moderate AI users, and AI-dependent learners. Notably, AI-dependent learners demonstrated the weakest academic competencies, with significant dependency on AI-generated outputs. The study highlights the need to foster educational policies that integrate AI literacy while preserving essential academic skills. HEIs must also balance technological advancements with curriculum adaptations to promote critical thinking and ethical use of AI. Future research may explore the longitudinal impacts and intervention strategies to mitigate academic skill erosion caused by AI dependency.
[AI-79] Exploring the Adoption Intention in Using AI-Enabled Educational Tools Among Preservice Teachers in the Philippines: A Partial-Least Square Modeling
【速读】:该论文旨在解决职前教师在实习期间使用人工智能赋能教育工具(AI-enabled educational tools)的行为意向影响因素问题。其解决方案的关键在于基于统一技术接受与使用理论2(UTAUT2)构建模型,并引入计算机自我效能感、计算机焦虑和计算机趣味性等额外预测变量,通过结构方程建模发现:内在动机、认知和情感因素(如绩效期望和享乐动机)对行为意向具有最强预测作用,而外部或制度因素(如社会影响和促进条件)则作用有限甚至呈负向关系。因此,提升职前教师对技术的个人相关性感知、自信心和使用愉悦感是推动AI工具融入教师培养项目的核心策略。
链接: https://arxiv.org/abs/2604.27346
作者: Vanessa B. Sibug,Emerson Q. Fernando,Almer B. Gamboa,Roque Francis B. Dianelo,Agnes R. Regala,Joseph Alexander Bansil,Jan Henry B. Sunga,Vernon Grace M. Maniago,John Paul P. Miranda
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 tables, conference proceedings
Abstract:This study examines the factors influencing pre-service teachers’ behavioral intention to use AI-enabled educational tools during their practicum, using the Unified Theory of Acceptance and Use of Technology 2 (UTAUT2) as the theoretical framework. The model includes the core UTAUT2 constructs such as performance expectancy, effort expectancy, hedonic motivation, social influence, facilitating conditions, price value, and habit. It also incorporates additional predictors including computer self-efficacy, computer anxiety, and computer playfulness. Data were collected from 563 pre-service teachers using a structured questionnaire and analyzed using Partial Least Squares Structural Equation Modeling (PLS-SEM). The results indicate that performance expectancy and hedonic motivation are the strongest predictors of behavioral intention. Computer self-efficacy, computer anxiety, and computer playfulness significantly influenced effort expectancy, although effort expectancy did not directly predict behavioral intention. Performance expectancy was significantly predicted by extrinsic motivation, job fit, relative advantage, and outcome expectations. Constructs such as social influence and facilitating conditions showed limited or inverse effects. These findings suggest that internal motivational, cognitive, and emotional factors are more influential than external or institutional factors in shaping the adoption of AI-enabled tools. The study highlights the importance of promoting personal relevance, confidence, and enjoyment in teacher preparation programs to encourage technology integration.
[AI-80] Investigating More Explainable and Partition-Free Compositionality Estimation for LLM s: A Rule-Generation Perspective ACL2026
【速读】:该论文旨在解决现有组合泛化测试方法在评估大语言模型(Large Language Models, LLMs)组合性时存在的两个核心问题:一是仅关注输出结果而忽视模型对样本组合性的理解,导致可解释性不足;二是依赖数据集划分构造训练集中未见的组合测试集,易引发组合泄露(combination leakage)问题。解决方案的关键在于提出一种全新的规则生成视角(rule-generation perspective),要求LLMs生成用于数据映射的程序化规则,并基于复杂性理论对LLM的组合性进行量化估计。该方法不仅克服了传统测试的局限性,还为分析LLM的组合性特征提供了新范式。
链接: https://arxiv.org/abs/2604.27340
作者: Ziyao Xu,Cong Wang,Houfeng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 main conference
Abstract:Compositional generalization tests are often used to estimate the compositionality of LLMs. However, such tests have the following limitations: (1) they only focus on the output results without considering LLMs’ understanding of sample compositionality, resulting in explainability defects; (2) they rely on dataset partition to form the test set with combinations unseen in the training set, suffering from combination leakage issues. In this work, we propose a novel rule-generation perspective for compositionality estimation for LLMs. It requires LLMs to generate a program as rules for dataset mapping and provides estimates of the compositionality of LLMs using complexity-based theory. The perspective addresses the limitations of compositional generalization tests and provides a new way to analyze the compositionality characterization of LLMs. We conduct experiments and analysis of existing advanced LLMs based on this perspective on a string-to-grid task, and find various compositionality characterizations and compositionality deficiencies exhibited by LLMs.
[AI-81] Prag mos: A Process Agent ic Modeling System
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的业务流程建模方法中存在的黑箱化与复杂依赖推理不足的问题。现有方法多采用端到端自动化或交互式聊天机器人,难以应对流程建模中固有的复杂性及对可解释性的要求。论文提出的关键解决方案是构建一个分步、可解释且人机协同的混合建模框架:将建模任务分解为一系列可管理的步骤,每一步生成中间产物并明确记录决策依据;同时引入领域内专门工具来结构化处理行为关系,从而弥补LLM在复杂依赖推理上的局限。该方法通过透明、渐进式的协作流程,生成既合理又易于理解的过程模型,所提出的原型系统Pragmos验证了这一思路的可行性。
链接: https://arxiv.org/abs/2604.27311
作者: Pedro-Aarón Hernández-Ávalos,Luciano García-Bañuelos
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 22 pages
Abstract:The advent of Large Language Models (LLMs) has significantly transformed tasks across Software Engineering. In the context of Business Process Management, LLMs are now being explored as tools to derive process models directly from textual descriptions. Existing approaches range from chatbot-driven systems that assist with iterative, text-based modeling to fully automated end-to-end modeling assistants. However, we argue that process modeling is inherently complex and cannot be effectively addressed through black-box solutions. Instead, we envision modeling as an open-ended conversational activity, best supported by an interactive, iterative process involving both humans and LLM. In our approach, the modeling task is decomposed into smaller, manageable steps. Each step results in intermediate artifacts and explicitly documents the rationale behind each modeling decision. During this process, we incrementally uncover simple behavioral relations that guide the construction of the model. Given the current limitations of LLMs in reasoning about complex dependencies, we complement them with specialized tools developed in the field to structure process models based on behavioral relations. This hybrid approach enables the generation of sound, yet comprehensible models that evolve through transparent and explainable steps. In this paper, we present our research agenda and introduce Pragmos, a prototype system that operationalizes this vision. Pragmos demonstrates how LLMs can collaborate with human users as both domain and modeling experts to co-create evolving process models through a structured and explainable workflow.
[AI-82] End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians
【速读】:该论文旨在解决临床人工智能(Clinical AI)系统在部署后缺乏持续治理的问题,即如何实现对已上线系统的持续监控、评估、迭代与再评估,以确保其性能稳定性和安全性。解决方案的关键在于构建一个端到端的治理框架,整合了评分量表验证(rubric validation)、实时部署反馈(live deployment feedback)、技术性能监控(technical performance monitoring)和成本追踪,并通过受控实验机制在系统变更前进行验证,从而保障AI模型在真实医疗环境中的可靠演进。
链接: https://arxiv.org/abs/2604.27309
作者: Aaryan Shah,Andrew Hines,Alexia Downs,Denis Bajet,Paulius Mui,Fabiano Araujo,Laura Offutt,Aida Rutledge,Elizabeth Jimenez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures, 6 tables, submitted to npj Digital Medicine
Abstract:Clinical AI systems require not just point-in-time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re-evaluating performance throughout deployment. We present an end-to-end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures. Median processing time per audio segment was 8.1 seconds with a 99.6% effective completion rate after retry mechanisms absorbed transient model errors. These results demonstrate that continuous, multi-channel governance of deployed clinical AI is both achievable and effective.
[AI-83] BoostLoRA: Growing Effective Rank by Boosting Adapters
【速读】:该论文旨在解决参数高效微调(Parameter-efficient fine-tuning, PEFT)方法中适配器规模与表达能力之间的权衡问题:超低参数适配器受限于固定低秩子空间,即使延长训练也无法突破性能上限。其解决方案的关键在于提出BoostLoRA框架,通过梯度提升策略迭代训练并合并针对当前模型预测错误样本的最小适配器,同时采用ROTATE SVD基策略确保每轮适配器位于正交子空间,使累积有效秩随训练轮数线性增长,而单个适配器仍保持超低秩特性;训练完成后适配器被丢弃,不引入推理开销,从而实现参数成本与表示容量的解耦。
链接: https://arxiv.org/abs/2604.27308
作者: Raviteja Anantha,Nick Levato,Layne C. Price
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. Under review
Abstract:Parameter-efficient fine-tuning (PEFT) methods face a tradeoff between adapter size and expressivity: ultra-low-parameter adapters are confined to fixed low-rank subspaces, capping performance even with extended training. We propose BoostLoRA, a gradient-boosting framework that overcomes this limit by iteratively training and merging minimal adapters on the examples the current model gets wrong. A ROTATE SVD basis strategy assigns each round to an orthogonal subspace, so cumulative effective rank grows linearly with the number of rounds while each adapter remains ultra-low-rank. After merging, adapters are discarded, leaving zero inference overhead. On Qwen2.5-3B, BoostLoRA reaches 89.1% on GSM8K and 68.8% on MATH-500, surpassing both the best single-shot ultra-low parameter adapter (TinyLoRA) and full fine-tuning; on code generation it reaches 57.2% on MBPP and 80.4% on HumanEval while full fine-tuning drops below the zero-shot baseline. We also demonstrate cross-architecture transfer on protein binding classification with ESM2-650M and cross-entropy training. BoostLoRA is, to our knowledge, the first PEFT method whose effective rank grows with training, separating per-round parameter cost from total representational capacity.
[AI-84] METASYMBO: Multi-Agent Language-Guided Metamaterial Discovery via Symbolic Latent Evolution
【速读】:该论文旨在解决 metamaterial(超材料)设计中早期探索阶段缺乏有效工具的问题,即当研究人员仅能提供不完整约束和自然语言描述的定性意图时,现有基于数值目标的逆向设计方法难以适用,而大型语言模型(LLM)虽能理解语义意图却缺乏几何感知与物理属性有效性。解决方案的关键在于提出 MetaSymbO——一个由三个智能体组成的多智能体框架:Designer 负责解析自由形式的设计意图并检索语义一致的结构模板;Generator 在解耦的潜在空间中合成候选微结构;Supervisor 提供快速属性反馈以实现迭代优化。此外,论文创新性地引入符号驱动的潜在演化(symbolic-driven latent evolution),通过可编程算子对潜在因子进行组合、修改和精炼,从而在推理阶段生成具有高结构有效性和语义对齐性的新超材料构型,显著提升设计多样性与物理合理性。
链接: https://arxiv.org/abs/2604.27300
作者: Jianpeng Chen,Wangzhi Zhan,Dongqi Fu,Junkai Zhang,Zian Jia,Ling Li,Wei Wang,Dawei Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Metamaterial discovery seeks microstructured materials whose geometry induces targeted mechanical behavior. Existing inverse-design methods can efficiently generate candidates, but they typically require explicit numerical property targets and are less suitable for early-stage exploration, where researchers often begin with incomplete constraints and qualitative intents expressed in natural language. Large language models can interpret such intents, but they lack geometric awareness and physical property validity. To address this gap, we propose MetaSymbO, a multi-agent framework for language-guided Metamaterial discovery via Symbolic-driven latent evOlution. Specifically, MetaSymbO contains three agents: a Designer that interprets free-form design intents and retrieves a semantically consistent scaffold, a Generator that synthesizes candidate microstructures in a disentangled latent space, and a Supervisor that provides fast property-aware feedback for iterative refinement. To move beyond the limitations of reproducing known samples from literature and training data, we further introduce symbolic-driven latent evolution, which applies programmable operators over disentangled latent factors to compose, modify, and refine structures at inference time. Extensive experiments demonstrate that (i) MetaSymbO improves structural validity by up to 34% in symmetry and nearly 98% in periodicity compared to state-of-the-art baselines; (ii) MetaSymbO achieves about 6-7% higher language-guidance scores while maintaining superior structure novelty compared to advanced reasoning LLMs; (iii) qualitative analyses confirm the effectiveness of symbolic logic operators in enabling programmable semantic alignment; and (iv) realworld case studies on auxetic, high-stiffness metamaterial design further validate its practical capability.
[AI-85] Machine Collective Intelligence for Explainable Scientific Discovery
【速读】:该论文旨在解决从经验观测中自动推导可解释且具备外推能力的物理 governing equations(控制方程)这一长期科学挑战,这是当前生成式 AI (Generative AI) 在科学发现领域面临的核心瓶颈。其解决方案的关键在于提出“机器集体智能”(machine collective intelligence)这一统一范式,该范式融合符号主义(symbolism)与元启发式算法(metaheuristics)两大计算智能传统,通过多个推理代理(reasoning agents)协同进行符号假设的生成、评估、批判与整合,从而实现方程的自主演化与进化式发现。该方法无需人工设计领域知识,在确定性、随机性或未知动力学系统中均能自动恢复底层控制方程,并显著提升外推性能(误差降低达6个数量级),同时将模型参数量从数十万压缩至5–40个可解释参数。
链接: https://arxiv.org/abs/2604.27297
作者: Gyoung S. Na,Chanyoung Park
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:
Abstract:Deriving governing equations from empirical observations is a longstanding challenge in science. Although artificial intelligence (AI) has demonstrated substantial capabilities in function approximation, the discovery of explainable and extrapolatable equations remains a fundamental limitation of modern AI, posing a central bottleneck for AI-driven scientific discovery. Here, we present machine collective intelligence, a unified paradigm that integrates two fundamental yet distinct traditions in computational intelligence–symbolism and metaheuristics–to enable autonomous and evolutionary discovery of governing equations. It orchestrates multiple reasoning agents to evolve their symbolic hypotheses through coordinated generation, evaluation, critique, and consolidation, enabling scientific discovery beyond single-agent inference. Across scientific systems governed by deterministic, stochastic, or previously uncharacterized dynamics, machine collective intelligence autonomously recovered the underlying governing equations without relying on hand-crafted domain knowledge. Furthermore, the resulting equations reduced extrapolation error by up to six orders of magnitude relative to deep neural networks, while condensing 0.5-1 million model parameters into just 5-40 interpretable parameters. This study marks an important shift in AI toward the autonomous discovery of principled scientific equations.
[AI-86] Learning Rate Engineering: From Coarse Single Parameter to Layered Evolution
【速读】:该论文旨在解决深度神经网络训练中学习率调度策略的普适性与适应性问题,特别是如何在不同训练场景(如从头训练和微调)下实现更高效的参数更新。其核心挑战在于“迁移学习的不可能三角”:低层参数需小幅度更新以保留通用特征,而高层参数则需大幅调整以适配新任务。解决方案的关键是提出 Discriminative Adaptive Layer Scaling (DALS),一个统一框架,整合了相位自适应余弦调度、基于深度感知的Grokfast梯度过滤机制以及LARS风格的信任比策略,从而在层级和时间维度上动态优化学习率,避免传统方法在特定场景下的性能崩溃,同时兼顾合成数据上的高精度与下游任务中的鲁棒微调表现。
链接: https://arxiv.org/abs/2604.27295
作者: Ming-Hong Yao,Di Wang,Jian Cui,Jin-Yan Chen,Zi-Hao Cui,Fa Wang,Chen Wei,Qiu-Ye Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 5 figures, 3 tables
Abstract:Learning rate scheduling has evolved from the single global fixed rate of early SGD to sophisticated layer-wise adaptive strategies. We systematize this evolution into five generations: (Gen1) global fixed learning rates, (Gen2) global scheduling, (Gen3) parameter-level adaptation, (Gen4) layer-level differentiation, and (Gen5) joint layer-time scheduling. We trace the fundamental motivation behind each transition, showing how the shift from one-size-fits-all to tailoring by layer and time addresses the impossible trinity of transfer learning: lower layers require small updates to preserve general knowledge while higher layers need large updates to adapt to new tasks. Building on this taxonomy, we propose Discriminative Adaptive Layer Scaling (DALS), a unified framework that integrates phase-adaptive cosine scheduling, depth-aware Grokfast gradient filtering, and LARS-style trust ratios into a single coherent optimizer. We benchmark 18 strategies including three DALS variants across all five generations on five datasets: synthetic, CIFAR-10 (from scratch), RTE, TREC-6, and IMDb (fine-tuning). On synthetic, DALS achieves the best accuracy at 98.0%, while DALS-Fast reaches 90% in just 3 epochs. The cross-dataset analysis reveals striking regime-dependent patterns – no single strategy wins across all regimes. Critically, STLR+Discriminative, the ULMFiT champion, catastrophically fails on from-scratch tasks (43.6% on TREC-6 from scratch vs. 96.8% with RAdam), confirming that directional decay biases are harmful without pretrained features. DALS avoids either extreme, achieving the best synthetic result while maintaining competitive fine-tuning performance.
[AI-87] he Two Boundaries: Why Behavioral AI Governance Fails Structurally
【速读】:该论文旨在解决当前AI系统中治理边界(governance)与表达能力边界(expressiveness)分离所导致的结构性风险问题,即在实际部署的AI系统中,治理通常独立于功能实现,形成三个区域:受治理的能力(仅有的有效区域)、不受治理的能力(风险区)以及针对不存在能力的治理策略(剧场效应)。这种分离导致了不可控的风险和无效治理。解决方案的关键在于提出“共域治理”(coterminous governance)——即治理边界与表达能力边界完全重合的系统属性,这要求通过架构设计将计算逻辑与效果执行分离,使治理机制内嵌于执行流程而非作为附加层存在,从而从根本上消除风险和剧场效应的结构性必然性。
链接: https://arxiv.org/abs/2604.27292
作者: Alan L. McCann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures. Companion proofs: this https URL
Abstract:Every system that performs effects has two boundaries: what it can do (expressiveness) and what governance covers (governance). In nearly all deployed AI systems, these boundaries are defined independently, creating three regions: governed capabilities (the only useful region), ungoverned capabilities (risk), and governance policies that address non-existent capabilities (theater). Two of the three regions are failure modes. We focus on the governance of effects: actions that AI systems perform in the world (API calls, database writes, tool invocations). This is distinct from the governance of model outputs (content quality, bias, fairness), which operates at a different level and requires different mechanisms. We present a formal framework for analyzing this structural gap. Rice’s theorem (1953) proves the gap is undecidable in the general case for any Turing-complete architecture that attempts to govern effects behaviorally: no algorithm can decide non-trivial semantic properties of arbitrary programs, including the property “this program’s effects comply with the governance policy.” We define coterminous governance: a system property where the expressivenessboundary equals the governance boundary. We show that coterminous governance requires an architectural decision (separatingcomputation from effect) rather than a governance layer added after the fact. We show that structural governance under this separation subsumes separate governance infrastructure: governance checks become part of the execution pipeline rather than a second system running alongside it. We propose coterminous governance as the testable criterion for any AI governance system: either the two boundaries are provably identical, or risk and theater are structurally inevitable. Proofs are mechanized in Coq (454 theorems, 36 modules, 0 admitted).
[AI-88] Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence
【速读】:该论文旨在解决认知工作流系统(cognitive workflow systems)中治理结构的形式化建模与验证问题,核心目标是确保复杂智能系统在无限行为下的安全性和可预测性。其解决方案的关键在于构建一套基于共归纳(coinductive)逻辑的理论框架:通过定义共归纳安全谓词(gov_safe),形式化捕捉治理安全性,并证明其在不同层级间的一致性(Governance Invariance Theorem);同时利用充分性定理(Sufficiency Theorem)证明四种原子原语(code, reason, memory, call)足以表达任意离散智能系统的组合闭包,从而实现对系统行为的完备描述;此外,通过必要性定理(Necessity Theorem)揭示“reason”原语在语义判断任务中的数学必要性,进一步支撑架构透明性设计。整个体系通过Coq机械验证(12,000行代码,454个定理)与实际运行时验证(70,000次随机指令序列测试无差异)相结合,实现了从抽象模型到部署运行环境的端到端可信保障。
链接: https://arxiv.org/abs/2604.27289
作者: Alan L. McCann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 4 figures, 1 table. Code and proofs: this https URL
Abstract:We present five results in the theory of structural governance for cognitive workflow systems. Three are mechanized in Coq 8.19 using the Interaction Trees library with parameterized coinduction; two are proved on paper with explicit reductions. The Coinductive Safety Predicate (gov_safe) is a coinductive property that captures governance safety for infinite program behaviors, indexed by a boolean permission flag that is provably false for ungoverned I/O and true for governed interpretations (mechanized). The Governance Invariance Theorem establishes that governance is uniform across the meta-recursive tower: governance at level n+1 reduces to governance at level n by definitional equality of the type (mechanized). The Sufficiency Theorem proves that four atomic primitives (code, reason, memory, call) are expressively complete for any discrete intelligent system, formalized as compositional closure of a Kleisli category (mechanized). The Alternating Normal Form provides a canonical decomposition of any machine into alternating code and effect layers, with a confluent rewriting system (paper proof). The Necessity Theorem proves via explicit reduction to Rice’s theorem that an architecturally opaque component (the reason primitive) is mathematically necessary for problems requiring semantic judgment (paper proof). A sixth contribution connects the abstract model to the deployed runtime: the Verified Interpreter Specification formalizes the BEAM runtime’s trust, capability, and hash chain logic in Coq, then tests the running system against this specification using property-based testing with over 70,000 randomly generated directive sequences and zero disagreements. The mechanization comprises approximately 12,000 lines across 36 modules with 454 theorems and zero admitted lemmas.
[AI-89] he Inverse-Wisdom Law: Architectural Tribalism and the Consensus Paradox in Agent ic Swarms
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在复杂任务中因“群体智慧”假设而引发的逻辑偏差问题,即传统研究认为智能体协作会自然趋向于真理,但作者通过实证发现,当智能体间存在结构同质性时,系统反而可能因内部共识优先于外部逻辑真实性而加剧错误轨迹的稳定性。解决方案的关键在于揭示了“共识悖论”(Consensus Paradox)和“逆智慧定律”(Inverse-Wisdom Law),并提出以“异质性强制要求”(Heterogeneity Mandate)作为核心机制设计原则——强调终端群组完整性由合成器的接收逻辑决定,而非单纯依赖代理质量;同时定义了“部落主义系数”(Tribalism Coefficient)与“谄媚权重”(Sycophantic Weight)作为量化评估群组失效的核心参数,从而为构建具备鲁棒性的生成式 AI 群体架构提供理论依据与实践路径。
链接: https://arxiv.org/abs/2604.27274
作者: Dahlia Shehata,Ming Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As AI transitions toward multi-agent systems (MAS) to solve complex workflows, research paradigms operate on the axiomatic assumption that agent collaboration mirrors the “Wisdom of the Crowd”. We challenge this assumption by formalizing the Consensus Paradox: a phenomenon where agentic swarms prioritize internal architectural agreement over external logical truth. Through a 36 experiments encompassing 12,804 trajectories across three state-of-the-art (SOTA) benchmarks (GAIA, Multi-Challenge, and SWE-bench), we prove the Inverse-Wisdom Law: in kinship-dominant swarms, adding logical agents increases the stability of erroneous trajectories rather than the probability of truth. The introduction of additional logical audits converges the system toward a Logic Saturation where internal entropy hits zero while factual error hits unity. By evaluating the interaction between the 3 preeminent SOTA models (Gemini 3.1 Pro, Claude Sonnet 4.6, and GPT-5.4), we establish the Architectural Tribalism Asymmetry as a mechanistic law of transformer weights. We demonstrate that terminal swarm integrity is strictly gated by the synthesizer’s receptive logic, rather than aggregate agent quality. We define the Tribalism Coefficient and the Sycophantic Weight as the primary mechanistic determinants of swarm failure. Finally, we establish the Heterogeneity Mandate as a foundational safety requirement for resilient agentic architectures.
[AI-90] OptimusKG: Unifying biomedical knowledge in a modern multimodal graph
【速读】:该论文旨在解决当前生物医学知识图谱(Biomedical Knowledge Graphs, BKGs)在构建过程中存在的两大核心问题:一是多数知识图谱来源于非结构化文档,缺乏模式级约束(schema-level constraints);二是来自结构化资源的知识难以统一整合为一致的表示形式。其解决方案的关键在于提出OptimusKG,一个基于结构化与半结构化资源构建的多模态生物医学标注属性图(Labeled Property Graph, LPG),通过强制执行节点与边的顶层模式(top-level schema),同时保留分子、解剖、临床和环境等不同领域中的细粒度类型特异性属性、交叉引用及溯源信息,从而实现跨域知识的一致性表达与高保真存储。该方法显著提升了知识图谱的可验证性和实用性,且经PaperQA3多模态代理评估,70.0%的采样边获得文献证据支持,验证了其内容可靠性。
链接: https://arxiv.org/abs/2604.27269
作者: Lucas Vittor,Ayush Noori,Iñaki Arango,Joaquín Polonuer,Sam Rodriques,Andrew White,David A. Clifton,Marinka Zitnik
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Biomedical knowledge graphs (KGs) are widely used in the life sciences, yet many are derived from unstructured documents and therefore lack schema-level constrains, whereas graphs assembled from structured resources are difficult to harmonize into a unified representation. We present OptimusKG, a multimodal biomedical labeled property graph (LPG) built from structured and semi-structured resources to preserve factual, type-specific metadata across molecular, anatomical, clinical, and environmental domains. OptimusKG contains 190,531 nodes across 10 entity types, 21,813,816 edges across 26 relation types, and 67,249,863 property instances encoding 110,276,843 values across 150 distinct property keys, derived from 18 ontologies and controlled vocabularies. The graph enforces a top-level schema for nodes and edges and retains granular, type-specific properties, cross-references, and provenance across molecular, anatomical, clinical, and environmental domains. We assessed the validity of OptimusKG by evaluating whether graph relationships are supported by evidence from the scientific literature using a multimodal agent, PaperQA3. PaperQA3 identified supporting evidence for 70.0% of sampled edges, whereas 83.4% of sampled false edges received no supporting evidence. Edges without literature support were concentrated in associations derived from experimental and functional genomics resources, suggesting that OptimusKG captures biomedical knowledge that may precede synthesis in the scientific literature. OptimusKG is distributed as Apache Parquet files, providing a standardized resource for graph-based machine learning, knowledge-grounded retrieval with large language models, and biomedical discovery use cases such as hypothesis generation.
[AI-91] From Prompt to Physical Actuation: Holistic Threat Modeling of LLM -Enabled Robotic Systems
【速读】:该论文旨在解决当前LLM(Large Language Model)驱动的自主机器人系统中,传统网络安全威胁、对抗性感知攻击与对话安全威胁三类风险在感知-规划-执行(perception-planning-actuation)全链路中如何跨信任边界传播并协同作用的问题。现有研究分别关注其中某一类威胁,缺乏统一架构下的系统性分析。其解决方案的关键在于构建一个基于边缘-云架构的分层数据流图(Data Flow Diagram, DFD),并在六个跨越信任边界的交互点上应用STRIDE-per-interaction威胁分析方法,结合“传统网络威胁、对抗性威胁、对话威胁”三类威胁分类体系,首次实现了对LLM赋能机器人系统全链路威胁传导路径的建模与追踪。分析揭示了三类威胁在相同边界交叉点汇聚,并识别出三条从外部输入到物理执行的跨边界攻击链,分别暴露了语义验证缺失、跨模态翻译漏洞及无中介工具调用等关键架构缺陷。
链接: https://arxiv.org/abs/2604.27267
作者: Neha Nagaraja,Hayretdin Bahsi,Carlo R. da Cunha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Submitted to 23rd Annual International Conference on Privacy, Security, and Trust (PST2026)
Abstract:As large language models are integrated into autonomous robotic systems for task planning and control, compromised inputs or unsafe model outputs can propagate through the planning pipeline to physical-world consequences. Although prior work has studied robotic cybersecurity, adversarial perception attacks, and LLM safety independently, no existing study traces how these threat categories interact and propagate across trust boundaries in a unified architectural model. We address this gap by modeling an LLM-enabled autonomous robot in an edge-cloud architecture as a hierarchical Data Flow Diagram and applying STRIDE-per-interaction analysis across six boundary-crossing interaction points using a three-category taxonomy of Conventional Cyber Threats, Adversarial Threats, and Conversational Threats. The analysis reveals that these categories converge at the same boundary crossings, and we trace three cross-boundary attack chains from external entry points to unsafe physical actuation, each exposing a distinct architectural property: the absence of independent semantic validation between user input and actuator dispatch, cross-modal translation from visual perception to language-model instruction, and unmediated boundary crossing through provider-side tool use. To our knowledge, this is the first DFD-based threat analysis integrating all three threat categories across the full perception-planning-actuation pipeline of an LLM-enabled robotic system.
[AI-92] Self-Evolving Software Agents
【速读】:该论文旨在解决传统自主软件代理(autonomous agents)在设计时即固化目标、需求与能力,从而无法实现真正软件演化的局限性。解决方案的关键在于提出一种融合BDI(Belief-Decision-Intention)推理与大语言模型(Large Language Models, LLMs)的自演化软件代理架构(BDI-LLM),其中嵌入一个自动化演化模块,在代理的推理循环中持续从经验中提取新需求,并自动合成对应的设计与代码更新,从而实现目标、推理逻辑和可执行代码的自主进化。
链接: https://arxiv.org/abs/2604.27264
作者: Marco Robol,Paolo Giorgini
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous agents can adapt their behaviour to changing environments, but remain bound to requirements, goals, and capabilities fixed at design time, preventing genuine software evolution. This paper introduces self-evolving software agents, combining BDI reasoning with LLMs to enable autonomous evolution of goals, reasoning, and executable code. We propose a BDI-LLM architecture in which an automated evolution module operates alongside the agent’s reasoning loop, eliciting new requirements from experience and synthesizing corresponding design and code updates. A prototype evaluated in a dynamic multi-agent environment shows that agents can autonomously discover new goals and generate executable behaviours from minimal prior knowledge. The results indicate both the feasibility and current limits of LLM-driven evolution, particularly in terms of behavioural inheritance and stability.
[AI-93] AutoSurfer – Teaching Web Agents through Comprehensive Surfing Learning and Modeling
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在网页自动化任务中因高质量网页轨迹训练数据稀缺而导致的准确性不足问题。现有自动轨迹生成方法受限于仅基于主页的任务提案或随机探索策略,导致网站覆盖不全、任务合成存在幻觉或歧义,进而引发轨迹生成不完整且不可靠。其解决方案的关键在于提出AutoSurfer,通过三项核心创新实现突破:一是采用系统性的广度优先探索策略,维护已发现页面与操作轨迹队列,跨页面传播知识以避免冗余探索,并递归扩展多层级图形用户界面元素;二是利用探索轨迹引导任务合成,将复杂任务锚定于实际导航路径而非孤立动作或页面内容,从而减少幻觉;三是复用同一探索轨迹作为提示(hint),指导网页代理进行更精准可靠的轨迹精炼。这些机制共同实现了对网站动作空间的全面覆盖,并生成适用于特定网站大语言模型训练的数据。
链接: https://arxiv.org/abs/2604.27253
作者: Fazle Elahi Faisal,Qianhui Wu,Baolin Peng,Jianfeng Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 3 figures
Abstract:Recent advances in multimodal large language models (LLMs) have revolutionized web agents that can automate complex tasks on websites. However, their accuracy remains limited by the scarcity of high-quality web trajectory training data. Existing automatic trajectory generation methods suffer from incomplete website coverage due to homepage-based task proposals or random-walk exploration. Such methods often result in hallucinated or ambiguous task synthesis that lead to incomplete and unreliable trajectory generation. Here, we present AutoSurfer, a comprehensive web trajectory generator that addresses these limitations through three key innovations. First, AutoSurfer employs a systematic breadth-first exploration strategy that maintains a queue of discovered pages and action traces, propagates knowledge across pages to avoid redundant exploration, and recursively expands multi-level graphical user interface elements - closely resembling how a human would learn a new website. Second, AutoSurfer leverages the exploration trajectory to guide task synthesis, reducing hallucinations by grounding complex tasks in actual navigation paths rather than isolated actions or page content alone. Third, AutoSurfer uses the same exploration trajectory as hints to steer a web agent toward more accurate and reliable trajectory refinement. Together, these innovations enable AutoSurfer to comprehensively cover a website’s action space and generate data suitable for training website-specific LLMs. We evaluate AutoSurfer on the WebArena benchmark by fine-tuning Qwen2.5-VL-7B-Instruct and demonstrate that it outperforms state-of-the-art methods - Explorer, OS-Genesis, and SynthAgent - achieving up to 24.23% overall task completion accuracy compared to 19.59% for the best prior method. Further, task diversity analysis demonstrates that AutoSurfer yields a more diverse distribution of synthesized tasks.
[AI-94] Addressing the Reality Gap: A Three-Tension Framework for Agent ic AI Adoption
【速读】:该论文旨在解决生成式 AI(Generative AI)和更自主的代理型 AI(Agentic AI)系统在教育领域快速渗透与教育机构响应能力滞后之间的矛盾,尤其关注如何在实际教学场景中实现可持续、负责任的AI整合。其解决方案的关键在于提出一个包含三个核心张力的框架:实施可行性(Implementation Feasibility)、适应速度(Adaptation Speed)与使命一致性(Mission Alignment),用以指导决策者评估和设计覆盖K-12及高等教育的AI部署策略,确保技术应用既提升个性化学习效果,又不违背教育公平、隐私保护和教学完整性等基本价值。
链接: https://arxiv.org/abs/2604.27245
作者: Jason Fournier(Imagine Learning),Kacper Łodzikowski(Adam Mickiewicz University, Poznań, Poland)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: This is a preprint version of an edited book chapter to appear in Mayrath, M., J. Behrens, D. Robinson, (eds) (2026). Handbook of Generative AI in Education: Integrating Research into Practice, Springer
Abstract:Generative AI has rapidly entered education through free consumer tools, outpacing the ability of schools and universities to respond. Now a new wave of more autonomous agentic AI systems–with the capacity to plan and act towards goals–promises both greater educational personalization and greater disruption. This chapter argues that successfully navigating these innovations requires balancing three core tensions: (1) Implementation Feasibility, or the practical capacity to integrate AI sustainably into real classrooms; (2) Adaptation Speed, or the mismatch between fast-evolving AI capabilities and the slower pace of educational change; and (3) Mission Alignment, or the need to ensure AI applications uphold educational values such as equity, privacy, and pedagogical integrity. First, we review early evidence of generative and agentic AI in various sectors and in frontline education to illustrate these tensions in context. Then, we present a three-tension framework to guide decision-makers in evaluating and designing AI initiatives across K-12 and higher education. We provide examples of how the framework can be applied to plan responsible AI deployments, and we identify emerging trends–such as curriculum-linked AI agents and educator-informed AI design–along with open research directions. We conclude the chapter with recommendations for educational leaders to proactively engage with the opportunities and challenges of AI, so that this technology can be harnessed to enhance teaching and learning in the decade ahead.
[AI-95] Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在网络搜索任务中面临的双重挑战:一是对单一目标进行深度推理的能力不足,二是跨多个实体和异构来源的结构化聚合能力薄弱。现有系统在广度导向任务(如覆盖广泛且保持跨实体一致性)与深度导向任务(如处理长而分支的搜索轨迹)上均表现不佳。解决方案的关键在于提出 Web2BigTable,一个支持双模式(breadth- and depth-oriented)的多智能体框架,其核心创新为双层架构——上层调度器分解任务、下层工作者并行执行,并通过闭环“运行-验证-反思”机制结合可持久化的人类可读外部记忆,实现任务分解与执行的协同优化;同时,共享工作空间使部分发现可见,促进去重、冲突消解与覆盖率自适应调整,从而显著提升搜索准确率与结构化输出质量。
链接: https://arxiv.org/abs/2604.27221
作者: Yuxuan Huang,Yihang Chen,Zhiyuan He,Yuxiang Chen,Ka Yiu Lee,Huichi Zhou,Weilin Luo,Meng Fang,Jun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce \textbfWeb2BigTable, a multi-agent framework for web-to-table search that supports both regimes. Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. Through a closed-loop run–verify–reflect process, the framework jointly improves decomposition and execution over time via persistent, human-readable external memory, with self-evolving updates to each single-agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of \textbf38.50 ( 7.5\times the second best at 5.10), Row F1 of \textbf63.53 (+25.03 over the second best), and Item F1 of \textbf80.12 (+14.42 over the second best). It also generalises to depth-oriented search on XBench-DeepSearch, achieving 73.0 accuracy. Code is available at this https URL.
[AI-96] oward Personalized Digital Twins for Cognitive Decline Assessment: A Multimodal Uncertainty-Aware Framework
【速读】:该论文旨在解决神经退行性疾病中认知衰退个体异质性高导致的预后评估、临床试验设计和治疗规划困难的问题。其核心解决方案是提出一种个性化认知衰退评估数字孪生(Personalized Cognitive Decline Assessment Digital Twin, PCD-DT)框架,关键在于融合三个方法学组件:(1) 基于潜在状态空间模型的个体化时间动态建模,以捕捉稀疏、噪声大且不规则纵向数据中的个体轨迹;(2) 多模态融合机制,整合临床、生物标志物与影像特征;(3) 不确定性感知验证与自适应更新机制,确保数字孪生系统的鲁棒运行。此外,文中还引入条件生成模型用于增强数据稀缺的进展模式并支持压力测试,为个性化计算建模提供了可扩展且具备不确定性量化能力的架构基础。
链接: https://arxiv.org/abs/2604.27217
作者: Bulent Soykan,Gulsah Hancerliogullari Koksalmis,Hsin-Hsiung Huang,Laura J. Brattain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures
Abstract:Cognitive decline is highly heterogeneous across individuals, which complicates prognosis, trial design, and treatment planning. We present the Personalized Cognitive Decline Assessment Digital Twin (PCD-DT), a multimodal and uncertainty-aware framework for modeling patient-specific disease trajectories from sparse, noisy, and irregular longitudinal data. The framework combines three methodological components: (1) latent state-space models for individualized temporal dynamics, (2) multimodal fusion for clinical, biomarker, and imaging features, and (3) uncertainty-aware validation and adaptive updating for robust digital twin operation. We also outline how conditional generative models can support data augmentation and stress testing for underrepresented progression patterns. As a preliminary feasibility study, we analyze longitudinal TADPOLE trajectories and show clear separation between cognitively normal and Alzheimer’s disease cohorts in ADAS13, ventricle volume, and hippocampal volume over five years. We further conduct a multimodal next-visit prediction ablation using an LSTM sequence model on 3,003 visit-pair sequences derived from TADPOLE, where the combined cognitive plus MRI configuration achieves the lowest standardized RMSE for both ADAS13 (0.4419) and ventricle volume (0.5842), outperforming a Last Observation Carried Forward baseline. A Bayesian tensor modeling component for high-dimensional imaging fusion is also discussed. These results support the feasibility of the proposed architecture while also highlighting the need for stronger uncertainty calibration and longer-horizon predictive evaluation. The PCD-DT framework provides a principled starting point for personalized in silico modeling in neurodegenerative disease. This work positions PCD-DT as a foundational step toward clinically deployable, uncertainty-aware digital twin systems.
[AI-97] heory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在科研软件项目中因生成内容与实际代码、理论或实验结果脱节而导致的“幻觉累积”(hallucination accumulation)和“不同步”(desynchronization)问题,即模型生成的主张超出代码或理论支持范围,且各组件(如数学论证、可执行系统、基准测试和公开声明)常出现状态漂移。解决方案的关键在于提出 Comet-H——一个迭代式提示自动化机制(iterative prompt automaton),其通过将创意生成、实现、评估、验证和论文撰写统一为单一工作空间状态的耦合坐标,利用一个基于工作空间缺失信息的上下文感知小规模多臂赌博机(contextual bandit)策略进行提示选择,结合半衰期机制保留未完成任务,并在文档更新时自动校验代码与基准测试的一致性,从而实现透明、可追溯且无学习策略依赖的协同演化过程。
链接: https://arxiv.org/abs/2604.27209
作者: Halley Young,Nikolaj Björner
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models can now generate substantial code and draft research text, but research-software projects require more than either artifact alone. The mathematical thesis, executable system, benchmark surface, and public claims must mature together, yet often drift apart. We identify two LM-specific failure modes: hallucination accumulation, in which claims exceed what code or theory supports and unsupported assertions propagate across sessions; and desynchronization, in which code, theory, or the model’s own world model fall out of alignment. We propose Comet-H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. We frame prompt selection as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand-weighted linear score. This transparent scorer, paired with a fading record of unfinished work, bounds long-horizon follow-ups, requires no learned policy, and makes each prompt choice legible from the workspace. We created a portfolio of 46 research-software repositories across two dozen domains. We study A3 in depth, a Python static-analysis tool built entirely within the loop, which reaches (F1 = 0.768) on a 90-case benchmark, compared with a next-best baseline of 0.364. Across approximately 400 commits, we find that audit-and-contraction passes dominate the later phases of every successful trajectory. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.27209 [cs.SE] (or arXiv:2604.27209v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.27209 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-98] Evaluating TabPFN for Mild Cognitive Impairment to Alzheimers Disease Conversion in Data Limited Settings
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期预测中因纵向数据有限而导致的模型可靠性不足问题,特别是针对轻度认知障碍(Mild Cognitive Impairment, MCI)向AD转化的预测任务。其解决方案的关键在于引入TabPFN(Tabular Pre-Trained Foundation Network),一种基于预训练的表格式基础模型,在小样本场景下表现出优越性能;实验表明,TabPFN在训练样本量仅为50时仍能保持高AUC(0.892),显著优于XGBoost、随机森林、LightGBM等传统机器学习方法,证明了基础模型在数据稀缺条件下进行疾病预测的潜力。
链接: https://arxiv.org/abs/2604.27195
作者: Brad Ye,Bulent Soykan,Gulsah Hancerliogullari Koksalmis,Hsin-Hsiung Huang,Laura J. Brattain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures
Abstract:Accurate prediction of conversion from Mild Cognitive Impairment (MCI) to Alzheimers Diseases (AD) is essential for early intervention, however, developing reliable conversion predictive models is difficult to develop due to limited longitudinal data availability We evaluate TabPFN (Tabular Pre-Trained Foundation Network) against traditional machine learning methods for predicting 3 year MCI to AD conversion using the TADPOLE dataset derived from ADNI. Using multimodal biomarker features extracted from demographics, APOE4, MRI volumes, CSF markers, and PET imaging, we conducted an experimental comparison across varying training set sizes (N=50 to 1000) and models including XGBoost, Random Forest, LightGBM, and Logistic Regression. TabPFN achieved one the highest performance (AUC=0.892), outperforming LightGBM (AUC=0.860) and demonstrating advantages in low data settings. At N=50 training samples, TabPFN maintained strong AUC while the traditional machine learning models struggles at small training samples. These findings demonstrate that foundation models are promising for disease prediction in data limited scenarios, such as Alzheimers diseases.
[AI-99] Learning to Spend: Model Predictive Control for Budgeting under Non-Stationary Returns
【速读】:该论文旨在解决有限时域预算分配问题,将其建模为闭环经济控制问题,并评估滚动时域模型预测控制(MPC)相对于反应式预算分配策略的性能。其核心挑战在于:在存在执行噪声和操作约束的前提下,如何有效应对回报效率随时间演变的非平稳性。解决方案的关键在于识别回报动态是否具有可预测结构——当回报效率表现出可建模的时序规律时,MPC通过利用跨期权衡(intertemporal trade-offs)显著优于反应式基线;而若回报仅呈现不可预测的随机漂移,则MPC并无系统性优势。
链接: https://arxiv.org/abs/2604.27186
作者: Nilavra Pathak,Smriti Shyamal,Prasant Mhasker,Christopher Swartz
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
备注: 8 pages, 0 figures
Abstract:We study finite-horizon budget allocation as a closed-loop economic control problem and evaluate receding-horizon Model Predictive Control (MPC) relative to reactive budgeting policies. Budgets are allocated periodically under execution noise and operational constraints, while return efficiency may evolve over time. Using a controlled simulation framework motivated by digital marketing, we compare reactive pacing to MPC across environments with increasing degrees of non-stationarity. Our results show that non-stationarity alone does not justify predictive control. When return dynamics are stationary or evolve through unpredictable stochastic drift, MPC offers no systematic advantage over reactive baselines. By contrast, when return efficiency exhibits predictable structure over the planning horizon, that is captured through an underlying model, MPC consistently outperforms reactive budgeting by exploiting intertemporal trade-offs.
[AI-100] Preserving Temporal Dynamics in Time Series Generation
【速读】:该论文旨在解决生成式时间序列数据增强中因忽视时间动态性而导致的分布偏移(distribution shift)和时间漂移(temporal drift)问题,尤其是在多变量时间序列生成任务中,现有基于对抗训练的方法主要关注边际分布匹配,而未能有效保留原始序列中的时序依赖结构。解决方案的关键在于提出一种模型无关的马尔可夫链蒙特卡洛(Markov Chain Monte Carlo, MCMC)框架,通过强制合成序列在相邻时间点之间的转移统计量与真实数据一致,从而纠正条件生成模型在序列生成过程中累积的偏差,实现对时间动态性的显式建模与保持。
链接: https://arxiv.org/abs/2604.27182
作者: Ci Lin,Futong Li,Tet Yeap,Iluju Kiringa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time-series data augmentation plays a crucial role in regression-oriented forecasting tasks, where limited data restricts the performance of deep learning models. While Generative Adversarial Networks (GANs) have shown promise in synthetic time-series generation, existing approaches primarily focus on matching marginal data distributions and often overlook the temporal dynamics that naturally exist in the original multivariate time series. When generating multivariate time series, this mismatch leads to distribution shift and temporal drift, thereby degrading the fidelity of the synthetic sequences. In this work, we propose a model-agnostic Markov Chain Monte Carlo (MCMC)-based framework to mitigate distribution shift and preserve temporal dynamics in synthetic time series. We provide a theoretical analysis of how conditional generative models accumulate deviations under sequential generation and demonstrate that the MCMC algorithm can correct these discrepancies by enforcing consistency with empirical transition statistics between neighboring time points. Extensive experiments on the Lorenz, Licor, ETTh, and ILI datasets using RCGAN, GCWGAN, TimeGAN, SigCWGAN, and AECGAN demonstrate that the proposed MCMC framework consistently improves autocorrelation alignment, skewness error, kurtosis error, R ^2 , discriminative score, and predictive score. These results suggest that synthetic time series consistent with the original data require explicit preservation of transition laws rather than solely relying on adversarial distribution matching, thereby offering a principled direction for improving generative modeling of time-series data. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.27182 [cs.LG] (or arXiv:2604.27182v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.27182 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-101] What Suppresses Nash Equilibrium Play in Large Language Models ? Mechanistic Evidence and Causal Control
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在博弈论场景中偏离纳什均衡(Nash equilibrium)的问题,尤其是探究其内在机制是否具备纳什策略计算能力以及为何会抑制该行为。解决方案的关键在于通过自对弈与跨模型对弈实验揭示行为模式,并结合对32层Llama-3-8B模型的机械解码分析发现:对手历史信息在第一层即被高保真编码(探测准确率达96%),而纳什动作编码在整个网络中始终较弱(不超过56%),且不存在专门处理纳什策略的模块;模型在前向传播过程中隐式偏好纳什动作,但在最后几层因“亲社会性覆盖”(prosocial override)将其逆转,最终在第30层实现84%的合作概率;进一步通过注入学习到的纳什方向至残差流可双向调控行为,验证了该机制的可干预性。
链接: https://arxiv.org/abs/2604.27167
作者: Paraskevas V. Lekeas,Giorgos Stamatopoulos
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 5 figures, 4 tables
Abstract:LLM agents are known to deviate from Nash equilibria in strategic interactions, but nobody has looked inside the model to understand why, or asked whether the deviation can be reversed. We do both. Working with four open-source models (Llama-3 and Qwen2.5, 8B to 72B parameters) playing four canonical two-player games, we establish the behavioral picture through self-play and cross-play experiments, then open up the 32-layer Llama-3-8B model and examine what actually happens during a strategic decision. The mechanistic findings are clear. Opponent history is encoded with near-perfect fidelity at the first layer (96% probe accuracy) and consumed progressively by later ones, while Nash action encoding is weak throughout, never exceeding 56%. There is no dedicated Nash module. Instead, the model privately favors the Nash action through most of its forward pass, but a prosocial override concentrated in the final layers reverses this, reaching 84% probability of cooperation at layer 30. When we inject a learned Nash direction into the residual stream, the behavior shifts bidirectionally, confirmed through concept clamping. The behavioral experiments surface six scale- and architecture-dependent findings, the most notable being that chain-of-thought reasoning worsens Nash play in small models but achieves near-perfect Nash play above 70B parameters. The cross-play experiments reveal three phenomena invisible in self-play: a small model can unravel any partner’s cooperation by defecting early; two large models reinforce each other’s cooperative instincts indefinitely; and who moves first in a coordination game determines which Nash equilibrium the system reaches. LLMs do not lack Nash-playing competence. They compute it, then suppress it. Comments: 11 pages, 5 figures, 4 tables Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) MSC classes: 91A10, 68T07 ACMclasses: I.2.6; I.2.11; J.4 Cite as: arXiv:2604.27167 [cs.GT] (or arXiv:2604.27167v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2604.27167 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Paraskevas Lekeas [view email] [v1] Wed, 29 Apr 2026 20:16:54 UTC (199 KB)
[AI-102] Interval Orders Biorders and Credibility-limited Belief Revision
【速读】:该论文旨在解决传统信念修正(belief revision)框架中偏好序结构过于简化的局限性问题,即通常假设的全序(total preorder)无法充分刻画现实认知情境中的复杂信念调整过程。其解决方案的关键在于引入两种更广义的序结构:区间序(interval order)和双序(biorder),其中区间序通过为每个可能世界分配一个非负“区间”来表示可接受度,而双序进一步允许区间长度为负,从而捕捉信念间的不一致或不稳定状态(dissonance)。作者通过对这些序结构对应的信念修正算子进行公理化刻画,并提出一种基于可信度限制的修正机制(non-prioritised revision),使得修正结果满足一致性(Consistency)但不再保证成功性(Success),从而更贴近人类在面对新信息时可能先拒绝、再因额外解释而接受的认知行为模式。
链接: https://arxiv.org/abs/2604.27156
作者: Richard Booth,Ivan Varzinczak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Rational belief revision is commonly viewed as being based on a preference order between possible worlds, with the resulting new belief set being those sentences true in all the most preferred models of the incoming new information. Usually, such a preference order is taken to be a total preorder. Nevertheless, there are other, more general classes of ordering that can also be employed. In this paper, we explore two such classes that have been studied within the theory of rational choice but have seen limited or no application in belief revision. We begin with interval orders, introduced by Fishburn in the ‘80s, which associate with each possible world a nonnegative interval' of plausibility. We then move on to biorders, studied by Aleskerov, Bouyssou, and Monjardet, which generalise interval orders by allowing the intervals to have negative lengths, a feature that can be used to capture a notion of dissonance or instability. We provide axiomatic characterisations of these two resulting families of belief revision operators, as well as of two further families of interest that lie between interval orders and biorders. We show that while biorder-based revisions satisfy the Success postulate, they do not always yield consistent outputs. By modifying their definition to discard inputs that lead to inconsistency as incredible’, we derive new families of so-called non-prioritised revision that satisfy the Consistency postulate, but not the Success one. These families are linked to credibility-limited revision operators of Hansson et al., but for which the set of credible sentences does not satisfy the single-sentence closure condition. We argue that the biorder-based approach is well-suited for scenarios where an agent might initially reject new information, but may accept it when presented with additional explanation.
[AI-103] Step-level Optimization for Efficient Computer-use Agents
【速读】:该论文旨在解决当前计算机使用代理(Computer-use agents)在长时程图形用户界面(GUI)任务中计算资源分配效率低下的问题。现有系统普遍在每一步交互中都调用大型多模态模型,导致成本高、速度慢,而实际任务轨迹具有高度异质性:多数步骤为常规操作,可由轻量级策略处理;错误则集中出现在少数高风险时刻。论文提出一种事件驱动的分层式计算框架,其关键在于通过两个轻量级学习监控器实现动态算力调度:一是“卡顿监测器”(Stuck Monitor),用于检测近期推理-动作历史中的进度退化并触发恢复;二是“里程碑监测器”(Milestone Monitor),识别语义上有意义的检查点以捕捉隐性语义漂移(silent semantic drift)。该方案将始终运行前沿模型的模式转变为按需分配计算资源的自适应机制,且具备模块化与部署友好性,无需修改原有代理架构或重新训练大模型即可集成应用。
链接: https://arxiv.org/abs/2604.27151
作者: Jinbiao Wei,Kangqi Ni,Yilun Zhao,Guo Gan,Arman Cohan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite recent advances in benchmark performance, strong computer-use agents remain expensive and slow in practice, since most systems invoke large multimodal models at nearly every interaction step. We argue that this uniform allocation of compute is fundamentally inefficient for long-horizon GUI tasks. Such trajectories are highly heterogeneous: many steps are routine and can be handled reliably by smaller, cheaper policies, while errors tend to concentrate at a relatively small number of high-risk moments. Across computer-use benchmarks, these failures repeatedly take two forms: progress stalls, where the agent loops, repeats ineffective actions, or fails to make meaningful progress, and silent semantic drift, where the agent continues taking locally plausible actions after already deviating from the user’s true goal. To address this inefficiency, we propose an event-driven, step-level cascade for computer-use agents that runs a small policy by default and escalates to a stronger model only when lightweight learned monitors detect elevated risk. Our framework combines two complementary signals: a Stuck Monitor that detects degraded progress from recent reasoning-action history and triggers recovery, and a Milestone Monitor that identifies semantically meaningful checkpoints where sparse verification is most informative for catching drift. This design turns always-on frontier-model inference into adaptive, on-demand compute allocation over the course of an evolving interaction. The framework is modular and deployment-oriented: it can be layered on top of existing computer-use agents without changing the underlying agent architecture or retraining the large model.
[AI-104] Optimal Stop-Loss and Take-Profit Parameterization for Autonomous Trading Agent Swarm
【速读】:该论文旨在解决自主加密货币交易系统中退出策略(exit strategy)设计不合理的问题,即现有系统通常将入场逻辑作为优化重点,而对止损(stop-loss)和止盈(take-profit)等退出规则采用固定设定且缺乏系统性测试。其解决方案的关键在于通过大量历史交易数据(超过900笔)重新回放每笔交易,在多种替代退出策略下进行性能对比,从而识别出更优的退出参数配置。研究发现,改进后的退出策略能显著提升风险调整后收益,尤其体现在更严格的止损限制、更早的利润锁定以及更紧密的移动止损保护上,同时强调了在评估过程中需谨慎处理市场周期波动带来的偏差,例如采用随机化数据分割以减少特定异常市场阶段的影响。
链接: https://arxiv.org/abs/2604.27150
作者: Nathan Li,Aikins Laryea,Yigit Ihlamur
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, 3 tables
Abstract:Autonomous crypto trading systems often spend most of their design effort on finding entries, while exits are left to fixed rules that are rarely tested in a systematic way. This paper examines whether better stop-loss and take-profit settings can improve the performance of an autonomous trading agent swarm. Using more than 900 historical trades, we replay each trade under many alternative exit policies and compare results against the existing production setup. The study finds that exit design matters meaningfully: stronger configurations improve risk-adjusted performance and generally favor tighter loss limits, earlier profit capture, and closer trailing protection. The paper also discusses a key evaluation challenge: a purely chronological split was initially used, but the newest trades fell into an unusual war-driven market period that sharply distorted test results. To reduce the influence of that single episode, the main comparison was run on randomized data, with the drawbacks of doing so acknowledged explicitly. Overall, the paper presents a practical framework for tuning exit logic in a more disciplined and transparent way.
[AI-105] ConformaDecompose: Explaining Uncertainty via Calibration Localization
【速读】:该论文旨在解决传统置信预测(Conformal Prediction)方法在提供无分布假设的预测区间时,因依赖单一全局校准阈值而无法区分实例级不确定性来源的问题。具体而言,其混淆了不可约噪声(aleatoric uncertainty)与由训练数据异质性、模型局限或校准不匹配引起的可约不确定性(epistemic uncertainty),导致难以解释为何某个预测区间较宽或是否可缩小。解决方案的关键在于提出一种基于逐步局部化校准(progressive calibration localisation)的不确定性感知可解释性框架,通过诊断而非因果的方式分析校准诱导的epistemic不确定性的可约性:即随着校准支持区域逐步聚焦于测试实例,预测区间的收缩与稳定过程被量化,从而揭示不同任务下可约不确定性对整体不确定性的相对贡献,增强预测区间的可解释性而不改变原模型或覆盖保证。
链接: https://arxiv.org/abs/2604.27149
作者: Fatima Rabia Yapicioglu,Meltem Aksoy,Alberto Rigenti,Tuwe Löfström-Cavallin,Helena Löfström-Cavallin,Seyda Yoncaci,Luca Longo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This is the accepted author version of a paper to appear in the proceedings of the World Explainable AI Conference (Springer). The final version will be available via Springer. This manuscript introduces ConformaDecompose, a framework for instance-wise uncertainty explainability via calibration localisation
Abstract:Conformal Prediction provides distribution-free prediction intervals with guaranteed coverage, but its reliance on a single global calibration threshold obscures the sources of uncertainty at the instance level. In particular, it conflates irreducible noise with uncertainty induced by heterogeneous training data (aleatoric), model limitations, or calibration mismatch (epistemic), offering little insight into why an interval is wide or whether it could be reduced. We introduce an uncertainty-aware explainability framework that analyses the reducibility of calibration-induced epistemic conformal uncertainty via progressive calibration localisation for regression tasks. The approach is diagnostic rather than causal: it does not estimate true aleatoric or epistemic uncertainty, but explains how conformal intervals contract and stabilise as calibration support is localised around a test instance. Across benchmarks and real-world data, absolute reducible uncertainty aligns with epistemic proxies, while its relative contribution varies by task, revealing regimes hidden by interval width. This instance-level view complements conformal uncertainty, enhancing interpretability without altering the predictor or coverage.
[AI-106] How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
【速读】:该论文旨在解决生成模型中“引导”(guidance)问题,即如何生成符合用户指定奖励信号(如美学质量或与人类偏好对齐)的样本。现有方法要么计算成本高昂(多粒子、多步方案),要么依赖难以理解的近似。论文将引导问题重新建模为确定性最优控制问题,从而构建出一个涵盖现有方法的算法层次结构;其关键创新在于发现流映射(flow map)在最优解中自然出现,并据此提出无需训练、基于单轨迹的Flow Map Reward Guidance(FMRG)框架——该框架利用流映射同时完成采样积分与引导,仅需3次数值微分方程求解(NFEs)即可在文本到图像任务中达到或超越基线性能,相较之前最先进方法提速至少一个数量级。
链接: https://arxiv.org/abs/2604.27147
作者: Jerry Y. Huang,Justin Lin,Sheel Shah,Kartik Nair,Nicholas M. Boffi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In generative modeling, we often wish to produce samples that maximize a user-specified reward such as aesthetic quality or alignment with human preferences, a problem known as guidance. Despite their widespread use, existing guidance methods either require expensive multi-particle, many-step schemes or rely on poorly understood approximations. We reformulate guidance as a deterministic optimal control problem, yielding a hierarchy of algorithms that subsumes existing approaches at the coarsest level. We show that the flow map, an object of significant recent interest for its role in fast inference, arises naturally in the optimal solution. Based on this observation, we propose Flow Map Reward Guidance (FMRG): a training-free, single-trajectory framework that uses the flow map to both integrate and guide the flow. At text-to-image scale, FMRG matches or surpasses baselines across inverse problems, style transfer, human preferences, and VLM rewards with as few as 3 NFEs, giving at least an order-of-magnitude speedup in comparison to prior state of the art.
[AI-107] Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents
【速读】:该论文旨在解决本地部署的开源权重大语言模型(Large Language Models, LLMs)在自主渗透测试中,尤其是在Linux权限提升任务上性能显著低于云端受限权重模型(如GPT-4o)的问题。其核心挑战在于如何通过系统性干预手段弥补小规模开源模型在复杂安全任务中的能力差距。解决方案的关键在于设计并验证五种针对性增强策略:链式思维提示(chain-of-thought prompting)、检索增强生成(retrieval-augmented generation)、结构化提示(structured prompts)、历史压缩(history compression)以及反思分析(reflective analysis),这些方法作为 hackingBuddyGPT 的扩展模块被集成并评估。实验表明,启用这些干预后,本地部署的 Llama3.1 70B 模型可实现对83%测试漏洞的有效利用,甚至超越云端基线;而较小模型如 Llama3.1 8B 和 Qwen2.5 7B 在引导下亦能达到67%的成功率,其中基于反思的处理方式贡献最大,但漏洞发现仍是本地模型的主要瓶颈。
链接: https://arxiv.org/abs/2604.27143
作者: Benjamin Probst,Andreas Happe,Jürgen Cito
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent research has demonstrated the potential of Large Language Models (LLMs) for autonomous penetration testing, particularly when using cloud-based restricted-weight models. However, reliance on such models introduces security, privacy, and sovereignty concerns, motivating the use of locally hosted open-weight alternatives. Prior work shows that small open-weight models perform poorly on automated Linux privilege escalation, limiting their practical applicability. In this paper, we present a systematic empirical study of whether targeted system-level and prompting interventions can bridge this performance gap. We analyze failure modes of open-weight models in autonomous privilege escalation, map them to established enhancement techniques, and evaluate five concrete interventions (chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis) implemented as extensions to hackingBuddyGPT. Our results show that open-weight models can match or outperform cloud-based baselines such as GPT-4o. With our treatments enabled, Llama3.1 70B exploits 83% of tested vulnerabilities, while smaller models including Llama3.1 8B and Qwen2.5 7B achieve 67% when using guidance. A full-factorial ablation study over all treatment combinations reveals that reflection-based treatments contribute most, while also identifying vulnerability discovery as a remaining bottleneck for local models. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.27143 [cs.CR] (or arXiv:2604.27143v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2604.27143 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-108] RUST: A Framework for Decentralized AI Service v.0.1
【速读】:该论文旨在解决高风险场景下大型推理模型(Large Reasoning Models, LRM)与多智能体系统(Multi-Agent Systems, MAS)在可靠性验证方面的问题,现有集中式验证方法存在鲁棒性差、可扩展性弱、透明度低和隐私泄露四大局限。其核心解决方案是提出TRUST框架,关键创新包括:(i) 层级有向无环图(Hierarchical Directed Acyclic Graphs, HDAGs)将思维链(Chain-of-Thought)推理分解为五个抽象层级以实现并行分布式审计;(ii) DAAN协议通过因果交互图(Causal Interaction Graphs, CIGs)对多智能体交互进行投影,实现确定性的根因归因;(iii) 多层共识机制结合计算校验器、大语言模型(LLM)评估者与人类专家的加权投票,在30%恶意参与条件下仍保证正确性。该框架通过链上记录决策与隐私保护设计,实现了可信AI部署的透明性、鲁棒性和可审计性。
链接: https://arxiv.org/abs/2604.27132
作者: Yu-Chao Huang,Zhen Tan,Mohan Zhang,Pingzhi Li,Zhuo Zhang,Tianlong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Reasoning Models (LRMs) and Multi-Agent Systems (MAS) in high-stakes domains demand reliable verification, yet centralized approaches suffer four limitations: (1) Robustness, with single points of failure vulnerable to attacks and bias; (2) Scalability, as reasoning complexity creates bottlenecks; (3) Opacity, as hidden auditing erodes trust; and (4) Privacy, as exposed reasoning traces risk model theft. We introduce TRUST (Transparent, Robust, and Unified Services for Trustworthy AI), a decentralized framework with three innovations: (i) Hierarchical Directed Acyclic Graphs (HDAGs) that decompose Chain-of-Thought reasoning into five abstraction levels for parallel distributed auditing; (ii) the DAAN protocol, which projects multi-agent interactions into Causal Interaction Graphs (CIGs) for deterministic root-cause attribution; and (iii) a multi-tier consensus mechanism among computational checkers, LLM evaluators, and human experts with stake-weighted voting that guarantees correctness under 30% adversarial participation. We prove a Safety-Profitability Theorem ensuring honest auditors profit while malicious actors incur losses. All decisions are recorded on-chain, while privacy-by-design segmentation prevents reconstruction of proprietary logic. Across multiple LLMs and benchmarks, TRUST attains 72.4% accuracy (4-18% above baselines) and remains resilient against 20% corruption. DAAN reaches 70% root-cause attribution (vs. 54-63% for standard methods) with 60% token savings. Human studies validate the design (F1 = 0.89, Brier = 0.074). The framework supports (A1) decentralized auditing, (A2) tamper-proof leaderboards, (A3) trustless data annotation, and (A4) governed autonomous agents, pioneering decentralized AI auditing for safe, accountable deployment of reasoning-capable systems.
[AI-109] Unsupervised Electrofacies Classification and Porosity Characterization in the Offshore Keta Basin Using Wireline Logs CEC
【速读】:该论文旨在解决 offshore Keta Basin(加纳近海凯塔盆地)因岩心数据稀缺而难以进行电成因相(electrofacies)分析的问题。其解决方案的关键在于采用无监督机器学习方法,具体为在多变量测井数据空间中应用K-means聚类,并通过惯性(inertia)和轮廓系数(silhouette coefficient)等定量指标评估聚类结构,最终识别出四个具有地质意义的电成因相单元,其分布呈现与黏土含量、孔隙度及岩石骨架性质相关的深度连续特征,从而实现仅基于测井数据的可靠储层表征。
链接: https://arxiv.org/abs/2604.27126
作者: Hamdiya Adams,Theophilus Ansah-Narh,Daniel Kwadwo Asiedu,Bruce Kofi Banoeng-Yakubo,Marcellin Atemkeng,Thomas Armah,Richmond Opoku-Sarkodie,Rebecca Davis,Ezekiel Nii Noye Nortey
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
备注: 7 pages, 7 figures. Accepted to ICECET 2026
Abstract:This study presents an unsupervised machine learning workflow for electrofacies analysis in the offshore Keta Basin, Ghana, where core data are scarce. Six standard wireline logs from Well~C were analysed over a depth interval comprising approximately 11,195 samples. K-means clustering was applied in multivariate log space, with the clustering structure evaluated using inertia and silhouette diagnostics. Four clusters were identified, supported by an average silhouette coefficient of approximately 0.50 , indicating moderate but meaningful separation. The resulting electrofacies exhibit systematic, depth-continuous patterns associated with variations in clay content, porosity, and rock framework properties, forming a geological continuum from shale-dominated to cleaner sandstone-dominated units. The results demonstrate that log-only, unsupervised clustering supported by quantitative metrics provides a robust and reproducible framework for subsurface characterisation. The proposed workflow offers a practical tool for early-stage formation evaluation in frontier offshore basins and a foundation for future integrated studies.
[AI-110] PALCAS: A Priority-Aware Intelligent Lane Change Advisory System for Autonomous Vehicles using Federated Reinforcement Learning
【速读】:该论文旨在解决自动驾驶车辆(AV)在复杂交通环境中进行变道决策时,如何兼顾交通效率、安全性与多车协同的问题。传统方法多局限于单智能体或集中式多智能体系统,难以实现分布式协作下的优先级调度与动态环境适应。其解决方案的关键在于提出一种基于联邦强化学习的多智能体变道辅助系统 PALCAS,通过引入优先级感知的安全变道奖励函数,使车辆根据目的地紧迫性合理决策;同时采用参数化深度Q网络(PDQN)算法,在不共享原始数据的前提下实现多车间横向与纵向运动控制的有效协同,从而提升整体交通流性能。
链接: https://arxiv.org/abs/2604.27118
作者: Yassine Ibork,Nhat Ha Nguyen,Myounggyu Won,Lokesh Das
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a priority-aware intelligent lane change advisory system based on multi-agent federated reinforcement learning, namely PALCAS, for autonomous vehicles (AVs). While existing lane-change approaches typically focus on single-agent systems or centralized multi-agent systems, we introduce a federated reinforcement learning-based multi-agent lane change system prioritizing lane changing based on vehicle destination urgency. PALCAS incorporates a novel priority-aware safe lane-change reward function to enable judicious lane-change decisions in both mandatory and discretionary scenarios. PALCAS leverages the parameterized deep Q-network (PDQN) algorithm to facilitate effective cooperation among agents, enabling both lateral and longitudinal motion controls of AVs. Extensive simulations conducted using the SUMO traffic simulator and Mosaic V2X communication framework demonstrate that PALCAS significantly improves traffic efficiency, driving safety, comfort, destination arrival rates, and merging success rates compared to baseline methods.
[AI-111] Anomaly Detection in Soil Heavy Metal Contamination Using Unsupervised Learning for Environmental Risk Assessment
【速读】:该论文旨在解决加纳快速城市化地区土壤重金属污染的异常模式识别与风险评估难题,尤其针对非正规垃圾处置点存在的潜在健康威胁。其解决方案的关键在于采用无监督机器学习框架(包括孤立森林、主成分分析(PCA)重构误差和DBSCAN算法),从12个废物处置点及住宅对照区采集的78个土壤样本中自动识别出具有统计显著性的异常污染样本,并结合健康风险指数(如危害指数HI和终身致癌风险ILCR)进行验证。研究发现,基于多方法共识的异常检测能精准定位高风险区域(如S3站点),并揭示不同类型的污染模式(如Cu富集、Ni缺乏或Pb-Zn协同升高),从而实现对污染源的精细化识别与优先级排序,为环境管理提供数据驱动的决策依据。
链接: https://arxiv.org/abs/2604.27102
作者: Isaac Tettey Adjokatse,Samuel Senyo Koranteng,George Yamoah Afrifa,Theophilus Ansah-Narh,Marcellin Atemkeng,Joseph Bremang Tandoh,Kow Ahor Essel-Yorke,Richmond Opoku-Sarkodie,Rebecca Davis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Geophysics (physics.geo-ph)
备注: 7 pages, 6 figures, IEEE Conference
Abstract:Soil contamination by heavy metals poses a persistent environmental and public health concern in rapidly urbanising regions of Ghana, particularly at unregulated waste disposal sites. This study applies an unsupervised machine learning framework to detect and characterise anomalous heavy metal contamination patterns in soils from twelve waste sites and residential controls in the Central Region, of Ghana. Concentrations of eight metals (As, Cd, Cr, Cu, Hg, Ni, Pb, Zn) were analysed alongside standard health risk indices, including the Hazard Index (HI) and Incremental Lifetime Cancer Risk (ILCR). Isolation Forest and PCA reconstruction error each identified 12 anomalous samples ( 15.4% of 78 samples), while DBSCAN detected no density-isolated noise points. A consensus approach isolated six robust anomalies ( 7.7%) , all spatially concentrated at a single site (S3). Anomalies exhibited approximately 70 – 80% higher mean HI values than normal samples, with all consensus anomalies exceeding the HI =1 threshold. PCA reconstruction error showed a strong positive association with HI ( r \approx 0.8 ), indicating consistency between multivariate deviation and health risk. Three distinct anomaly types were identified: extreme Cu enrichment at S3, anomalously low Ni at S4/S5, and moderate multi-metal (Pb–Zn) co-elevation at S9–S12. The results demonstrate that unsupervised machine learning provides granular, objective insight beyond aggregate indices, enabling targeted site prioritisation and risk-informed environmental management.
[AI-112] hink it Run it: Autonomous ML pipeline generation via self-healing multi-agent AI
【速读】:该论文旨在解决机器学习(Machine Learning, ML)流水线自动化构建中的效率低、鲁棒性差和可解释性不足的问题,特别是在从数据集和自然语言(Natural Language, NL)目标出发实现端到端ML流程生成时面临的挑战。其解决方案的关键在于提出一个统一的五代理架构,集成代码增强型检索增强生成(code-grounded Retrieval-Augmented Generation, RAG)以理解微服务语义、基于多准则的可解释混合推荐机制用于任务调度、基于大语言模型(Large Language Model, LLM)的错误解析与执行历史自适应学习构成的自愈机制,从而在单一框架内实现高效、鲁棒且可解释的ML流水线自动构造。
链接: https://arxiv.org/abs/2604.27096
作者: Adela Bara,Gabriela Dobrita,Simona-Vasilica Oprea
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The purpose of our paper is to develop a unified multi-agent architecture that automates end-to-end machine learning (ML) pipeline generation from datasets and natural-language (NL) goals, improving efficiency, robustness and explainability. A five-agent system is proposed to handle profiling, intent parsing, microservice recommendation, Directed Acyclic Graph (DAG) construction and execution. It integrates code-grounded Retrieval-Augmented Generation (RAG) for microservice understanding, an explainable hybrid recommender combining multiple criteria, a self-healing mechanism using Large Language Model (LLM)-based error interpretation and adaptive learning from execution history. The approach is evaluated on 150 ML tasks across diverse scenarios. The system achieves an 84.7% end-to-end pipeline success rate, outperforming baseline methods. It demonstrates improved robustness through self-healing and reduces workflow development time compared to manual construction. The study introduces a novel integration of code-grounded RAG, explainable recommendation, self-healing execution and adaptive learning within a single architecture, showing that tightly coupled intelligent components can outperform isolated solutions.
[AI-113] End-to-end autonomous scientific discovery on a real optical platform
【速读】:该论文旨在解决如何实现生成式 AI (Generative AI) 在真实物理系统中完成从问题提出到实验验证的端到端自主科学发现的问题,尤其是针对非平凡物理机制的识别与实证验证。其解决方案的关键在于构建一个名为“求是发现引擎”(Qiushi Discovery Engine)的基于大语言模型(Large Language Model, LLM)的智能体系统,该系统通过整合非线性研究阶段、Meta-Trace记忆机制以及双层架构,实现了在长达数千次LLM推理、测量与修正动作中保持研究轨迹的自适应性和稳定性;该引擎不仅复现了已发表的传输矩阵实验并首次观测到相干阶结构,更在开放探索中自主提出并实验验证了一种新型光学双线性相互作用机制,该机制在结构上类比于Transformer注意力中的核心运算,为高速低功耗光计算硬件提供了新路径。
链接: https://arxiv.org/abs/2604.27092
作者: Shuxing Yang,Fujia Chen,Rui Zhao,Junyao Wu,Yize Wang,Haiyao Luo,Ning Han,Qiaolu Chen,Yuze Hu,Wenhao Li,Mingzhu Li,Hongsheng Chen,Yihao Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Optics (physics.optics)
备注: 25 pages, 4 figures
Abstract:Scientific research has long been human-led, driving new knowledge and transformative technologies through the continual revision of questions, methods and claims as evidence accumulates. Although large language model (LLM)-based agents are beginning to move beyond assisting predefined research workflows, none has yet demonstrated end-to-end autonomous discovery in a real physical system that produces a nontrivial result supported by experimental evidence. Here we introduce Qiushi Discovery Engine, an LLM-based agentic system for end-to-end autonomous scientific discovery on a real optical platform. Qiushi Engine combines nonlinear research phases, Meta-Trace memory and a dual-layer architecture to maintain adaptive and stable research trajectories across long-horizon investigations involving thousands of LLM-mediated reasoning, measurement and revision actions. It autonomously reproduces a published transmission-matrix experiment on a non-original platform and converts an abstract coherence-order theory into experimental observables, providing, to our knowledge, the first observation of this class of coherence-order structure. More importantly, in an open-ended study involving 145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes and 44 scripts, Qiushi Engine proposes and experimentally validates optical bilinear interaction, a physical mechanism structurally analogous to a core operation in Transformer attention. This AI-discovered mechanism suggests a route towards high-speed, energy-efficient optical hardware for pairwise computation. To our knowledge, this is the first demonstration of an AI agentic system autonomously identifying and experimentally validating a nontrivial, previously unreported physical mechanism, marking a milestone for research-level autonomous agents.
[AI-114] Efficient Training on Multiple Consumer GPUs with RoundPipe
【速读】:该论文旨在解决在消费级GPU上微调大语言模型(Large Language Models, LLMs)时,因显存受限和PCIe带宽瓶颈导致的训练效率低下问题。现有流水线并行(Pipeline Parallelism, PP)调度方法受制于“权重绑定”(weight binding)缺陷——即模型不同阶段(如语言模型头)分配到GPU后无法动态调整,导致整个流水线吞吐受限于负载最重的GPU,产生严重流水线空泡(pipeline bubbles)。其解决方案的关键在于提出RoundPipe,一种打破权重绑定约束的新颖流水线调度策略:将GPU视为无状态执行工作节点,通过轮询方式动态分发计算阶段,实现近零空泡的高效流水线;同时集成优先级感知的数据传输调度引擎、细粒度分布式事件同步协议及自动层划分算法,保障训练正确性与系统效率。实验表明,RoundPipe在8×RTX 4090服务器上相较最优基线提速1.48–2.16倍,并成功支持单机微调Qwen3-235B模型(31K序列长度)。
链接: https://arxiv.org/abs/2604.27085
作者: Yibin Luo,Shiwei Gao,Huichuan Zheng,Youyou Lu,Jiwu Shu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Github Repo: this https URL Project website: this https URL
Abstract:Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., the LM head is large) to GPUs limits the pipeline’s throughput to that of the GPU with the heaviest load, leading to severe pipeline bubbles. In this paper, we propose RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round-robin manner, achieving a near-zero-bubble pipeline. To ensure training correctness and system efficiency, RoundPipe integrates a priority-aware transfer scheduling engine, a fine-grained distributed event-based synchronization protocol, and an automated layer partitioning algorithm. Evaluations on an 8 \times RTX 4090 server demonstrate that RoundPipe achieves 1.48–2.16 \times speedups over state-of-the-art baselines when fine-tuning 1.7B to 32B models. Remarkably, RoundPipe enables LoRA fine-tuning of the Qwen3-235B model with 31K sequence length on a single server. RoundPipe is publicly available as an open-source Python library with comprehensive documentation. Comments: Github Repo: this https URL Project website: this https URL Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2604.27085 [cs.DC] (or arXiv:2604.27085v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.27085 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-115] When Your LLM Reaches End-of-Life: A Framework for Confident Model Migration in Production Systems
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在生产环境中因生命周期终止或需升级而引发的迁移问题,特别是在有限人工评估数据条件下如何可靠地比较不同模型性能。其解决方案的关键在于提出一种贝叶斯统计方法,通过将自动化评估指标与人工判断进行校准,从而在少量人工标注数据下实现对模型质量的置信度评估,确保迁移决策的科学性和可重复性。
链接: https://arxiv.org/abs/2604.27082
作者: Emma Casey,David Roberts,David Sim,Ian Beaver
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 12 pages with appendix
Abstract:We present a framework for migrating production Large Language Model (LLM) based systems when the underlying model reaches end-of-life or requires replacement. The key contribution is a Bayesian statistical approach that calibrates automated evaluation metrics against human judgments, enabling confident model comparison even with limited manual evaluation data. We demonstrate this framework on a commercial question-answering system serving 5.3M monthly interactions across six global regions; evaluating correctness, refusal behavior, and stylistic adherence to successfully identify suitable replacement models. The framework is broadly applicable to any enterprise deploying LLM-based products, providing a principled, reproducible methodology for model migration that balances quality assurance with evaluation efficiency. This is a capability increasingly essential as the LLM ecosystem continues to evolve rapidly and organizations manage portfolios of AI-powered services across multiple models, regions, and use cases.
[AI-116] Learning Rate Transfer in Normalized Transformers
【速读】:该论文旨在解决nGPT(Normalized Transformer)在模型宽度(width)、深度(depth)和token horizon之间缺乏学习率迁移(learning rate transfer)的问题,尽管其超参数已显式随模型规模缩放。解决方案的关键在于结合数值实验与对齐指数(alignment exponents)的合理运用,重新审视并改进了μP(μP approach)超参数迁移方法,从而提出一种新的nGPT参数化方式——νGPT。通过大量实证验证,νGPT实现了跨模型维度的学习率迁移,显著提升了训练效率与可扩展性。
链接: https://arxiv.org/abs/2604.27077
作者: Boris Shigida,Boris Hanin,Andrey Gromov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:The Normalized Transformer, or nGPT (arXiv:2410.01131) achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the \mu P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call \nu GPT. Through extensive empirical validation, we find \nu GPT exhibits learning rate transfer across width, depth, and token horizon.
[AI-117] NORACL: Neurogenesis for Oracle-free Resource-Adaptive Continual Learning
【速读】:该论文旨在解决持续学习(continual learning)中的稳定性-可塑性权衡问题,即模型在学习新任务时需保持对先前知识的稳定性,同时具备足够的可塑性以适应新任务。传统基于正则化的固定容量网络方法隐式依赖于一个预先设定的“最优架构”,而该架构对未来的任务流特性(如任务数量和特征空间重叠程度)具有未知性,导致在任务关系弱时资源枯竭、关系强时过度配置。解决方案的关键在于提出NORACL框架,通过模拟生物神经发生(neurogenesis)机制实现神经元的动态增长:模型从紧凑结构出发,仅在检测到表征饱和或可塑性饱和时扩展网络,从而按需分配资源。实验表明,NORACL在不同任务数量与几何分布下均优于或等同于预设最优参数量的静态基线,且具有可解释的生长模式——不相似任务主要扩展特征提取层,共享特征的任务则倾向于向后层增长,有效提升了持续学习的效率与灵活性。
链接: https://arxiv.org/abs/2604.27031
作者: Karthik Charan Raghunathan,Christian Metzner,Laura Kriener,Melika Payvand
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 23 pages, 6 figures and 3 tables
Abstract:In a continual learning setting, we require a model to be plastic enough to learn a new task and stable enough to not disturb previously learned capabilities. We argue that this dilemma has an architectural root. A finite network has limited representational and plastic resources, yet the required capacity depends on properties of the future task stream that are unknown: how many tasks will be encountered, and how much they overlap in feature space. Regularization-based methods preserve past knowledge within fixed-capacity architectures and therefore implicitly rely on an oracle architecture sized for this unknown future. When tasks are only weakly related, fixed architectures progressively run out of plastic resources; when tasks are few or strongly overlapping, models are often over-provisioned. Inspired by neurogenesis in biology, we propose NORACL to address the stability-plasticity dilemma by tackling the oracle architecture problem through neuronal growth. Starting from a compact network, NORACL grows only when needed by monitoring two complementary signals for representational and plasticity saturation. We evaluate NORACL against oracle-sized static baselines across varying task counts and geometries. Across all settings, NORACL achieves final average accuracies that are better than or on par with oracle-provisioned static baselines while using fewer parameters. Additionally, NORACL yields architectures with interpretable growth, i.e. dissimilar tasks predominantly expand feature-extraction layers, whereas tasks which rely on common features shift growth toward later feature-combination layers. Our analysis further explains why fixed-capacity networks lose plasticity as tasks accumulate, whereas NORACL creates fresh capacity for new tasks through growth. Together, these results show that adaptive neurogenesis pushes the stability-plasticity Pareto frontier of continual learning.
[AI-118] Automatic Causal Fairness Analysis with LLM -Generated Reporting
【速读】:该论文旨在解决AutoML(自动化机器学习)框架在实际应用中忽视训练数据及预测结果潜在不公平性的问题。其核心挑战在于如何在不依赖人工干预的情况下,自动评估和报告数据集层面的公平性,尤其是在存在混杂因素(confounders)和中介变量(mediators)时。解决方案的关键在于引入基于Plečko与Bareinboim提出的标准公平性模型(standard fairness model),通过因果推理中的反事实查询(counterfactual queries)来量化受保护特征(protected variable)对目标变量的影响,从而实现基于因果效应的公平性评估。该方法进一步通过闭式计算(closed-form computation)高效完成分析,并利用大语言模型(LLM)以零样本方式生成准确的公平性报告,显著优于直接由LLM进行人工分析的结果。
链接: https://arxiv.org/abs/2604.27011
作者: Alessia Berarducci,Eric Rossetto,Alessandro Antonucci,Marco Zaffalon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 6 figures
Abstract:AutoML, intended as the process of automating the application of machine learning to real-world problems, is a key step for AI popularisation. Most AutoML frameworks are not accounting for the potential lack of fairness in the training data and in the corresponding predictions. We introduce \textscFairMind, a software prototype aiming to automatise fairness analysis at the dataset level. We achieve that by resorting to the assumptions of the \emphstandard fairness model, recently proposed by Plečko and Bareinboim. This allows for a sound fairness evaluation in terms of causal effects, based on \emphcounterfactual queries involving the target, possibly confounders and mediators, and the different values of an input feature we regard as \emphprotected. After the necessary data preprocessing, the tool implements a closed-form computation of the effects. LLMs are consequently exploited to generate accurate reports on the fairness levels detected in the training dataset. We achieve that in a zero-shot setup and show by examples the expected advantages with respect to a direct analysis performed by the LLM. To favour applications, extensions to ordinal protected variable and continuous targets and novel decomposition results are also discussed.
[AI-119] Binary Spiking Neural Networks as Causal Models
【速读】:该论文旨在解决二值脉冲神经网络(Binary Spiking Neural Networks, BSNNs)的可解释性问题,即如何从像素级特征出发,提供对网络分类决策的逻辑可解释性。其解决方案的关键在于将BSNN的脉冲活动形式化为一个二值因果模型(binary causal model),并利用基于逻辑的求解方法(如SAT和SMT求解器)计算其归因解释(abductive explanations)。这种方法能够保证所发现的解释不包含完全无关的特征,从而在理论上优于依赖边际贡献估计的现有方法(如SHAP),提升了解释结果的准确性与可靠性。
链接: https://arxiv.org/abs/2604.27007
作者: Aditya Kar(CNRS, IRIT),Emiliano Lorini(CNRS, IRIT),Timothée Masquelier(CNRS, CERCO UMR5549)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We provide a causal analysis of Binary Spiking Neural Networks (BSNNs) to explain their behavior. We formally define a BSNN and represent its spiking activity as a binary causal model. Thanks to this causal representation, we are able to explain the output of the network by leveraging logic-based methods. In particular, we show that we can successfully use a SAT as well as a SMT solver to compute abductive explanations from this binary causal model. To illustrate our approach, we trained the BSNN on the standard MNIST dataset and applied our SAT-based and SMT-based methods to finding abductive explanations of the network’s classifications based on pixel-level features. We also compared the found explanations against SHAP, a popular method used in the area of explainable AI. We show that, unlike SHAP, our approach guarantees that a found explanation does not contain completely irrelevant features.
[AI-120] Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs
【速读】:该论文旨在解决系统性文献综述(Systematic Literature Review, SLR)中研究筛选阶段成本高、不一致性风险大且存在风险不对称性的问题,尤其是虚假阴性可能损害综述的有效性。其解决方案的关键在于通过统一实验协议系统评估大型语言模型(Large Language Models, LLMs)与传统分类器(如逻辑回归、支持向量机、随机森林和朴素贝叶斯)在筛选任务中的表现差异,重点考察LLMs的性能变异性、输入元数据(摘要、标题、关键词)配置的影响以及LLMs相对于经典模型的实际增益。结果表明,LLMs存在显著异质性和残余非确定性,摘要信息对性能至关重要,而添加标题或关键词并未带来稳健提升;相比传统模型,LLMs并无一致优势,其采用应基于可重复性、成本及元数据可用性的运营与治理约束,并辅以试点验证和明确报告变异性与输入配置。
链接: https://arxiv.org/abs/2604.27006
作者: Gilberto Sussumu Hida,Danilo Monteiro Ribeiro,Erika Yahata
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 16 pages, 12 figures. Earlier, shorter, conference-style version of a more comprehensive journal manuscript currently under review
Abstract:Context: Study screening in systematic literature reviews is costly, inconsistency-prone, and risk-asymmetric, since false negatives can compromise validity. Despite rapid uptake of Large Language Models (LLMs), there is limited evidence on how such models behave during the study screening phase, particularly regarding the choice of specific LLMs and their comparison with classical models. Objective: To assess LLM performance and variability in screening, quantify the impact of input metadata (abstract, title, keywords), and compare LLMs with classical classifiers under a shared protocol. Methods: We analyzed 12 LLMs from 4 providers (OpenAI, Google Gemini, Anthropic, Llama) and 4 classical models (Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes) on 2 real Systematic Literature Reviews (SLRs), totaling 518 papers. The experimental design investigated 3 critical dimensions: (i) LLMs performance variability, (ii) the impact of input feature composition (abstract, title, and keywords) on LLM performance, and (iii) the real gain of using LLMs instead of more traditional classification models. Results: LLMs exhibited substantial heterogeneity and residual non-determinism even at temperature zero. Abstract availability was decisive: removing it consistently degraded performance, while adding title and/or keywords to the abstract yielded no robust gains. Compared to classical models, performance differences were not consistent enough to support generalizable LLM superiority. Discussion: LLM adoption should be justified by operational and governance constraints (reproducibility, cost, metadata availability), supported by pilot validation and explicit reporting of variability and input configuration.
[AI-121] When Continual Learning Moves to Memory: A Study of Experience Reuse in LLM Agents
【速读】:该论文旨在解决生成式 AI(Generative AI)在持续学习(continual learning)场景下面临的稳定性-可塑性权衡问题(stability-plasticity dilemma)。传统方法通过更新模型参数来适应新任务,但易导致灾难性遗忘;而记忆增强的大语言模型(Memory-augmented LLM)代理则试图将经验存储于外部记忆中以规避此问题。论文指出,这一策略并未真正消除挑战,而是将瓶颈从参数更新转移到了记忆访问层面:在有限上下文窗口下,旧与新经验在检索时发生竞争,从而引发新的遗忘或负迁移问题。其解决方案的关键在于提出一个 (k,v) 框架,该框架解耦了外部记忆的两个核心设计维度——经验表示(representation)和组织方式(organization for retrieval),并通过 ALFWorld 和 BabyAI 中的序列任务实验发现:抽象的过程型记忆比详细轨迹更利于迁移,且细粒度的记忆组织并非普遍有益,反而可能在提升正向迁移的同时加剧对困难任务的遗忘。这表明,持续学习的本质问题并未消失,而是转化为如何优化记忆的表征与检索机制的设计问题。
链接: https://arxiv.org/abs/2604.27003
作者: Qisheng Hu,Quanyu Long,Wenya Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Working in progress
Abstract:Memory-augmented LLM agents offer an appealing shortcut to continual learning: rather than updating model parameters, they accumulate experience in external memory, seemingly sidestepping the stability-plasticity dilemma of parametric learning. We show that this challenge does not disappear but resurfaces at the memory level. Under a limited context window, old and new experiences compete during retrieval, relocating the continual-learning bottleneck from parameter updates to memory access. To study this phenomenon, we introduce a (k,v) framework that disentangles two fundamental design axes of external memory: how experience is represented and how it is organized for retrieval. Across sequential-task experiments in ALFWorld and BabyAI, we find that abstract procedural memories transfer more reliably than detailed trajectories, while negative transfer disproportionately harms the hard cases. Moreover, finer-grained memory organization is not universally beneficial: designs that yield strong forward transfer can simultaneously induce severe forgetting. Together, these results reveal that external memory does not resolve the continual-learning problem; it reshapes it into a problem of memory representation and retrieval design.
[AI-122] Compositional Meta-Learning for Mitigating Task Heterogeneity in Physics-Informed Neural Networks
【速读】:该论文旨在解决参数化偏微分方程(Parameterized Partial Differential Equations, PDEs)族中,传统物理信息神经网络(Physics-Informed Neural Networks, PINNs)在多任务场景下因训练成本高、跨任务迁移敏感且易受负迁移影响而难以高效泛化的问题。解决方案的关键在于提出一种可组合的框架——学习亲和力自适应模块化物理信息神经网络(Learning-Affinity Adaptive Modular Physics-Informed Neural Network, LAM-PINN),其通过引入基于简短迁移会话的学习亲和力度量构建任务表征,并据此对任务进行聚类,即使仅使用坐标输入也能实现有效分组;模型进一步分解为簇特异性子网络与共享元网络,并学习路由权重以选择性地复用模块,从而替代单一全局初始化策略,显著提升资源受限工程场景中对未见配置的泛化性能。
链接: https://arxiv.org/abs/2604.26999
作者: Beomchul Park,Minsu Koh,Heejo Kong,Seong-Whan Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by Pattern Recognition
Abstract:Physics-informed neural networks (PINNs) approximate solutions of partial differential equations (PDEs) by embedding physical laws into the loss function. In parameterized PDE families, variations in coefficients or boundary/initial conditions define distinct tasks. This makes training individual PINNs for each task computationally prohibitive, while cross-task transfer can be sensitive to task heterogeneity. While meta-learning can reduce retraining cost, existing methods often rely on a single global initialization and may suffer from negative transfer, particularly under feature-scarce coordinate inputs and limited training-task availability. We propose the Learning-Affinity Adaptive Modular Physics-Informed Neural Network (LAM-PINN), a compositional framework that leverages task-specific learning dynamics. LAM-PINN combines PDE parameters with learning-affinity metrics from brief transfer sessions to construct a task representation and cluster tasks even with coordinate-only inputs. It decomposes the model into cluster-specialized subnetworks and a shared meta network, and learns routing weights to selectively reuse modules instead of relying on a single global initialization. Across three PDE benchmarks, LAM-PINN achieves an average 19.7-fold reduction in mean squared error (MSE) on unseen tasks using only 10% of the training iterations required by conventional PINNs. These results indicate its effectiveness for generalization to unseen configurations within bounded design spaces of parameterized PDE families in resource-constrained engineering settings.
[AI-123] People-Centred Medical Image Analysis
【速读】:该论文旨在解决当前数据驱动的医学人工智能(Medical AI)系统在临床实践中采纳率低的问题,其核心症结在于:一方面,现有模型缺乏跨人群的公平性优化,可能导致性能偏差引发监管障碍;另一方面,AI与临床工作流程的整合不足,易扰乱医生日常操作、削弱人机协作质量,并降低临床医生对AI工具的接受度。解决方案的关键是提出以人为中心的医学图像分析框架(People-Centred Medical Image Analysis, PecMan),通过一个动态门控机制,在考虑医生工作负荷约束的前提下,智能分配病例至AI、人类专家或二者协同处理,从而同时优化诊断准确性、公平性和工作流有效性。该框架还配套引入了公平性与以人为本的AI评估基准(Fairness and Human-Centred AI, FairHAI),用于量化评估三者之间的权衡关系,实验表明PecMan显著优于现有方法,为构建更可信且可落地的临床AI系统提供了新路径。
链接: https://arxiv.org/abs/2604.26991
作者: Zheng Zhang,Milad Masroor,Cuong Nguyen,Tahir Hassan,Yuanhong Chen,David Rosewarne,Kevin Wells,Thanh-Toan Do,Gustavo Carneiro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to Medical Image Analysis
Abstract:Recent advances in data-centric medical AI have produced highly accurate diagnostic systems, but the emphasis on data curation and performance metrics has not translated into widespread clinical adoption. We conjecture that this limited uptake stems from insufficient attention dedicated to the optimisation of fair performance across diverse patient populations and to workflow integration: performance biases can create regulatory barriers, and poorly integrated automation can disrupt clinical routines, degrade the quality of human-AI collaboration, and reduce clinicians’ willingness to adopt AI tools. Prior work on workflow integration (e.g., Learning to Defer (L2D) and Learning to Complement (L2C)) and AI fairness has typically examined these challenges in isolation, overlooking their natural interdependence and the practical constraints of clinical environments, such as restricted clinician availability. We propose People-Centred Medical Image Analysis (PecMan), a human-AI framework that jointly optimises fairness, diagnostic accuracy, and workflow effectiveness through a dynamic gating mechanism that assigns cases to AI, clinicians, or both under clinician workload constraints. We also introduce the Fairness and Human-Centred AI (FairHAI) benchmark for evaluating trade-offs between accuracy, fairness, and clinician workload. Experiments using this benchmark show that PecMan consistently outperforms existing methods, paving the way for more trustworthy and clinically viable AI systems. Code will be available upon paper acceptance.
[AI-124] Simple Self-Conditioning Adaptation for Masked Diffusion Models
【速读】:该论文旨在解决标准掩码扩散模型(Masked Diffusion Models, MDMs)在迭代去噪过程中因掩码位置重复依赖掩码标记(mask token)进行预测而导致的跨步骤优化能力受限问题。其核心解决方案是提出一种轻量级的后训练适配方法——自条件掩码扩散模型(Self-Conditioned Masked Diffusion Models, SCMDM),该方法在每个去噪步骤中引入模型自身先前生成的干净状态预测作为条件输入,从而增强跨步信息传递与精炼能力。关键创新在于无需修改网络结构、不引入递归潜在状态路径或辅助参考模型,且采样时无额外去噪器评估开销,相较于从头训练的部分自条件策略更优,尤其在模型自生成预测具备信息量后,专一化于精炼任务优于混合条件与无条件目标的设计。
链接: https://arxiv.org/abs/2604.26985
作者: Michael Cardei,Huu Binh Ta,Ferdinando Fioretto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked diffusion, if a token remains masked after a reverse update, the model discards its clean-state prediction for that position. Thus, still-masked positions must be repeatedly inferred from the mask token alone. This design choice limits cross-step refinement. To address this limitation, this paper proposes a simple, yet effective, post-training adaptation for MDMs that conditions each denoising step on the model’s own previous clean-state predictions. The resulting method, called Self-Conditioned Masked Diffusion Models (SCMDM), requires minimal architectural change, does not introduce a recurrent latent-state pathway, does not rely on an auxiliary reference model, and adds no extra denoiser evaluations during sampling. This is an important departure from partial self-conditioning approaches which requires expensive model training from scratch. In particular, the paper shows that partial self-conditioning, including the commonly used 50% dropout strategy for training self-conditioned models from scratch, is suboptimal in the post-training regime. Instead, once the model’s self-generated clean-state estimates become informative, the specialization to refinement is preferable to mixing conditional and unconditional objectives. SCMDM is evaluated across multiple domains, demonstrating consistent improvement over vanilla MDM baselines, achieving nearly a 50% reduction in generative perplexity on OWT-trained models (42.89 to 23.72), alongside strong improvements in discretized image synthesis quality, small molecular generation, and enhanced fidelity in genomic distribution modeling.
[AI-125] Multibit neural inference in a N-ary crossbar architecture
【速读】:该论文旨在解决基于内存计算(In-memory Computing, IMC)架构中神经网络推理的能效与精度权衡问题,特别是针对由多态磁阻随机存取存储器(Magnetic Tunnel Junction, MTJ)构成的交叉阵列在实现矩阵-向量乘法(Matrix-Vector Multiplication, MVM)时所面临的误差来源及其优化策略。其解决方案的关键在于构建一个低假设约束的仿真框架,用于精确建模N元交叉阵列的MVM行为,并系统分析权重量化误差、系统非理想性(如电阻状态偏移)及单元级随机噪声对分类性能的影响;研究发现,尽管权重量化是主要误差源,但通过合理设计每个单元的状态数可在量化误差与电阻状态分辨率之间取得平衡,从而最小化总MVM误差,同时结合主成分分析(PCA)降维进一步缩小软硬件性能差距,最终在MNIST任务上实现了94.48%的准确率(接近软件基准97.56%)。
链接: https://arxiv.org/abs/2604.26979
作者: Anatole Moureaux,Anthony Lopes Temporao,Flavio Abreu Araujo
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 24 pages, 7 figures, 3 tables
Abstract:In-memory computing (IMC) enables energy-efficient neural network inference by computing analog matrix-vector multiplications (MVM) in memory crossbar arrays. In this work we present a simulation framework for N-ary crossbar architectures that retrieves MVM results with minimal implementation assumptions. The XOR and MNIST classification tasks were successfully inferred using a simulated crossbar array of (4x4) 4-states magnetic tunnel junctions (MTJ). MNIST accuracy reached 94.48% (vs. 97.56% software baseline). The software-hardware performance gap was further reduced using PCA dimensionality reduction. We identified weight quantization as the primary error source, and studied its impact alongside systematic nonidealities and random noise. We find that cell-specific random noise is less detrimental than systematic errors due to averaging across the array. Finally, we demonstrate an optimal number of states per cell that balances quantization error against resistance state resolution to minimize total MVM error.
[AI-126] Defeasible Conditional Obligation in a Two-tiered Preference-based Semantics (Extended Version) KR2926
【速读】:该论文旨在解决可废止条件义务(defeasible conditional obligations)的建模问题,特别是针对Horty提出的一个关键关切:如何在规范推理中处理新信息引入时对已有义务的撤销机制。其解决方案的关键在于构建一个双层偏好语义框架(two-tiered, preference-based semantic framework),通过引入两个独立的世界排序——理想性(ideality)和规范性(normality)——分别用于刻画义务的优先级与非单调推理过程。该框架扩展了Hansson-Lewis风格的二元道义逻辑偏好语义,并嵌入非单调推理机制,使得先前推导出的义务能够在出现冲突信息时被合理撤回,同时满足诸如前件强化、包含性和无淹没(no-drowning)等核心合理性准则,并与约束型输入/输出逻辑(constrained input/output logic)建立了联系,从而为规范推理提供了一个形式严谨且具有实际适用性的理论基础。
链接: https://arxiv.org/abs/2604.26977
作者: Xavier Parent
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 13 pages. Extended version of a paper to be presented at KR 2926
Abstract:In response to a concern raised by Horty, this paper develops a two-tiered, preference-based semantic framework for modeling defeasible conditional obligations. The paper extends a Hansson-Lewis style preference semantics for dyadic deontic logic by incorporating a nonmonotonic reasoning mechanism that enables previously derived obligations to be withdrawn when new, potentially conflicting information comes in. The account is bi-preferential: two orderings–ideality and normality–on worlds are employed to address shortcomings in earlier approaches, with a separate ranking method for each. At the nonmonotonic layer, a number of postulates are considered, including antecedent strengthening, inclusion and no-drowning. A connection is established with so-called constrained input/output (I/O) logic–an existing standard for normative reasoning based on a different methodology.
[AI-127] Fitting Horn DL Ontologies to ABox and Query Examples: A Tale of Simulation Quantifiers and Finite Models KR2026
【速读】:该论文旨在解决在描述逻辑(Description Logic, DL)框架下,如何为给定的正例和负例(以ABox和布尔查询的形式表示)构造一个符合要求的本体(ontology),即“拟合”问题。研究对象聚焦于Horn型描述逻辑EL、ELI及其包含bottom概念(⊥)的扩展,查询语言涵盖原子查询(Atomic Queries, AQs)、根连合查询(Rooted Conjunctive Queries, CQs)及它们的并集(Rooted Union of Conjunctive Queries, UCQs)。解决方案的关键在于利用**模拟关系(simulations)**对拟合本体的存在性进行刻画,并基于此构建决策程序,从而精确确定了不同查询类型下的计算复杂度:对于AQs,问题在EL和ELI中均为PTime可解;对于根CQs和UCQs,EL为Σ₂^P-完全,ELI为ExpTime-完全;加入bottom概念不改变上述复杂度。值得注意的是,相较于更复杂的ALC和ALCI系统,EL和ELI虽然语义更简单,但其拟合问题反而引入了额外的技术挑战。
链接: https://arxiv.org/abs/2604.26976
作者: Marvin Grosser,Carsten Lutz
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Submitted to the 23rd International Conference on Principles of Knowledge Representation and Reasoning (KR2026)
Abstract:We study the problem of fitting a description logic (DL) ontology to a given set of positive and negative examples that take the form of an ABox and a Boolean query. While previous work has investigated this problem for the expressive DLs ALC and ALCI, we here focus on the Horn DLs EL and ELI, as well as their extensions with the bottom concept. As the query language, we consider atomic queries (AQs), conjunctive queries (rooted CQs), and unions thereof (rooted UCQs). We provide characterization of the existence of a fitting ontology based on simulations, use them to develop decision procedures, and clarify the exact computational complexity. For AQs, the problem is in PTime for both EL and ELI. For rooted CQs and UCQ, it is Sigma_P^2-complete for EL and ExpTime-complete for ELI. Adding the bottom concept does not change any of these complexities. Interestingly, moving from ALC and ALCI to EL and ELI introduces additional technical challenges rather than simplifying the matter.
[AI-128] Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
【速读】:该论文旨在解决大规模GPU推理服务中键值(Key-Value, KV)缓存内存管理所导致的吞吐量和成本效率瓶颈问题。当前系统存在三大 inefficiencies:(1)缺乏统一的KV缓存大小配置,尤其对多头潜在注意力(Multi-Head Latent Attention, MLA)支持不足,造成高达57倍的内存过度分配;(2)KV缓存局限于单一内存层级(GPU高带宽内存HBM),未利用包括CPU DRAM、CXL连接内存、NVMe通过GPUDirect Storage、RDMA网络及并行文件系统在内的六级内存层次结构;(3)采用反应式淘汰策略,丢弃可重用状态,引发冗余计算。解决方案的关键在于提出一个统一架构:首先,设计基于注意力类型感知的缓存大小计算引擎,实现每种注意力机制的精确内存需求估算,最高提升7.4倍批处理规模;其次,构建六级内存层次结构,将单节点有效KV缓存容量从40 GB扩展至超过38 TB,同时保持热数据子毫秒级首次响应时间(Time-to-First-Token, TTFT);最后,引入贝叶斯复用预测器(基于Beta共轭先验建模16种块类型与转换类型组合),结合EMA评分的头部粒度淘汰策略和RoPE感知预取技术,实现70–84%的缓存命中率。组件级验证在ShareGPT、LMSYS-Chat-1M和代理工作负载上的trace回放中证实了高命中率,且结合硬件规格的分析投影显示,相比最先进基线,TTFT降低1.4–2.1倍、吞吐量提升1.7–2.9倍、成本下降47%。
链接: https://arxiv.org/abs/2604.26968
作者: Sanjeev Rao Ganjihal
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注: 9 pages, 9 tables, 1 figure. Under review at a systems conference
Abstract:Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unified KV cache sizing across all attention architectures–particularly multi-head latent attention (MLA), which is unsupported in general-purpose frameworks, resulting in up to 57x memory over-provisioning; (2) confinement of KV cache to a single memory tier (GPU HBM) despite the availability of a rich hierarchy spanning CPU DRAM, CXL-attached memory, NVMe via GPUDirect Storage, RDMA fabric, and parallel filesystems; and (3) reactive eviction policies that discard reusable state, forcing redundant recomputation. We present a unified system that addresses all three problems. Our architecture-variant-aware sizing engine computes exact memory requirements per attention type, enabling up to 7.4x higher batch sizes. A six-tier memory hierarchy extends effective KV cache capacity from 40 GB to over 38 TB per node while maintaining sub-millisecond time-to-first-token (TTFT) for hot entries. A Bayesian reuse predictor with Beta conjugate priors over 16 (block-type, transition-type) pairs achieves 70-84% cache hit rates, combined with EMA-scored head-granular eviction and RoPE-aware prefetching. Component-level validation on trace replay using ShareGPT, LMSYS-Chat-1M, and agentic workloads demonstrates 70-84% cache hit rates. Analytical projections combining validated component behavior with published hardware specifications indicate 1.4-2.1x projected TTFT reduction, 1.7-2.9x throughput improvement, and 47% cost reduction compared to state-of-the-art baselines.
[AI-129] he Impact of AI-Generated Text on the Internet
【速读】:该论文旨在解决互联网上生成式 AI (Generative AI) 文本比例及其对语义多样性、事实准确性与风格多样性等指标影响的量化问题,这一问题此前因缺乏对 AI 生成或辅助内容占比的系统性认知而难以回答。其解决方案的关键在于构建了 2022 至 2025 年间具有代表性的网站样本(基于 Internet Archive),并应用先进的 AI 文本检测工具进行分析,从而首次提供了 AI 内容在互联网中增长趋势的实证数据,并验证了部分假设(如 AI 文本增加与语义多样性下降、积极情绪增强显著相关),同时揭示了公众认知与实证结果之间的显著偏差。
链接: https://arxiv.org/abs/2604.26965
作者: Jonas Dolezal,Sawood Alam,Mark Graham,Maty Bohacek
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:The proliferation of AI-generated and AI-assisted text on the internet is feared to contribute to a degradation in semantic and stylistic diversity, factual accuracy, and other negative developments (sometimes subsumed under the Dead Internet Theory). What has hindered answering these questions is that it has not been understood just how much of the internet is actually AI-generated or AI-edited. To this end, we construct a representative sample of websites published on the internet between 2022 and 2025 using the Internet Archive, and apply a state-of-the-art AI text detector on them. We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT’s launch in late 2022. We also find statistically significant evidence for some of the identified hypotheses; for example, that increases in AI-generated text on the internet correlate negatively with semantic diversity and positively with the prevalence of positive sentiment. We do not, however, find statistically significant evidence supporting the hypothesis that an increased rate of AI-generated text on the internet decreases factual accuracy or stylistic diversity. Notably, this diverges from public perception, which we measure in a user study, where the majority of US adults turned out to believe in all four of the above-mentioned hypotheses. Individuals who do not use AI or use it infrequently tend to believe in these negative impacts more than those who use it frequently; similarly, individuals who hold negative views of AI tend to believe in these hypotheses more than those with favorable views of the technology.
[AI-130] Learning-to-Explain through 20Q Gaming: An Explainable Recommender for Cybersecurity Education
【速读】:该论文旨在解决传统网络安全培训方法在应对日益复杂的网络威胁时缺乏直观性和自适应性的问题,尤其体现在学习者难以理解安全决策背后的逻辑与证据链。其解决方案的关键在于提出一种基于可解释人工智能(Explainable AI, XAI)的教育框架——“通过Q20游戏学习解释网络安全”(Learning to Explain Cybersecurity with Q20 Game),并设计了一个新颖的游戏化推荐系统:可解释的Q20网络安全推荐器(Explainable Q20 Cybersecurity Recommender, EQ-20CR)。该系统利用基于策略的强化学习(Policy-Based Reinforcement Learning, RL)代理,将“为何执行此缓解措施?”这一问题转化为一个20个问题(Q20)的游戏机制,主动向用户提问直至识别出最小必要证据集,并生成简洁的对话轨迹来解释推荐的安全教育内容,从而实现个性化、交互式且具备推理透明度的网络安全认知训练。
链接: https://arxiv.org/abs/2604.26964
作者: Mary Nusrat,Sarfuddin Bhuiyan,Gahangir Hossain
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The growing sophistication of contemporary cyber threats necessitates a more effective and adaptive approach to cybersecurity training. Intuitive and adaptive approaches to learning, which are often required, are not provided in traditional learning methods. In this article, we present a new educational framework, “Learning to Explain Cybersecurity with Q20 Game”, based on explainable AI (XAI), an educational game to enhance interactivity in learning. We propose a novel, game-inspired framework - the Explainable Q20 Cybersecurity Recommender (EQ-20CR), that learns to elicit the minimal set of evidential facts needed to justify cybersecurity defensive action. By casting “Why should I execute this mitigation?” as a 20 questions (Q20) game, a policy-based reinforcement-learning (RL) agent actively queries an environment until it can both (i) recommend the optimal security education and (ii) explain that decision with a concise dialogue trace. The article draws from “Playing 20 Question Game with Policy-Based Reinforcement Learning” [1] and “Learning-to-Explain: Recommendation Reason Determination through Q20 Gaming” [2]. The framework uses a policy-based reinforcement learning (RL) agent that leads the user through a sequence of questions to recognize and articulate a targeted cybersecurity concept, attack vector, or defense strategy. Furthermore, users are gradually exposed to informative questions by the system, revealing complicated, structured way at an adaptive difficulty level. In this paper, we design the architecture, its application to various concepts of cybersecurity through illustrative case studies, and its transformative potential on the training and awareness of cybersecurity recommendations.
[AI-131] Static Program Slicing Using Language Models With Dataflow-Aware Pretraining and Constrained Decoding ACL2026
【速读】:该论文旨在解决基于语言模型(Language Models, LMs)的静态程序切片(Static Program Slicing)方法中存在的两个核心问题:一是依赖建模不准确,即语言模型难以捕捉精确的数据流关系;二是生成过程缺乏约束,导致输出切片中出现幻觉(hallucination)token和语句。解决方案的关键在于提出Sliceformer,其创新性地将静态程序切片重构为序列到序列任务,并引入两项关键技术:首先,设计基于数据流图(Data Flow Graph, DFG)的预训练目标,通过数据流保持的语句重排和数据流感知的跨度掩码机制增强模型对数据依赖的理解;其次,开发一种联合词汇和语法约束的受限解码机制,有效抑制幻觉内容的生成。实验表明,Sliceformer在Java和Python基准上均显著优于现有最优方法,ExactMatch指标提升最高达22%。
链接: https://arxiv.org/abs/2604.26961
作者: Pengfei He,Shaowei Wang,Tse-Hsun(Peter)Chen,Muhammad Asaduzzaman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Accepted at ACL 2026
Abstract:Static program slicing is a fundamental software engineering technique for isolating code relevant to specific variables. While recent learning-based approaches using language models (LMs) show promise in automating slice prediction, they suffer from inaccurate dependency modeling and unconstrained generation, where LMs fail to capture precise data flow relations and produce slices containing hallucinated tokens and statements. To address these challenges, we propose Sliceformer, a novel approach that reformulates static program slicing as a sequence-to-sequence task using small language models such as CodeT5+. Sliceformer introduces two key innovations that directly target the identified limitations. First, to improve dependency modeling, we design dataflow-aware pretraining objectives that leverage data flow graphs (DFG) to teach models data dependencies through dataflow-preserving statement permutation and dataflow-aware span corruption. Second, to eliminate hallucination, we develop a constrained decoding mechanism that enforces both lexical and syntactic constraints. We evaluate Sliceformer on Java and Python program slicing benchmarks, demonstrating consistent improvements over state-of-the-art baselines with up to 22% gain in ExactMatch.
[AI-132] LLM Biases
【速读】:该论文旨在解决基于Transformer的生成式推荐代理(generative recommender agents)在大规模部署中可能引入系统性偏差或扭曲的问题,即这些模型虽然在性能上表现优异,但其内部机制可能导致用户选择和曝光的非公平分布。解决方案的关键在于通过理论分析识别出四种机制层面的偏差通道:位置偏差(positional bias)、流行度放大(popularity amplification)、潜在驱动偏差(latent driver bias)以及合成数据偏差(synthetic data bias),揭示了传统离线评估指标难以捕捉的可靠性风险,并建议管理者将这些偏差视为运营风险因素进行持续监控,而非仅依赖性能提升来判断系统可靠性。
链接: https://arxiv.org/abs/2604.26960
作者: Jinhui Han,Ming Hu,Xilin Zhang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer-based agentic AI is rapidly being deployed on major platforms to help users shop, watch, and navigate content with less effort. While these systems can deliver impressive performance, a key concern is whether they may be less reliable than they appear. We ask a simple but fundamental question: whether the mechanisms that make transformer-based agents effective can also induce systematic biases or distortions? We study this question through a theoretical analysis of transformer-based generative recommenders, in which the next user interaction is generated sequentially from the user history. Focusing on how the model allocates attention across historical evidence, we identify four bias channels: (i) Positional bias: stronger positional encoding shifts influence toward recent history, improving responsiveness but potentially reducing stability and long-term diversity; (ii) Popularity amplification: small frequency differences in data can be magnified into disproportionate exposure, contributing to Matthew effects and echo chambers; (iii) Latent driver bias: when important drivers of user choices are not directly observed, the model can place overly concentrated weight on a small subset of past events, creating overconfident attributions. (iv) Synthetic data bias: when users increasingly follow AI suggestions and platforms retrain on model-shaped synthetic logs, outputs can concentrate over time, and long-tail alternatives can disappear first. Our analysis highlights mechanism-level reliability risks that may not be visible in offline performance metrics. The four bias channels indicate that large-scale deployment may systematically distort exposure and choice. For managers, the immediate implication is to treat these as operational risk factors and to monitor concentration and drift over time, rather than assuming that performance gains alone guarantee reliability.
[AI-133] Designing Ethical Learning for Agent ic AI: Toegye Yi Hwangs Ethical Emotion Regulation Framework
【速读】:该论文旨在解决自主型智能体(Agentic AI)在学习环境中因具备自主目标设定与主动干预能力,而引发的道德情感调节难题。现有框架多将情绪视为反应性反馈或参与度优化工具,忽视了在自主决策过程中对道德情感进行规范性调控的需求。其解决方案的关键在于提出一种基于李滉(Toegye Yi Hwang)道德情感哲学的伦理情感调节框架(Ethical Emotion Regulation Framework),并构建五阶段架构——“伦理情感反馈系统”(EEFS),该系统与智能体行为循环相匹配,明确各阶段的设计原则与应用场景,并配套开发了“EEFS评估工具”,以实现对Agentic AI系统中道德情感一致性的系统化评估。
链接: https://arxiv.org/abs/2604.26958
作者: Ji Yeon Kim
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic AI systems capable of autonomous goal setting and proactive intervention introduce new challenges for regulating moral-emotional processes in learning environments. Existing frameworks typically treat emotion as reactive feedback or engagement optimization, overlooking the need for normative regulation across autonomous decision this http URL paper proposes an ethical emotion regulation framework for agentic AI learning design inspired by Toegye Yi Hwang’s moral-emotional philosophy. The Ethical Emotion Feedback System (EEFS) is reconstructed as a five-stage architecture aligned with agentic cycles, articulating stage-specific design principles and scenario this http URL EEFS Evaluation Instrument is introduced to enable systematic assessment of moral-emotional alignment in agentic AI systems.
[AI-134] Simulating Validity: Modal Decoupling in MLLM Generated Feedback on Science Drawings
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成学生手绘科学模型反馈时存在的“接地失败”(grounding failure)问题,即模型输出虽形式上符合教学逻辑,但与学生实际绘制内容不一致,表现为对象错位、属性错误、关系错误或虚假缺失等。研究通过分析150份中学生关于分子动能理论的绘画作品及对应300条GPT-5.1生成的反馈,发现41.3%的反馈存在至少一种接地错误,其中“虚假缺失”是最主要类型;尽管采用“清单优先”(inventory-list-first)的工作流程可减少部分错误,但无法根本消除问题,表明当前提示策略不足以实现有效视觉证据对齐。因此,解决方案的关键在于开发超越常规提示技术的新型接地机制,以确保反馈内容严格基于学生手绘图像的具体视觉特征。
链接: https://arxiv.org/abs/2604.26957
作者: Arne Bewersdorff,Nejla Yuruk,Xiaoming Zhai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted as AIED Short Paper 2026, Seoul, South Korea. Submission #1147. This is the long paper version
Abstract:In science education, students frequently construct hand-drawn visual models of scientific phenomena. These drawings rely on a visual structure where information is encoded through visual objects, their attributes, and relationships. Multimodal large language models (MLLMs) are increasingly used to generate feedback on students’ hand-drawn scientific models. However, the validity of such feedback depends on whether model claims are grounded in the specific visual evidence of the student drawing. This study uncovers grounding failures, consistent with modal decoupling, in off-the-shelf MLLM feedback, where outputs remain pedagogically plausible in form while contradicting the drawing or treating depicted elements as missing. Using N = 150 middle school drawings from a kinetic molecular theory unit spanning five modeling tasks and three competence levels, we generated N = 300 feedback instances with GPT-5.1. All outputs were coded for four grounding error types: object mismatch, attribute mismatch, relation mismatch, and false absence. Grounding failures were common: 41.3% of feedback instances contained at least one error. An inventory-list-first workflow reduced several error categories and lowered the overall error rate, but it did not resolve the underlying limitation: approximately one in three outputs remained flawed, with false absence as the dominant failure mode. Moreover, feedback that appears visually grounded offered little diagnostic value for identifying invalid instances. The findings indicate that modal decoupling is a substantial limitation and that valid feedback will require grounding mechanisms beyond common prompting strategies.
[AI-135] Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories
【速读】:该论文旨在解决生成式 AI (Generative AI) 辅助教学系统在工程实验课程中面临的“辅助过度”与“学习机会保留”之间的矛盾问题,即如何在提供有效帮助的同时保障学生的学习挑战性与自主探索空间。其解决方案的关键在于构建一个双组件的路由与治理系统:Routiium 作为兼容 OpenAI 的网关,实现多大语言模型(Large Language Model, LLM)后端的灵活管理、提示词配置和使用日志记录;EduRouter 作为策略感知的路由服务,通过基于嵌入(embedding-based)的问题匹配机制、实验室预算控制、审批流程以及任务意图识别,动态分配最优模型资源。实验证明,该方案显著提升了教学挑战一致性(challenge-alignment index)和指导贴合度(overlay-adherence score),同时延长了“有效挣扎窗口”(productive-struggle window),并在实际查询回放中实现高达 66% 的 Token 成本降低,且保持 100% 的标准答案命中率。
链接: https://arxiv.org/abs/2604.26955
作者: Emmanuel A. Olowe,Danial Chitnis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: IEEE EduCon
Abstract:AI tutoring systems in engineering labs face a tension between providing sufficient assistance and preserving learning opportunities. Existing systems typically offer instructors limited control over assistance timing, content, or cost. This paper describes a routing and governance system for LLM-based lab assistance comprising two components: Routiium, an OpenAI-compatible gateway that manages multiple LLM backends with configurable prompt modifications and usage logging, and EduRouter, a policy-aware routing service that enforces per-lab budgets, approval workflows, and embedding-based question matching. We evaluated the system using trace-driven simulation calibrated from two engineering labs (LED characterization, RC circuit analysis) and a 100-query replay through live models. In simulations, governed policies (P1/P2) increased challenge-alignment index from 0.90 to 0.98 and overlay-adherence score from 0.69 to 0.87 compared to ungoverned operation (P0). The productive-struggle window metric increased from 1.4 to 3.6 simulated turns before high-scaffold hints appeared. In the 100-query replay, EduRouter routed 75% of queries to a local model, reducing token costs by 66% ( 0.087 vs. 0.26 for all-premium routing) while maintaining canonical hit rate of 1.0 for the curated 89-intent question bank. We release Routiium, EduRouter, canonical-task tooling, and simulator configurations to support replication and future classroom studies.
[AI-136] he Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost
【速读】:该论文旨在解决如何在使用大语言模型(Large Language Models, LLMs)进行自动化评分时,实现高准确率与计算效率之间的最优平衡。其核心问题在于:相较于传统的模型集成(ensembling)方法,是否可以通过更精细的模型选择策略和推理设置来提升评分性能。研究发现,关键解决方案在于采用战略性的模型选择(如选用Gemini 3.1 Pro Preview在低推理强度下获得最高精度)和推理努力水平的优化调整(如增加推理步数显著提升准确性),而非单纯扩大集成规模;同时,GPT-5.4 Nano与Mini在无额外推理的情况下实现了最佳的成本-性能权衡,表明推理强度与模型类型需协同配置以最大化效率。
链接: https://arxiv.org/abs/2604.26954
作者: Scott Frohn
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 14 pages, 10 tables, 2 figures. Presented at the 2026 National Council on Measurement in Education (NCME) Annual Meeting, April 11, 2026, Los Angeles, CA
Abstract:Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-scored ground truths using frontier and low-cost models from OpenAI and Google. Temperature sampling significantly improved accuracy over deterministic calls, but increasing ensemble size (j = 1 to 7) produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning offered the best cost-performance balance.
[AI-137] Agent ic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的网页代理在执行重复性任务时面临的“重跑危机”(Rerun Crisis)问题,即随着执行频率增加,推理成本(token消耗和API延迟)呈线性增长,导致经济上不可持续。其解决方案的关键在于提出一种“编译-执行”(Compile-and-Execute)架构:通过一次性的LLM调用,将浏览器DOM状态转换为轻量级语义表示,并生成确定性的JSON工作流蓝图;随后由轻量级运行时直接驱动浏览器执行,无需进一步模型查询。该方法将推理复杂度从O(M × N)(M为重跑次数,N为动作序列长度)降低至摊销后的O(1),显著提升可扩展性和经济可行性。
链接: https://arxiv.org/abs/2604.09718
作者: Jagadeesh Chundru
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 12 pages, 4 figures, 2 tables. v2: Expanded literature review and clarified architecture limitations
Abstract:LLM-driven web agents operating through continuous inference loops – repeatedly querying a model to evaluate browser state and select actions – exhibit a fundamental scalability constraint for repetitive tasks. We characterize this as the Rerun Crisis: the linear growth of token expenditure and API latency relative to execution frequency. For a 5-step workflow over 500 iterations, a continuous agent incurs approximately 150.00 USD in inference costs; even with aggressive caching, this remains near 15.00 USD. We propose a Compile-and-Execute architecture that decouples LLM reasoning from browser execution, reducing per-workflow inference cost to under 0.10 USD. A one-shot LLM invocation processes a token-efficient semantic representation from a DOM Sanitization Module (DSM) and emits a deterministic JSON workflow blueprint. A lightweight runtime then drives the browser without further model queries. We formalize this cost reduction from O(M x N) to amortized O(1) inference scaling, where M is the number of reruns and N is the sequential actions. Empirical evaluation across data extraction, form filling, and fingerprinting tasks yields zero-shot compilation success rates of 80-94%. Crucially, the modularity of the JSON intermediate representation allows minimal Human-in-the-Loop (HITL) patching to elevate execution reliability to near-100%. At per-compilation costs between 0.002 USD and 0.092 USD across five frontier models, these results establish deterministic compilation as a paradigm enabling economically viable automation at scales previously infeasible under continuous architectures.
[AI-138] VibroML: an automated toolkit for high-throughput vibrational analysis and dynamic instability remediation of crystalline materials using machine-learned potentials
【速读】:该论文旨在解决材料计算中动态不稳定性(dynamical instability)的自动化修复问题,即在机器学习势(machine-learned interatomic potentials, MLIPs)加速声子谱计算后,如何自动识别并消除不稳定结构以获得物理上可行的稳定相。其解决方案的关键在于提出VibroML——一个基于基础MLIP驱动的开源Python工具包,采用能量引导的遗传算法(energy-guided genetic algorithm)替代传统软模追踪方法,高效探索势能面以发现多种动态稳定的多晶型结构;同时结合ProtoCSP组合结构预测引擎,通过靶向合金化稳定受挫晶体拓扑结构(如Cs₂KInI₆和KTaSe₃),实现从结构修复、热力学验证到系统性成分筛选的全流程自动化,从而突破标准高通量流程的局限,生成具有物理合理性的结构候选。
链接: https://arxiv.org/abs/2604.27685
作者: Rogério Almeida Gouvêa,Gian-Marco Rignanese
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注:
Abstract:While machine-learned interatomic potentials (MLIPs) accelerate phonon dispersion calculations, merely identifying dynamical instabilities in computationally predicted materials is insufficient; automated pathways to resolve them are required. We introduce VibroML, an open-source Python toolkit driven by foundational MLIPs that shifts the paradigm from stability verification to automated structural remediation. VibroML employs an energy-guided genetic algorithm that vastly outperforms traditional soft-mode following, efficiently navigating the potential energy surface to uncover diverse, dynamically stable polymorphs. As 0 K harmonic stability does not guarantee macroscopic viability, an automated molecular dynamics workflow evaluates finite-temperature structural retention. VibroML also couples with ProtoCSP, our combinatorial structure prediction engine, to stabilize frustrated crystal topologies via targeted alloying, successfully rescuing functional perovskite networks like Cs _2 KInI _6 and KTaSe _3 . Demonstrating broader applicability, we mined the Alexandria database – where ~50% of quaternary and 99.5% of quinary elemental combinations lack any structural entries – to identify thousands of abandoned, high-symmetry stoichiometries. Deploying ProtoCSP’s “cold start” retrieval and VibroML’s evolutionary search on a sample, we successfully identified dynamically stable low-symmetry candidates. Through integrated structural remediation, thermal validation, and systematic compositional exploration, VibroML enables a comprehensive deep-screening approach, yielding physically sound structural propositions that far surpass standard high-throughput workflows.
[AI-139] Sampler-Robust Optimization under Generative Models
【速读】:该论文旨在解决现代随机优化流程中因生成式模型(Generative Model)不确定性表示不准确而导致的决策可靠性问题。当前方法通常依赖蒙特卡洛场景评估下游决策,但这种做法将不确定性操作对象从显式的概率分布转移到了学习得到的生成器所诱导的采样器(Sampler),从而引入两类误差:采样器误设误差(sampler misspecification)和有限模拟误差(finite-simulation error)。解决方案的关键在于提出采样器鲁棒优化(Sampler-Robust Optimization, SRO),其核心思想是通过扰动已学习的生成器来构造最坏情况下的采样器,并在此基础上优化决策;这一“采样器优先”的建模框架与基于仿真的决策流程一致,并具备尖锐性感知特性——即偏好在生成器扰动下性能稳定的决策,而非仅在名义采样器下表现良好。理论分析表明,在覆盖假设下,经验最坏情况目标可提供真实总体目标的高概率上界,且有限模拟误差部分被用于抵御采样器误设风险的鲁棒化过程所吸收。该框架适用于具有或不具显式密度的生成模型,并支持高效的极小极大求解策略。
链接: https://arxiv.org/abs/2604.27447
作者: Ziwei Zhang,Jonathan Yu-Meng Li
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Portfolio Management (q-fin.PM); Risk Management (q-fin.RM)
备注:
Abstract:Modern stochastic optimization pipelines increasingly rely on learned generative models to represent uncertainty, while downstream decisions are evaluated almost entirely through Monte Carlo scenarios. This shifts the operational object of uncertainty from an explicit probability law to the sampler induced by the learned generator. Reliability therefore depends on two errors: sampler misspecification and finite-simulation error. We propose Sampler-Robust Optimization (SRO), which optimizes decisions against the worst-case sampler induced by perturbing the learned generator. This sampler-first formulation aligns with simulation-based decision pipelines and admits a sharpness-aware interpretation: it favors decisions whose performance is stable under generator perturbations, rather than merely under the nominal sampler. Under a coverage assumption, we show that the empirical worst-case objective provides a high-probability upper certificate for the true population objective, with finite-simulation error partially absorbed by the robustification used to guard against sampler misspecification. The framework accommodates generative models with or without explicit densities and admits efficient minimax procedures. Portfolio-optimization experiments show that SRO produces more stable decisions and improves out-of-sample performance under distribution shift.
[AI-140] owards Accelerated SCF Workflows with Equivariant Density-Matrix Learning and Analytic Refinement
【速读】:该论文旨在解决自洽场(SCF)计算中初始猜测质量低导致收敛缓慢的问题,尤其在分子体系电子结构计算中,传统初始化方法常需大量迭代才能达到收敛。解决方案的关键在于提出一种物理约束的等变神经网络模型 \textsc{dm-PhiSNet},其直接从分子几何结构预测原子轨道(AO)基组下的单电子约化密度矩阵(1-RDM),并通过两阶段训练引入物理驱动的目标函数,在此基础上设计一个轻量级解析模块对预测结果进行精修:该模块强制满足电子数守恒、使1-RDM在AO度量下趋近广义幂等性(generalized idempotency),并正则化Löwdin正交化后的占据谱。此方法显著减少了SCF迭代次数(49–81%),且无需力监督即可获得高精度的一次性总能和Hellmann–Feynman原子力,证明了模型对化学电子结构的有效捕捉。
链接: https://arxiv.org/abs/2604.27256
作者: Zuriel Y. Yescas-Ramos,Andrés Álvarez-García,Huziel E. Sauceda
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
备注: 7 pages, 3 figures
Abstract:We present \textscdm-PhiSNet, a physically constrained \textscPhiSNet-based equivariant model that predicts one-electron reduced density matrices (1-RDMs) directly from molecular geometries in an atomic-orbital (AO) basis for accelerated self-consistent field (SCF) workflows. Training follows a two-stage schedule with progressively introduced physically motivated objectives, and the resulting predictions are refined by a lightweight analytic block. This block enforces electron-number conservation, drives the 1-RDM toward generalized idempotency in the AO metric, and regularizes the occupation spectrum of the Löwdin-orthogonalized density. Across six closed-shell systems – H _2 O, CH _4 , NH _3 , HF, ethanol, and NO _3^- – the refined 1-RDMs provide SCF initial guesses that substantially reduce iteration steps by 49–81% relative to standard initializations. Beyond SCF acceleration, the learned 1-RDMs yield accurate one-shot total energies and Hellmann–Feynman atomic forces without force supervision, indicating that the model captures chemically meaningful electronic structure. These results demonstrate that combining equivariant learning with analytic constraint enforcement provides a simple, general route to solver-ready density-matrix initializations and accelerated SCF workflows.
[AI-141] Entropy-Dominated Temporal Vocal Dynamics as Digital Biomarkers for Depression Detection
【速读】:该论文旨在解决自动化抑郁症检测中因静态聚合对话信号而导致临床行为动态信息被掩盖的问题。其解决方案的关键在于引入基于熵(entropy)驱动的时序生物标志物,通过分析话语层面的声学轨迹动态特性,而非仅依赖传统的平均特征聚合。实验表明,熵生物标志物相较于静态池化基线显著提升了检测性能(AUC 0.646),且优于递归定量、耦合、样本熵和分形复杂度等其他时序特征,证明抑郁相关的信号更可能存在于对话动态的不确定性(熵)而非平均声学水平中,从而支持了基于时间维度的数字表型在心理健康评估中的应用价值。
链接: https://arxiv.org/abs/2604.26998
作者: Himadri S Samanta
机构: 未知
类目: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages
Abstract:Automated depression detection often relies on static aggregation of conversational signals, potentially obscuring clinically meaningful behavioral dynamics. We investigated whether entropy-driven temporal biomarkers improve depression detection beyond standard pooled features using the DAIC-WOZ corpus. Using 142 labeled participants, we reconstructed utterance-level acoustic trajectories and compared pooled temporal baselines, trajectory dynamics, Shannon entropy biomarkers, recurrence quantification, sample entropy, fractal complexity, and coupling biomarkers under leakage-aware validation. Static pooling achieved an AUC of 0.593, trajectory dynamics improved performance to 0.637, and entropy biomarkers produced the strongest statistically significant improvement over pooled baselines (AUC 0.646; nested cross-validated AUC 0.615; permutation p = 0.017). Entropy biomarkers outperformed recurrence, coupling, sample entropy, and fractalbased features, with several biomarkers stable across folds. These findings suggest depression-related signal may lie less in average acoustic levels than in entropy of conversational dynamics, supporting temporally informed digital phenotypes for mental-health assessment.
机器学习
[LG-0] An adaptive wavelet-based PINN for problems with localized high-magnitude source
链接: https://arxiv.org/abs/2604.28180
作者: Himanshu Pandey,Ratikanta Behera
类目: Machine Learning (cs.LG)
*备注:
Abstract:In recent years, physics-informed neural networks (PINNs) have gained significant attention for solving differential equations, although they suffer from two fundamental limitations, namely, spectral bias inherent in neural networks and loss imbalance arising from multiscale phenomena. This paper proposes an adaptive wavelet-based PINN (AW-PINN) to address the extreme loss imbalance characteristic of problems with localized high-magnitude source terms. Such problems frequently arise in various physical applications, such as thermal processing, electro-magnetics, impact mechanics, and fluid dynamics involving localized forcing. The proposed framework dynamically adjusts the wavelet basis function based on residual and supervised loss. This adaptive nature makes AW-PINN handle problems with high-scale features effectively without being memory-intensive. Additionally, AW-PINN does not rely on automatic differentiation to obtain derivatives involved in the loss function, which accelerates the training process. The method operates in two stages, an initial short pre-training phase with fixed bases to select physically relevant wavelet families, followed by an adaptive refinement that adapts scales and translations without populating high-resolution bases across entire domains. Theoretically, we show that under certain assumptions, AW-PINN admits a Gaussian process limit and derive its associated NTK structure. We evaluate AW-PINN on several challenging PDEs featuring localized high-magnitude source terms with extreme loss imbalances having ratios up to 10^10:1 . Across these PDEs, including transient heat conduction, highly localized Poisson problems, oscillatory flow equations, and Maxwell equations with a point charge source, AW-PINN consistently outperforms existing methods in its class.
[LG-1] Strait: Perceiving Priority and Interference in ML Inference Serving
链接: https://arxiv.org/abs/2604.28175
作者: Haidong Zhao,Nikolaos Georgantas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under concurrent execution may restrict their applicability in on-premises scenarios. We present \emphStrait, a serving system designed to enhance deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, Strait models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points while incurring acceptable costs on low-priority tasks. Compared to software-defined preemption approaches, Strait also exhibits more equitable performance.
[LG-2] Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models
链接: https://arxiv.org/abs/2604.28149
作者: Matthias Hertel,Alexandra Nikoltchovska,Sebastian Pütz,Ralf Mikut,Benjamin Schäfer,Veit Hagenmeyer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time Series Foundation Models (TSFMs) have recently emerged as general-purpose forecasting models and show considerable potential for applications in energy systems. However, applications in critical infrastructure like power grids require transparency to ensure trust and reliability and cannot rely on pure black-box models. To enhance the transparency of TSFMs, we propose an efficient algorithm for computing Shapley Additive Explanations (SHAP) tailored to these models. The proposed approach leverages the flexibility of TSFMs with respect to input context length and provided covariates. This property enables efficient temporal and covariate masking (selectively withholding inputs), allowing for a scalable explanation of model predictions using SHAP. We evaluate two TSFMs - Chronos-2 and TabPFN-TS - on a day-ahead load forecasting task for a transmission system operator (TSO). In a zero-shot setting, both models achieve predictive performance competitive with a Transformer model trained specifically on multiple years of TSO data. The explanations obtained through our proposed approach align with established domain knowledge, particularly as the TSFMs appropriately use weather and calendar information for load prediction. Overall, we demonstrate that TSFMs can serve as transparent and reliable tools for operational energy forecasting.
[LG-3] Global Optimality for Constrained Exploration via Penalty Regularization
链接: https://arxiv.org/abs/2604.28144
作者: Florian Wolf,Ilyas Fatkhullin,Niao He
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is relatively well understood, real-world exploration is often constrained by safety, resource, or imitation requirements. This constrained setting is particularly challenging because entropy maximization lacks additive structure, rendering Bellman-equation-based methods inapplicable. Moreover, scalable approaches require policy parameterization, inducing non-convexity in both the objective and the constraints. To our knowledge, the only prior model-free policy-gradient approach for this setting under general policy parameterization is due to Ying et al. (2025). Unfortunately, their guarantees are limited to weak regret and ergodic averages, which do not imply that the final output is a single deployable policy that is near-optimal and nearly feasible. In this work we take a different approach to this problem, and propose Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints via quadratic-penalty regularization. PGP constructs pseudo-rewards that yield gradient estimates of the penalized objective, subsequently exploiting the classical Policy Gradient Theorem. We further establish the regularity of the penalized objective, providing the smoothness properties needed to justify the convergence of PGP. Leveraging hidden convexity and strong duality, we then establish global last-iterate convergence guarantees, attaining an \epsilon -optimal constrained entropy value with \epsilon bounded constraint violation despite policy-induced non-convexity. We validate PGP through ablations on a grid-world benchmark and further demonstrate scalability on two challenging continuous-control tasks.
[LG-4] Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
链接: https://arxiv.org/abs/2604.28109
作者: Junqi Gao,Dazhi Zhang,Zhichang Guo,Biqing Qi,Yi Ran,Wangmeng Zuo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model merging has attracted attention as an effective path toward multi-task adaptation by integrating knowledge from multiple task-specific models. Among existing approaches, dynamic merging mitigates performance degradation caused by conflicting parameter updates across tasks by flexibly combining task-specific parameters at inference time, thereby maintaining high performance. However, these methods require storing independent parameters for each task, resulting in prohibitive storage overhead. To address this issue, we first experimentally demonstrate that the fine-tuned weight increments (referred to as task vectors) exhibit an impulse-like activation pattern and high robustness to low-bit representations. Driven by this insight, we propose T-Switch, which decomposes task vectors into three compact components: a binary sparse mask, a sign vector, and a scalar scaling factor, achieving high-fidelity approximation at high compression ratios. We then introduce Auto-Switch, a training-free merging scheme that automatically composes task vectors via feature similarity retrieval. Building on this, we develop Auto-Switch, a training-free merging scheme that automatically assembles task vectors through feature similarity retrieval. Furthermore, to transform task vector sparsification and quantization from static rules to adaptive learning, we propose FlexSwitch, a learnable framework which jointly optimizes the compression strategy for each model unit via Learnable Gating Sparsification (LGS) and Bit-width Adaptive Selection (BAS), while employing the Sparsity-Aware Storage Strategy (SASS) to select the optimal storage encoding structure. Finally, by incorporating a K-Nearest Neighbor (KNN) inference scheme with a learnable low-rank metric, we present Auto-FlexSwitch, a dynamic model merging approach that supports highly efficient task vector compression.
[LG-5] Neural Aided Kalman Filtering for UAV State Estimation in Degraded Sensing Environments
链接: https://arxiv.org/abs/2604.28107
作者: Akhil Gupta,Erhan Guven
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate state estimation of nonlinear dynamical systems is fundamental to modern aerospace operations across air, sea, and space domains. Online tracking of adversarial unmanned aerial vehicles (UAVs) is especially challenging due to agile nonlinear motion, noisy and sparse sensor measurements, and unknown control inputs; conditions that violate key assumptions of classical Kalman filter variants and degrade estimation performance. Neural networks (NNs) can learn complex nonlinear relationships from data, but lack principled uncertainty quantification, which is critical for state estimation tasks where confidence bounds drive downstream decisions. We address this with Bayesian Neural Networks (BNNs), which model uncertainty through distributions over network weights and produce predictive means and uncertainties via Monte Carlo sampling. Building on this, we propose the Bayesian Neural Kalman Filter (BNKF): a hybrid framework coupling a trained BNN with a Kalman correction step for robust online UAV state estimation. Unlike related neural Kalman approaches, BNKF produces full state predictions and incorporates Bayesian uncertainty directly into covariance propagation, improving robustness under high noise conditions. We evaluate BNKF under varying radar noise levels and sampling rates using synthetic nonlinear UAV flight data. Five fold cross validation demonstrates that BNKF outperforms Extended and Unscented Kalman Filters in accuracy, precision, and truth containment under degraded sensing. An ensemble variant (BNKFe) further improves precision in high-noise edge cases at a slight accuracy tradeoff. Runtime analysis confirms minimal inference overhead, supporting real-time deployment feasibility.
[LG-6] FiLMMeD: Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing
链接: https://arxiv.org/abs/2604.28102
作者: Arthur Corrêa,Paulo Nascimento,Samuel Moniz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Solving practical multi-depot vehicle routing problems (MDVRP) is a challenging optimization task central to modern logistics, increasingly driven by e-commerce. To address the MDVRP’s computational complexity, neural-based combinatorial optimization methods offer a promising scalable alternative to traditional approaches. However, neural-based methods typically rely on rigid architectures and input encodings tailored to specific problem formulations. In real-world settings, heterogeneous constraints create multiple MDVRP variants, limiting the applicability of such models. While multi-task learning (MTL) has begun to accelerate the development of unified neural-based solvers, prior works focus almost exclusively on single-depot VRPs, leaving the MDVRP unaddressed. To bridge this gap, we propose Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing (FiLMMeD), a novel unified neural-based model for 24 different MDVRP variants. We introduce three main contributions: (1) to improve the model’s generalization, we augment the standard Transformer encoder with Feature-wise Linear Modulation (FiLM), which dynamically conditions learned internal representations based on the active set of constraints; (2) we provide an initial demonstration of Preference Optimization in the MTL setting, establishing it as a superior alternative to Reinforcement Learning for future MTL works; (3) to mitigate the generalization gap caused by the introduction of multi-depot constraints, we introduce a targeted curriculum learning strategy that progressively exposes the model to increasingly more complex constraint interactions. Extensive experiments on 24 MDVRP variants (including 8 novel formulations) and 16 single-depot VRPs confirm the effectiveness of FiLMMeD, which consistently outperforms state-of-the-art baselines. Our code is available at: this https URL
[LG-7] A Unified Framework of Hyperbolic Graph Representation Learning Methods
链接: https://arxiv.org/abs/2604.28070
作者: Sofía Pérez Casulo,Marcelo Fiori,Bernardo Marenco,Federico Larroca
类目: Machine Learning (cs.LG)
*备注: submitted
Abstract:Hyperbolic geometry has emerged as an effective latent space for representing complex networks, owing to its ability to capture hierarchical organization and heterogeneous connectivity patterns using low-dimensional embeddings. As a result, numerous hyperbolic graph representation learning methods have been proposed in recent years. However, their practical adoption and systematic comparison remain challenging, as implementations are fragmented and shared tools for reproducible and fair evaluation are lacking. In this work, we introduce a unified open-source framework for hyperbolic graph representation learning that integrates several widely used embedding methods under a common optimization interface. The novel framework enables consistent training, visualization, and evaluation of hyperbolic embeddings, and interfaces seamlessly with standard network analysis tools. Leveraging this unified setup, we conduct an experimental study of hyperbolic embedding methods on real-world networks, focusing on two canonical downstream tasks: link prediction and node classification. Beyond predictive accuracy, the study offers practical insights into the strengths and limitations of existing approaches, thereby facilitating informed method selection and fostering reproducible research in hyperbolic graph representation learning.
[LG-8] Early Detection of Water Stress by Plant Electrophysiology: Machine Learning for Irrigation Management
链接: https://arxiv.org/abs/2604.28038
作者: Eduard Buss,Till Aust,Heiko Hamann
类目: Machine Learning (cs.LG)
*备注:
Abstract:Purpose: Fast detection of plant stress is key to plant phenotyping, precision agriculture, and automated crop management. In particular, efficient irrigation management requires early identification of water stress to optimize resource use while maintaining crop performance. Direct physiological sensing offers the potential to detect stress responses before visible symptoms appear. Methods: In this study, we recorded electrophysiological signals from greenhouse-grown tomato plants subjected to water stress and developed a framework based on machine learning for online stress detection. The recorded time-series data were processed using a processing pipeline that includes statistical feature extraction and selection, automated machine learning or alternatively deep learning, and probability calibration. Results: Across multiple input time horizons, we found that a 30-minute look-back window strikes the best balance between rapid decision-making and classification performance. Using automated machine learning, the framework achieved classification accuracies of up to 92%, outperforming deep learning approaches. Sequential backward selection reduced the feature set while maintaining performance. Importantly, the framework detects transitions from healthy to stressed states in recordings that were not included in the training set. Conclusion: Overall, we provide a decision-support tool for farmers and establish a foundation for biofeedback-driven irrigation control to improve resource efficiency in (semi-)autonomous crop production systems.
[LG-9] Exponential families from a single KL identity
链接: https://arxiv.org/abs/2604.28036
作者: Marc Dymetman
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Exponential families encompass the distributions central to modern machine learning – softmax, Gaussians, and Boltzmann distributions – and underlie the theory of variational inference, entropy-regularized reinforcement learning, and RLHF. We isolate a simple identity for exponential families that expresses the KL difference \mathrmKL(q | p_\lambda_2) - \mathrmKL(q | p_\lambda_1) in terms of the log-partition function A(\lambda) and the moment \mu_q . Remarkably, this identity together with the single fact that \mathrmKL \geq 0 (with equality iff p = q ) suffices, by direct substitution and rearrangement, to derive a cluster of results that are classically obtained by separate, heavier arguments: a generalized three-point identity for arbitrary reference distributions, Pythagorean theorems for I-projections and reverse I-projections, convexity of the log-partition function, identification of its Legendre dual in KL terms, the Gibbs variational principle, and the explicit optimizer in KL-regularized reward maximization, including the exponential tilting formula underlying entropy-regularized control and RLHF. Beyond these purely algebraic consequences, standard analytic arguments recover the gradient formula for the log-partition function, the Bregman representation of within-family KL divergence, and the surjectivity of the moment map. The note is self-contained.
[LG-10] Shuffling-Aware Optimization for Private Vector Mean Estimation
链接: https://arxiv.org/abs/2604.28032
作者: Shun Takagi,Seng Pei Liew
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study d -dimensional unbiased mean estimation in the single-message shuffle model, where each user sends a single privatized message and the analyzer only observes the shuffled multiset of reports. While minimax-optimal mechanisms are well understood in the local differential privacy setting, the corresponding notion of optimality after shuffling has remained largely unexplored. To address this gap, we introduce the recently proposed shuffle index and use it to formulate the post-shuffling mechanism design problem as an explicit optimization problem. We then establish a minimax lower bound on the achievable mean squared error in terms of the shuffle index, which implies that mechanisms that are optimal under LDP can become suboptimal once shuffling is applied. Finally, we construct an asymptotically minimax optimal mechanism in the high privacy regime, which as a consequence achieves a privacy-utility trade-off nearly identical to that of the central Gaussian mechanism.
[LG-11] FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning CVPR2026
链接: https://arxiv.org/abs/2604.28024
作者: Zhiqiang Kou,Junxiang Wu,Wenke Huang,Wenwen He,Ming-Kun Xie,Changwei Wang,Yuheng Jia,Di Jiang,Yang Liu,Xin Geng,Qiang Yang
类目: Machine Learning (cs.LG)
*备注: Accepted by CVPR 2026. 11 pages, 6 figures
Abstract:Federated Multi-Label Learning is a distributed paradigm where multiple clients possess heterogeneous multi-label data and perform collaborative learning under privacy constraints without sharing raw data. However, modeling label correlations under heterogeneous distributions remains challenging. Due to client-specific label spaces and varying co-occurrence patterns, correlations learned by individual clients inevitably deviate from the global structure, a phenomenon we term label correlation drift. To address this, we propose FedHarmony, a framework that harmonizes heterogeneous label correlations across clients. It introduces consensus correlation, capturing agreement among other clients and serving as a global teacher to correct biased local estimates. During aggregation, FedHarmony evaluates each client by both data size and correlation quality, assigning weights accordingly. Moreover, we develop an accelerated optimization algorithm for FedHarmony and theoretically establish faster convergence without sacrificing accuracy. Experiments on real-world federated multi-label datasets show that FedHarmony consistently outperforms state-of-the-art methods.
[LG-12] Cost-Aware Learning
链接: https://arxiv.org/abs/2604.28020
作者: Clara Mohri,Amir Globerson,Haim Kaplan,Tomer Koren,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注:
Abstract:We consider the problem of Cost-Aware Learning, where sampling different component functions of a finite-sum objective incurs different costs. The objective is to reach a target error while minimizing the total cost. First, we propose the Cost-Aware Stochastic Gradient Descent algorithm for convex functions, and derive its cost complexity to attain an error of \epsilon . Furthermore, we establish a lower bound for this setting and provide a subset selection algorithm to further reduce the cost of training. We apply our theoretical insights to reinforcement learning with language models, where the computational cost of policy gradients varies with sequence length. To this end, we introduce Cost-Aware GRPO, an algorithm designed to reduce the cost of policy optimization while preserving performance. Empirical results on 1.5B and 8B LLMs demonstrate that our approach reduces the tokens used in policy optimization by up to about 30% while matching or exceeding baseline accuracy.
[LG-13] Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
链接: https://arxiv.org/abs/2604.28005
作者: Shijin Gong,Kai Ye,Jin Zhu,Xinyu Zhang,Hongyi Zhou,Chengchun Shi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages, 4 figures
Abstract:Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three approaches have been widely adopted: (i) Proximal policy optimization and advantage actor-critic rely on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. (ii) Group relative policy optimization (GRPO) avoids training a value network by approximating the value function using sample averages. However, GRPO samples a large number of reasoning traces per prompt to achieve accurate value function approximation, making it computationally expensive. (iii) REINFORCE-type algorithms sample only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. In this work, we focus on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization. Comments: 22 pages, 4 figures Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2604.28005 [cs.LG] (or arXiv:2604.28005v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.28005 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-14] Dynamic Scaled Gradient Descent for Stable Fine-Tuning for Classifications
链接: https://arxiv.org/abs/2604.27987
作者: Nghia Bui,Lijing Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fine-tuning pretrained models has become a standard approach to adapting pretrained knowledge to improve the accuracy on new sparse, imbalance datasets. However, issues arise when optimization falls into a collapsed state, where the model gets stuck, leading to degraded performance and unstable training. One possible reason for this is the cancellation of gradients across training examples. To address this problem, we propose a novel algorithm, dynamic scaled gradient descent (\mName), that directly modifies the gradients returned by training examples, specifically, scaling down the gradients of correctly classified examples using a dynamic scaler. This strategy offers both theoretical and empirical advantages in improving training stability. Experiments on a variety of benchmark datasets, spanning multiple tasks and large pretrained models, demonstrate that our method consistently reduces performance variance and surpasses the accuracy of existing approaches.
[LG-15] Differentiable latent structure discovery for interpretable forecasting in clinical time series
链接: https://arxiv.org/abs/2604.27967
作者: Ivan Lerner,Jean Feydy,Alexandre Kalimouttou,Anita Burgun,Francis Bach
类目: Machine Learning (cs.LG)
*备注: This manuscript is under review at BioData Mining
Abstract:Background: Timely, uncertainty-aware forecasting from irregular electronic health records (EHR) can support critical-care decisions, yet most approaches either impute to a grid or sacrifice interpretability. We introduce StructGP, a continuous-time multi-task Gaussian process that couples process convolutions with differentiable structure learning to uncover a sparse, ordered directed acyclic graph (DAG) of inter-variable dependencies while preserving principled uncertainty. We further propose LP-StructGP, which augments StructGP with latent pathways-shared, temporally shifted trajectories inferred via subject-specific coupling filters and a softmax gating mechanism-to capture cross-patient progression patterns. Both models are trained under sparsity and acyclicity constraints (augmented Lagrangian, Adam) using scalable low-rank updates. Results: In simulations, the approach reliably recovers ground-truth graphs (Structural Hamming Distance approaching 0 as cohorts grow) and pathway assignments (high Adjusted Rand Index). On a MIMIC-IV septic shock cohort (n=1,008; norepinephrine, creatinine, mean arterial pressure), StructGP improves short-horizon (6 h) forecasting over independent-task baselines (average RMSE 0.68 [95%CI: 0.63–0.74] vs. 0.88 [0.83-0.94]) and, with 15 additional inputs, markedly outperforms unstructured kernels (0.63 [0.58-0.69] vs. 3.02 [2.85-3.18]) with superior calibration (coverage 0.96 vs. 0.84). On the PhysioNet Challenge (12k patients, 41 variables), StructGP attains competitive accuracy (MAE 3.72e-2) relative to a state-of-the-art graph neural model while maintaining calibrated uncertainty. Conclusion: These results show that structured process convolutions with latent pathways deliver interpretable, scalable, and well-calibrated forecasting for irregular clinical time series.
[LG-16] Calibrating Attribution Proxies for Reward Allocation in Participatory Weather Sensing
链接: https://arxiv.org/abs/2604.27944
作者: Mark C. Ballandies,Michael T. C. Chiu,Claudio J. Tessone
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
Abstract:Large-scale IoT weather sensing networks require incentive mechanisms to sustain participation, yet determining how much value individual data contributions bring to the network remains an open problem. Existing approaches address data quality but not data valuation; in operational meteorology, adjoint-based methods derive value from the forecast model itself but require full data assimilation infrastructure. We propose to utilise differentiable AI weather models to fill this gap and characterise gradient-based attribution on gridded GFS analysis inputs as a candidate value signal, evaluating fidelity, calibration, cost, and gaming vulnerability across more than 400 configurations. Attribution captures near-optimal sensor placement utility with monotonically faithful payments, but can be inflated by adversarial inputs, with detection requiring external baseline data. These findings establish gradient attribution as a computationally validated signal for model-informed reward allocation in participatory weather sensing.
[LG-17] Beyond the Baseband: Adaptive Multi-Band Encoding for Full-Spectrum Bioacoustics Classification
链接: https://arxiv.org/abs/2604.27936
作者: Eklavya Sarkar,Marius Miron,David Robinson,Gagan Narula,Milad Alizadeh,Ellen Gilsenan-McMahon,Emmanuel Chemla,Olivier Pietquin,Matthieu Geist
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:Animals hear and vocalize across frequency ranges that differ substantially from humans, often extending into the ultrasonic domain. Yet most computational bioacoustics systems rely on audio models pre-trained at 16 kHz, restricting their usable bandwidth to the 0-8 kHz baseband and discarding higher-frequency information present in many bioacoustic recordings. We investigate a multi-band encoding framework that decomposes the full spectrum of animal calls into band features and fuses them into a unified representation. Similarity analyses on models show that certain encoders produce decorrelated band embeddings that improve class separation after fusion. Classification experiments on three bioacoustic datasets using eight pre-trained models and five fusion strategies show that fused representations consistently outperform the baseband and time-expansion baselines on two datasets, showing the potential of multi-band methods for full-spectrum encoding of animal calls.
[LG-18] Physical Foundation Models: Fixed hardware implementations of large-scale neural networks
链接: https://arxiv.org/abs/2604.27911
作者: Logan G Wright,Tianyu Wang,Tatsuhiro Onodera,Peter L. McMahon
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Foundation models are deep neural networks (such as GPT-5, Gemini~3, and Opus~4) trained on large datasets that can perform diverse downstream tasks – text and code generation, question answering, summarization, image classification, and so on. The philosophy of foundation models is to put effort into a single, large ( \sim10^12 -parameter) general-purpose model that can be adapted to many downstream tasks with no or minimal additional training. We argue that the rise of foundation models presents an opportunity for hardware engineers: in contrast to when different models were used for different tasks, it now makes sense to build special-purpose, fixed hardware implementations of neural networks, manufactured and released at the roughly 1-year cadence of major new foundation-model versions. Beyond conventional digital-electronic inference hardware with read-only weight memory, we advocate a more radical re-thinking: hardware in which the neural network is realized directly at the level of the physical design and operates via the hardware’s natural physical dynamics – \textitPhysical Foundation Models (PFMs). PFMs could enable orders-of-magnitude advantages in energy efficiency, speed, and parameter density. For \sim10^12 -parameter models, this would both reduce the high energy burden of AI in datacenters and enable AI in edge devices that today are power-constrained to far smaller models. PFMs could also enable inference hardware for models much larger than current ones: 10^15 - or even 10^18 -parameter PFMs seem plausible by some measures. We present back-of-the-envelope calculations illustrating PFM scaling using an optical example – a 3D nanostructured glass medium – and discuss prospects in nanoelectronics and other physical platforms. We conclude with the major research challenges that must be resolved for trillion-parameter PFMs and beyond to become reality. Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE) Cite as: arXiv:2604.27911 [cs.LG] (or arXiv:2604.27911v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.27911 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-19] Probabilistic Circuits for Irregular Multivariate Time Series Forecasting
链接: https://arxiv.org/abs/2604.27814
作者: Christian Klötergens,Vijaya Krishna Yalavarthi,Lars Schmidt-Thieme
类目: Machine Learning (cs.LG)
*备注:
Abstract:Joint probabilistic modeling is essential for forecasting irregular multivariate time series (IMTS) to accurately quantify uncertainty. Existing approaches often struggle to balance model expressivity with consistent marginalization, frequently leading to unreliable or contradictory forecasts. To address this, we propose CircuITS, a novel architecture for probabilistic IMTS forecasting based on probabilistic circuits. Our model is flexible in capturing intricate dependencies between time series channels while structurally guaranteeing valid joint distributions. Experiments on four real world datasets demonstrate that CircuITS achieves superior joint and marginal density estimation compared to state of the art baselines.
[LG-20] Hyper-Dimensional Fingerprints as Molecular Representations
链接: https://arxiv.org/abs/2604.27810
作者: Jonas Teufel,Luca Torresi,André Eberhard,Pascal Friederich
类目: Machine Learning (cs.LG)
*备注: Code: this https URL
Abstract:Computational molecular representations underpin virtual screening, property prediction, and materials discovery. Conventional fingerprints are efficient and deterministic but lose structural information through hash-based compression, particularly at low dimensionalities. Learned representations from graph neural networks recover this expressiveness but require task-specific training and substantial computational resources. Here we introduce hyperdimensional fingerprints (HDF), which replace the learned transformations of message-passing neural networks with algebraic operations on high-dimensional vectors, producing deterministic molecular representations without any training. Across diverse property prediction benchmarks, HDF outperforms conventional fingerprints in the majority of tasks while exhibiting greater consistency across datasets and models. Crucially, HDF embeddings preserve molecular similarity faithfully: at 32 dimensions, distances in HDF space achieve a 0.9 Pearson correlation with graph edit distance, compared to 0.55 for Morgan fingerprints at equivalent size. This structural fidelity persists at low dimensions where hash-based methods degrade, allowing simple nearest-neighbor regression to remain predictive with as few as 64 components. We further demonstrate the practical impact in Bayesian molecular optimization, where HDF-based surrogate models achieve substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search. HDF thus provides a general-purpose, training-free alternative to conventional molecular fingerprints, suggesting that the information loss long accepted as inherent to fixed-length fingerprints is a limitation of the hash-based encoding scheme rather than the fingerprint paradigm itself.
[LG-21] On the Expressive Power of GNNs to Solve Linear SDPs
链接: https://arxiv.org/abs/2604.27786
作者: Chendi Qian,Christopher Morris
类目: Machine Learning (cs.LG)
*备注:
Abstract:Semidefinite programs (SDPs) are a powerful framework for convex optimization and for constructing strong relaxations of hard combinatorial problems. However, solving large SDPs can be computationally expensive, motivating the use of machine learning models as fast computational surrogates. Graph neural networks (GNNs) are a natural candidate in this setting due to their sparsity-awareness and ability to model variable-constraint interactions. In this work, we study what expressive power is sufficient to recover optimal SDP solutions. We first prove negative results showing that standard GNN architectures fail on recovering linear SDP solutions. We then identify a more expressive architecture that captures the key structure of SDPs and can, in particular, emulate the updates of a standard first-order solver. Empirically, on both synthetic and \textscSdpLib benchmarks of various classes of SDPs, this more expressive architecture achieves consistently lower prediction error and objective gap than theoretically weaker baselines. Finally, using the learned high-quality predictions to warm-start the first-order solver yields practical speedups of up to 80%.
[LG-22] Linear-Core Surrogates: Smooth Loss Functions with Linear Rates for Classification and Structured Prediction
链接: https://arxiv.org/abs/2604.27742
作者: Mehryar Mohri,Yutao Zhong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The choice of loss function in classification involves a fundamental trade-off: smooth losses (like Cross-Entropy) enable fast optimization rates but yield slow square-root consistency bounds, while piecewise-linear losses (like Hinge) offer fast linear consistency rates but suffer from non-differentiability. We propose Linear-Core (LC) Surrogates, a new family of convex loss functions that resolve this tension by stitching a linear core to a smooth tail. We prove that these surrogates are differentiable everywhere while retaining strict linear H -consistency bounds, effectively combining the optimization benefits of smoothness with the statistical efficiency of margin-based losses. In the structured prediction setting, we show that this smoothness unlocks a massive computational and energy advantage: it allows for an unbiased stochastic gradient estimator that bypasses the quadratic complexity O(|\mathscrY|^2) of exact inference (e.g., Viterbi). Empirically, our method achieves a 23 \times speedup over Structured SVMs on large-vocabulary sequence tagging tasks and demonstrates superior robustness to instance-dependent label noise, outperforming Cross-Entropy by 2.6% on corrupted CIFAR-10.
[LG-23] Differential Subgroup Discovery: Characterizing Where Two Populations Differ and Why
链接: https://arxiv.org/abs/2604.27741
作者: Sascha Xu,Jilles Vreeken
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of understanding where two populations differ within a feature space, which we formalize in the concept of a differential subgroup: a subset of individuals from both populations who, despite sharing similar characteristics, exhibit exceptional differences in a target outcome. Differential subgroups reveal the regions of the feature space where population-level gaps are most pronounced and can help practitioners identify the covariate combinations that are structurally responsible for these differences, e.g.~in clinical analysis, model diagnostics, or treatment-effect studies. We introduce a general optimization objective for discovering differential subgroups and establish conditions under which the resulting subgroups admit a causal interpretation of population differences. We propose DiffSub, a gradient-based approach that discovers interpretable differential subgroups in tabular data. Across synthetic benchmarks, medical case studies, model-error analyses, and treatment-effect settings, DiffSub identifies informative subgroups that reveal where population differences arise and why.
[LG-24] Mind the Gap: Structure-Aware Consistency in Preference Learning
链接: https://arxiv.org/abs/2604.27733
作者: Mehryar Mohri,Yutao Zhong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surrogate losses as proxies for the intractable pairwise ranking loss. However, we demonstrate that for the equicontinuous hypothesis sets typical of neural networks, these standard surrogates are theoretically inconsistent, yielding vacuous generalization guarantees. To resolve this, we formulate LLM alignment within a margin-shifted ranking framework. We derive rigorous H -consistency bounds that depend on enforcing a separation margin \gamma . Crucially, we extend this to Structure-Aware H -consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs. Finally, we analyze the trade-off between consistency and model limitations via the Margin-Capacity Profile, proving that heavy-tailed surrogates (such as the Polynomial Hinge family) offer superior consistency guarantees for capacity-bounded models compared to the standard logistic loss used in DPO.
[LG-25] Optimized Deferral for Imbalanced Settings
链接: https://arxiv.org/abs/2604.27723
作者: Corinna Cortes,Anqi Mao,Mehryar Mohri,Yutao Zhong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational cost. This approach, known as learning to defer, is essential in domains like natural language generation, medical diagnosis, and computer vision, where an effective deferral can reduce errors at low extra resource consumption. However, the two-stage learning to defer setting, which leverages existing predictors such as a collection of LLMs or other classifiers, often faces challenges due to an expert imbalance problem. This imbalance can lead to suboptimal performance, with deferral algorithms favoring the majority expert. We present a comprehensive study of two-stage learning to defer in expert imbalance settings. We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings. Extensive experiments demonstrate the effectiveness of our approach, showing clear improvements over existing baselines on both image classification and real-world Large Language Model (LLM) routing tasks.
[LG-26] Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?
链接: https://arxiv.org/abs/2604.27667
作者: Buqing Ou,Frederike Dümbgen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures
Abstract:Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good performance, whereas more global and less initialization-sensitive search methods typically incur high rollout costs. We propose TFM-S3, a tabular hybrid local-global method for improving global exploration in robot policy learning with limited rollout cost. We interleave high-frequency local updates with intermittent rounds of global search. In each search round, we construct a dynamically updated low-dimensional policy subspace via SVD and perform iterative surrogate-guided refinement within this space. A pretrained tabular foundation model predicts candidate returns from a small context set, enabling large-scale screening with limited rollout cost. Experiments on continuous control benchmarks show that TFM-S3 consistently accelerates early-stage convergence and improves final performance compared to TD3 and population-based baselines under an identical rollout budget. These results demonstrate that foundation models are a powerful new tool for creating sample-efficient policy learning methods for continuous control in robotics.
[LG-27] Green Physics-Informed Machine Learning Models For Structural Health Monitoring
链接: https://arxiv.org/abs/2604.27638
作者: Daisy R Bradley,Elizabeth J Cross
类目: Machine Learning (cs.LG)
*备注: 11 pages, 6 figures
Abstract:Machine learning continues to emerge as an important tool to be utilised within structural engineering and structural health monitoring, due to its ability to accurately and quickly perform both regression and classification tasks. However, a purely data driven approach has its limitations, particularly where we lack data from relevant environmental and operational conditions, a situation that has led to the development of physics-informed machine learners for structural health monitoring. These “grey-box” models take into account the physical insight that an engineer would have about the structure they are modelling and have shown promising results in the structural engineering field among many others. This work compares black and grey-box models through a “green” lens, comparing them in terms of their environmental impact, and investigating how the high extrapolative performance of grey-box models can reduce their runtimes and therefore carbon emissions. The authors aim to develop physics-informed models with reduced computational costs, while maintaining high performance, illustrated through a structural health monitoring case study.
[LG-28] AMGenC: Generating Charge Balanced Amorphous Materials
链接: https://arxiv.org/abs/2604.27613
作者: Yan Lin,Jilin Hu,N. M. Anoop Krishnan,Morten M. Smedskjaer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Amorphous (disordered) materials are solids that have shown great potential in various domains, including energy storage, thermal management, and advanced materials. Unlike crystalline materials that can be described by unit cells containing a few to hundreds of atoms, amorphous materials require larger simulation cells with at least hundreds to thousands of atoms. To advance the design of amorphous materials with desired properties and facilitate the exploration of their vast design space, generative inverse design has emerged as a promising approach. It aims to directly output materials with properties closely aligned with the desired ones using probabilistic generative models conditioned on desired properties, which can be more resource efficient than the traditional trial-and-error approach. However, due to the inherent stochasticity of probabilistic generative models, when element assignments are unconstrained, a large portion of generated materials may be charge unbalanced, and no existing methods can effectively mitigate this limitation. In this work, we propose AMGenC, a new generative inverse design method for amorphous materials that can guarantee the generation of charge balanced samples, with minimal additional computational overhead and without sacrificing inverse design accuracy. AMGenC achieves this through an element noise that gives the generation process a starting point centered around charge balance, and the combination of a per-step soft projection and a final discrete projection for steering the elements toward exact charge balance throughout the generation. We perform extensive experiments on two amorphous materials datasets. Experimental results provide evidence that AMGenC achieves its design goal.
[LG-29] Privacy-Preserving Federated Learning via Differential Privacy and Homomorphic Encryption for Cardiovascular Disease Risk Modeling
链接: https://arxiv.org/abs/2604.27598
作者: Gaurang Sharma,Juha Pajula,Aada Illikainen,Markus Rautell,Noora Lipsonen,Petri Alhainen,Mika Hilvo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Protecting sensitive health data while enabling collaborative analysis is a central challenge in healthcare. Traditional machine learning (ML) requires institutions to pool anonymized patient records, centralizing analytical development and privacy risks at a single site. Privacy-enhancing technologies (PETs), including Differential Privacy (DP) and Homomorphic Encryption (HE), can mitigate these risks. However, they are mainly studied in conventional data-sharing settings and often introduce trade-offs, including reduced model utility, higher computational cost, and increased implementation complexity. Federated Learning (FL) reduces data centralization by enabling institutions to train models locally and share only model updates. Nevertheless, FL does not eliminate privacy risks, as shared parameters or gradients may still reveal sensitive information. Integrating DP or HE into FL can strengthen privacy guarantees, yet their comparative performance and deployment implications in real-world healthcare settings remain unclear. We systematically evaluated DP and HE integration in FL under real-world conditions, comparing them with standard FL and centralized ML (cML) to quantify privacy-utility trade-offs in multi-institutional settings. Using nationwide Swedish healthcare data, we evaluated cardiovascular disease risk prediction using logistic regression (LR) and neural network (NN) learners. FL with HE achieved performance comparable to cML but introduced measurable cryptographic overhead, particularly in the NN implementation. FL with DP incurred lower computational cost; however, LR was more sensitive to calibrated noise than the NN, resulting in greater performance degradation. Our findings provide practical guidance for deploying privacy-preserving FL in fragmented healthcare systems. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.27598 [cs.LG] (or arXiv:2604.27598v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.27598 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-30] BAss: Symbolic Reasoning in Abstract Dialectical Frameworks
链接: https://arxiv.org/abs/2604.27576
作者: Samuel Pastva,Van-Giang Trinh
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注:
Abstract:We present BAss (BDD-based ADF symbolic solver), a novel analysis tool for Abstract Dialectical Frameworks (ADFs) based on Binary Decision Diagrams (BDDs). It supports the fully symbolic computation of all admissible, complete, and preferred interpretations, as well as two-valued and stable models of an ADFs. Our approach is inspired by the recently discovered equivalence between Boolean Networks (BNs) and ADFs by Heyninck et al. (2024) and Azpeitia et al. (2024), significantly extending current BDD-based tools bioLQM, AEON, and adf-bdd. We conducted experiments on a large-scale collection of real-world models from both the BN and ADF communities. Our results show that BAss dramatically outperforms previous BDD-based tools and is competitive (even significantly better in some cases) with state-of-the-art SAT/ASP-based methods, particularly in scenarios involving large solution spaces. Notably, BAss is able to enumerate all fixed points or minimal trap spaces of certain biological networks beyond the reach of existing tools, thereby enabling new analysis and case studies in systems biology. These results highlight the practical relevance of symbolic reasoning for complex real-world applications, particularly in systems biology and formal argumentation.
[LG-31] Learning from a single labeled face and a stream of unlabeled data
链接: https://arxiv.org/abs/2604.27564
作者: Branislav Kveton,Michal Valko
类目: Machine Learning (cs.LG)
*备注: Published at IEEE International Conference on Automatic Face and Gesture Recognition (FG 2013). doi: https://doi.org/10.1109/FG.2013.6553720
Abstract:Face recognition from a single image per person is a challenging problem because the training sample is extremely small. We consider a variation of this problem. In our problem, we recognize only one person, and there are no labeled data for any other person. This setting naturally arises in authentication on personal computers and mobile devices, and poses additional challenges because it lacks negative examples. We formalize our problem as one-class classification, and propose and analyze an algorithm that learns a non-parametric model of the face from a single labeled image and a stream of unlabeled data. In many domains, for instance when a person interacts with a computer with a camera, unlabeled data are abundant and easy to utilize. This is the first paper that investigates how these data can help in learning better models in the single-image-per-person setting. Our method is evaluated on a dataset of 43 people and we show that these people can be recognized 90% of time at nearly zero false positives. This recall is 25+% higher than the recall of our best performing baseline. Finally, we conduct a comprehensive sensitivity analysis of our algorithm and provide a guideline for setting its parameters in practice.
[LG-32] Bayesian policy gradient and actor-critic algorithms
链接: https://arxiv.org/abs/2604.27563
作者: Mohammad Ghavamzadeh,Yaakov Engel,Michal Valko
类目: Machine Learning (cs.LG)
*备注: Published in Journal of Machine Learning Research 17(66):1-53, 2016
Abstract:Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate the gradient, which tend to have high variance, requiring many samples and resulting in slow convergence. We first propose a Bayesian framework for policy gradient, based on modeling the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient and a measure of the uncertainty in the gradient estimates, namely, the gradient covariance, are provided at little extra cost. Since the proposed framework considers system trajectories as its basic observable unit, it does not require the dynamics within trajectories to be of any particular form, and can be extended to partially observable problems. On the downside, it cannot exploit the Markov property when the system is Markovian. To address this, we supplement our Bayesian policy gradient framework with a new actor-critic learning model in which a Bayesian class of non-parametric critics, based on Gaussian process temporal difference learning, is used. Such critics model the action-value function as a Gaussian process, allowing Bayes rule to be used to compute the posterior distribution over action-value functions, conditioned on the observed data. Appropriate choices of the policy parameterization and of the prior covariance (kernel) between action-values yield closed-form expressions for the posterior of the gradient of the expected return with respect to the policy parameters. We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, on a number of reinforcement learning problems.
[LG-33] Online semi-supervised perception: Real-time learning without explicit feedback CVPR2010
链接: https://arxiv.org/abs/2604.27562
作者: Branislav Kveton,Michal Valko,Matthai Phillipose,Ling Huang
类目: Machine Learning (cs.LG)
*备注: IEEE Computer Vision and Pattern Recognition Workshop on Online Learning for Computer Vision (CVPR 2010 OLCV)
Abstract:This paper proposes an algorithm for real-time learning without explicit feedback. The algorithm combines the ideas of semi-supervised learning on graphs and online learning. In particular, it iteratively builds a graphical representation of its world and updates it with observed examples. Labeled examples constitute the initial bias of the algorithm and are provided offline, and a stream of unlabeled examples is collected online to update this bias. We motivate the algorithm, discuss how to implement it efficiently, prove a regret bound on the quality of its solutions, and apply it to the problem of real-time face recognition. Our recognizer runs in real time, and achieves superior precision and recall on 3 challenging video datasets.
[LG-34] Diagnosing Capability Gaps in Fine-Tuning Data
链接: https://arxiv.org/abs/2604.27547
作者: Saeid Asgari Taghanaki,Rakshanda Agarwal,Bruce Sun,Rohan Jha,Elias Stengel-Eskin,Sara Malvar,Rui Ying,Yifei Xu,Guilherme Potje,Tusher Chakraborty,Leonardo de Oliveira Nunes,Ranveer Chandra,Emre Kiciman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fine-tuning large language models (LLMs) for domain-specific tasks requires training datasets that comprehensively cover the target capabilities a practitioner needs. Yet identifying which capabilities a dataset fails to support, and doing so before an expensive fine-tuning run, remains a largely unsolved problem. We introduce GoalCover, a framework that helps practitioners systematically detect capability gaps in fine-tuning datasets through interactive goal decomposition and automated coverage assessment. GoalCover guides a practitioner through structured decomposition of a high-level goal into atomic, independently evaluable subgoals; assigns each training sample an LLM-based alignment score against every subgoal; and surfaces missing capabilities through automated analysis of low-scoring sample explanations. We validate the framework along two complementary axes. First, through controlled corruption experiments across three domains (medical QA, legal summarization, code generation), we show that GoalCover reliably distinguishes targeted from non-targeted capability impacts: target subgoals degrade by 25.6% on average versus 2.1% for non-target subgoals (Cohen’s d=1.24). Second, we demonstrate downstream utility on a financial-summarization Reinforcement Fine-Tuning (RFT) task with Qwen-3-14B: training on GoalCover-filtered data improves the LLM-judge reward from 3.77 to 4.12 (out of 5) over the unfiltered baseline, and combining filtered data with goal-conditioned synthetic samples yields the strongest result (4.20). The two results together show that GoalCover works as a practical pre-fine-tuning diagnostic: it detects capability gaps and produces concrete signal for closing them.
[LG-35] Low Rank Adaptation for Adversarial Perturbation
链接: https://arxiv.org/abs/2604.27487
作者: Han Liu,Shanghao Shi,Yevgeniy Vorobeychik,Chongjie Zhang,Ning Zhang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Low-Rank Adaptation (LoRA), which leverages the insight that model updates typically reside in a low-dimensional space, has significantly improved the training efficiency of Large Language Models (LLMs) by updating neural network layers using low-rank matrices. Since the generation of adversarial examples is an optimization process analogous to model training, this naturally raises the question: Do adversarial perturbations exhibit a similar low-rank structure? In this paper, we provide both theoretical analysis and extensive empirical investigation across various attack methods, model architectures, and datasets to show that adversarial perturbations indeed possess an inherently low-rank structure. This insight opens up new opportunities for improving both adversarial attacks and defenses. We mainly focus on leveraging this low-rank property to improve the efficiency and effectiveness of black-box adversarial attacks, which often suffer from excessive query requirements. Our method follows a two-step approach. First, we use a reference model and auxiliary data to guide the projection of gradients into a low-dimensional subspace. Next, we confine the perturbation search in black-box attacks to this low-rank subspace, significantly improving the efficiency and effectiveness of the adversarial attacks. We evaluated our approach across a range of attack methods, benchmark models, datasets, and threat models. The results demonstrate substantial and consistent improvements in the performance of our low-rank adversarial attacks compared to conventional methods. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2604.27487 [cs.LG] (or arXiv:2604.27487v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.27487 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-36] oward Scalable SDN for LEO Mega-Constellations: A Graph Learning Approach
链接: https://arxiv.org/abs/2604.27478
作者: Sivaram Krishnan,Bassel Al Homssi,Zhouyou Gu,Jihong Park,Sung-Min Oh,Jinho Choi
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Terrestrial network limitations drive the integration of non-terrestrial networks (NTNs), notably mega-constellations comprising thousands of low Earth orbit (LEO) satellites. While these satellites act as interconnected network switches via inter-satellite links (ISLs), their massive scale creates severe bottlenecks for network management. To address this, we propose a scalable, hierarchical software-defined networking (SDN) framework. Our architecture leverages graph neural networks (GNNs) to compactly represent the constellation topology, and Koopman theory to linearize nonlinear dynamics. Specifically, a Graph Koopman Autoencoder (GKAE) forecasts spatio-temporal behavior within a linear subspace for each orbital shell. A central SDN controller then aggregates these shell-level predictions for globally coordinated control. Simulations on the Starlink constellation demonstrate that our approach achieves at least a 42.8% improvement in spatial compression and a 10.81% improvement in temporal forecasting compared to established baselines, all while utilizing a significantly smaller model footprint.
[LG-37] ChipLingo: A Systematic Training Framework for Large Language Models in EDA
链接: https://arxiv.org/abs/2604.27415
作者: Lei Li,Xingwen Yu,Jianguo Ni,Junxuan Zhu,Jieqiong Zhang,Jian Zhao,Zhi Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:With the rapid advancement of semiconductor technology, Electronic Design Automation (EDA) has become an increasingly knowledge-intensive and document-driven engineering domain. Although large language models (LLMs) have shown strong general capabilities, applying them directly to EDA remains challenging due to limited domain expertise, cross-tool knowledge confusion, and degraded retrieval-augmented generation (RAG) performance after domain training. To address these issues, this paper presents ChipLingo, a systematic training pipeline for domain-adapted LLMs tailored to EDA scenarios. ChipLingo consists of three stages: domain corpus construction with multi-source data curation and QA augmentation, domain-adaptive pretraining with comparisons of different parameter training strategies, and instruction alignment with RAG scenario training under diverse retrieval conditions. We also curate an internal benchmark, EDA-Bench, covering representative EDA tool scenarios, with plans for public release. Experiments show that ChipLingo-8B achieves 59.7% accuracy on EDA-Bench, outperforming the same-scale base model and some larger general-purpose models. ChipLingo-32B reaches 70.02%, approaching leading closed-source commercial models. Further analysis shows that QA augmentation improves domain performance, Partial FT offers a better balance between adaptation and general capability retention than LoRA, and explicit RAG scenario training mitigates the decline in retrieval utilization after domain training. These results demonstrate the practical value of systematic domain training for knowledge-intensive EDA tasks and provide a foundation for future EDA agents and external-knowledge-driven systems. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.27415 [cs.LG] (or arXiv:2604.27415v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.27415 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-38] Detecting is Easy Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift
链接: https://arxiv.org/abs/2604.27411
作者: Haiyang Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Visual model-based reinforcement learning (MBRL) agents can perform well on the training distribution, but often break down once the test environment shifts. In visual MBRL, recognizing that a shift has occurred is often the easier part; the harder part is turning that recognition into useful action-level correction. We study several ways of responding to shift, including planning penalties, direct fine-tuning, global residual correction, and coarse gating. In our experiments, these approaches either do not improve closed-loop control or hurt in-distribution (ID) performance. Based on these negative results, we propose JEPA-Indexed Local Expert Growth. The method uses a frozen JEPA representation only for problem indexing, while cluster-specific residual experts add local action corrections on top of the original controller. The baseline controller itself is not modified. Using paired-bootstrap evaluation, we find that the original naive-preference variant is not stable under stricter testing. In contrast, the harder-pair variant produces statistically significant OOD improvements on all four evaluated shift conditions while preserving ID performance. The learned experts also remain useful when the same shift is encountered again, which supports the view of adaptation as incremental knowledge growth rather than repeated full retraining. We further show that automatic ID rejection can be achieved with simple density models, whereas fine-grained discrimination among OOD sub-families is limited by the representation. Overall, the results indicate that, for visual MBRL under distribution shift, the main challenge is not simply noticing that the environment has changed, but applying the right local action correction after the change has been recognized.
[LG-39] Stable but Wrong: An Inference Limit in Galactic Archaeology
链接: https://arxiv.org/abs/2604.27368
作者: Zhipeng Zhang
类目: Machine Learning (cs.LG); Astrophysics of Galaxies (astro-ph.GA)
*备注:
Abstract:Statistical inference in observational science typically relies on a fundamental assumption: as sample size increases and uncertainties decrease, the inferred results should converge to the true physical quantities. This assumption underpins the notion that big data lead to more reliable conclusions. In Galactic archaeology, stellar ages inferred from spectroscopic surveys are widely used to reconstruct the formation history of the Milky Way disk. The age metallicity relation (AMR) and its derived formation timescale are often regarded as key physical diagnostics of early disk evolution. This interpretation carries an implicit premise: that observational quality does not introduce systematic bias into age inference. Here we show that this premise may fail. Using a large sample of subgiant stars, we identify a region in the observational quality parameter space (signal-to-noise ratio and parallax precision) where the inferred formation timescale exhibits a systematic offset of 0.5-1 Gyr relative to an independent asteroseismic reference, while the statistical uncertainties remain small, thus producing a stable-but-wrong inference state.
[LG-40] A Short Note on Batch-efficient Divide-and-Conquer Algorithm for EigenDecomposition
链接: https://arxiv.org/abs/2604.27325
作者: Yue Song
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. One crucial bottleneck limiting its usage is the expensive computation cost, particularly for a mini-batch of matrices in deep neural networks. Our previous work proposed a dedicated QR-based ED algorithm for batched small matrices (dim 32 ). This short paper targets the limitation and proposes a batch-efficient Divide-and-Conquer based ED algorithm for larger matrices. The numerical test shows that for a mini-batch of matrices whose dimensions are smaller than 64 , our method can be much faster than the Pytorch SVD function.
[LG-41] REBENCH: A Procedural Fair-by-Construction Benchmark for LLM s on Stripped-Binary Types and Names (Extended Version)
链接: https://arxiv.org/abs/2604.27319
作者: Jun Yeon Won,Xin Jin,Shiqing Ma,Zhiqiang Lin
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: This is an extended version of our paper, which appears in AIWare 2026
Abstract:Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source code and a diverse collection of binaries spanning multiple architectures and optimization levels. REBench adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth, ensuring that task difficulty is preserved while maintaining universal applicability. This design enables fair evaluation across tasks while avoiding simplifications that could bias results. As a use case, we apply REBench to measure the reverse engineering performance of LLMs and the result demonstrates difficulties in complex tasks.
[LG-42] he Likelihood Ratio Wall: Structural Limits on Accurate Risk Assessment for Rare Violence
链接: https://arxiv.org/abs/2604.27282
作者: Marco Pollanen
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 16 pages, 2 figures, 8 tables. Accepted to the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
Abstract:Pretrial risk assessment tools are used on over one million U.S. defendants each year, yet their use for predicting rare violent re-offense faces a basic statistical barrier. We derive a universal precision bound – the Likelihood Ratio Wall – showing that when violent re-arrest rates are low (2-5%), achieving even a 50% hit rate among people labeled “high risk” (positive predictive value, or PPV) would require tools far more discriminative than current instruments appear to be. For rare outcomes, a tool can have respectable-looking performance metrics and still be wrong most of the time it flags someone as “high risk for violence.” We show that post-hoc score recalibration cannot solve this problem because it does not improve the tool’s underlying ability to separate true positives from false positives. We further prove a Surveillance Ceiling: when over-policing inflates recorded “risk factors” among those who would not re-offend, the maximum achievable precision is structurally lower for over-policed groups, even at equal offense rates. We translate these results into the Number Needed to Detain (how many people must be detained to prevent one violent offense), and propose that risk reports should communicate this uncertainty explicitly. Our findings suggest that for rare violent outcomes, debates about fairness metrics alone are incomplete: under current data regimes, the available features may not support high-confidence individualized detention decisions.
[LG-43] Predicting Covariate-Driven Spatial Deformation for Nonstationary Gaussian Processes
链接: https://arxiv.org/abs/2604.27280
作者: Minghao Gu,Weizhi Lin,Qiang Huang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Nonstationary Gaussian processes (GPs) are essential for modeling complex, locally heterogeneous spatial data. A common modeling approach is the spatial deformation method that warps the domain to recover isotropy. However, this static method does not account for changes in spatial correlation induced by covariates, limiting its ability to predict nonstationary GPs under new covariate conditions. To enable predictive modeling of the deformation method, we propose to model the spatial deformation as a function of covariates. The spaces of diffeomorphic deformations and Euclidean covariate vectors are connected by characterizing deformations as generated by velocity fields living in a Lie algebra. To overcome the estimation instability caused by high-order interactions between multiple covariates in a general Lie algebra, we prove that those interactions can be truncated with a moderate physical assumption. Based on the theoretical results, a concise functional form of deformations driven by multiple covariates can be established, and an efficient estimation-inference algorithm is developed for out-of-sample nonstationary GP prediction with limited covariate-deformation sample pairs. The effectiveness and generalizability of the method are demonstrated on a simulation study and two case studies, in the fields of manufacturing and geostatistics, respectively. Subjects: Machine Learning (cs.LG); Methodology (stat.ME) Cite as: arXiv:2604.27280 [cs.LG] (or arXiv:2604.27280v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.27280 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-44] Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors and the Model Deploys Fully On-Device
链接: https://arxiv.org/abs/2604.27279
作者: Nazar Kozak
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 8 pages, 4 figures, 9 tables. Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Abstract:Audio-based stuttering systems to date have been trained for detection – what disfluency is present now – leaving prediction, the capability needed for closed-loop intervention, unstudied at deployable scale. We train a 616K-parameter CNN on SEP-28k (Apple, 20,131 three-second clips) to predict whether the next contiguous clip contains any disfluency. (1) Severity-selective precursor signal: on the episode-grouped test set, aggregate preblock AUC is modest (0.581 [0.542, 0.619]), but stratifying by upcoming event type reveals concentration on clinically severe events – blocks 0.601 [0.554, 0.651] and sound repetitions 0.617 [0.567, 0.667] both exclude chance, while fillers (0.45) and word repetitions (0.49) are at chance. The aggregate objective converges to a severity-selective predictor because severe events carry prosodic precursors; fillers do not. (2) Cross-population transfer: without fine-tuning, the same checkpoint applied to 1,024 pediatric Children-Who-Stutter utterances (FluencyBank Teaching) attains AUC 0.674 detection and 0.655 prediction; DisfluencySpeech and LibriStutter reach 0.58-0.60 AUC. (3) Deployable on-device: lossless export to CoreML (1.19 MB), ONNX (40 KB), TFLite. Neural-Engine latency per 3 s window: 0.25 ms (iPhone 17 Pro Max, A19 Pro) to 0.55 ms (iPhone SE 3rd-gen and M1 Max). A 4 Hz streaming simulation uses 0.54% of the real-time budget. Platt-calibrated outputs (test ECE 0.010, from 0.177 raw). Five negative ablations – output-level Future-Guided Learning, multi-clip GRU, time-axis concatenation, asymmetric focal loss, direct block-targeted training – none improved over the vanilla baseline. Comments: 8 pages, 4 figures, 9 tables. Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS) ACMclasses: I.5.4; I.2.7 Cite as: arXiv:2604.27279 [cs.SD] (or arXiv:2604.27279v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2604.27279 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Nazar Kozak [view email] [v1] Thu, 30 Apr 2026 00:30:28 UTC (53 KB) Full-text links: Access Paper: View a PDF of the paper titled Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors, and the Model Deploys Fully On-Device, by Nazar KozakView PDFHTML (experimental)TeX Source view license Current browse context: cs.SD prev | next new | recent | 2026-04 Change to browse by: cs cs.LG eess eess.AS References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[LG-45] AutoREC: A software platform for developing reinforcement learning agents for equivalent circuit model generation from electrochemical impedance spectroscopy data
链接: https://arxiv.org/abs/2604.27266
作者: Ali Jaberi(1),Yonatan Kurniawan(2),Robert Black(1),Shayan Mousavi M.(1),Kabir Verma(3),Zoya Sadighi(1),Santiago Miret(4),Jason Hattrick-Simpers(2) ((1) Clean Energy Innovation Research Center, National Research Council Canada, Mississauga, ON, Canada, (2) Department of Material Science and Engineering, University of Toronto, Toronto, ON, Canada, (3) Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada, (4) Lila Sciences, San Francisco, CA, USA)
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:This paper introduces AutoREC, an open-source Python package for developing reinforcement learning (RL) agents to automatically generate equivalent circuit models (ECMs) from electrochemical impedance spectroscopy (EIS) data. While ECMs are a standard framework for interpreting EIS data, traditional identification is typically based on manual trial-and-error, which requires domain experts and limits scalability, particularly in autonomous experimental pipelines such as self-driving laboratories. AutoREC addresses this challenge by formulating ECM construction as a sequential decision-making problem within a Markov Decision Process framework. It implements a Double Deep Q-Network with prioritized experience replay, along with a dedicated dead-loop mitigation strategy, to efficiently explore a complex action space for circuit generation. To demonstrate the capabilities of the platform, we trained an RL agent using AutoREC and evaluated its strengths and limitations across diverse datasets, while also discussing possible strategies to mitigate these limitations in future agent designs. The trained agent achieved a success rate exceeding 99.6% on synthetic datasets and demonstrated strong generalization to unseen experimental EIS data from batteries, corrosion, oxygen evolution reaction, and CO _2 reduction systems. These results position AutoREC as a promising platform for adaptive and data-driven ECM generation, with potential for integration into automated electrochemical workflows.
[LG-46] Analytical Correction for Subsampling Bias in Drifting Models
链接: https://arxiv.org/abs/2604.27239
作者: Jiaru Zhang,Zeyun Deng,Juanwu Lu,Ziran Wang,Ruqi Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Drifting models are capable one-step generative models trained to follow a drifting field. The field combines attractive and repulsive softmax-weighted centroids over the data and current-generator distributions. In practice, only a minibatch of n samples from each distribution is available, and each centroid is approximated by an empirical estimate. In this paper, we begin by showing that the minibatch centroid is in general a biased estimator of the target centroid, with a pointwise O(1/n) bias arising from softmax self-normalization. Correcting this bias requires the expectation over the full distribution, which is intractable. We instead approximate the leading bias term from in-batch statistics and propose Analytical Bias Correction (ABC), a closed-form plug-in adjustment. We prove that ABC reduces the bias from O(1/n) to O(1/n^2) , introduces no first-order increase in total variance, and preserves convex-hull containment of the corrected centroid. In practice, ABC requires only two additional lines of code and has negligible wall-time overhead under compiled execution. Toy experiments confirm the theoretical O(1/n) and O(1/n^2) scaling. On CIFAR-10, ABC reduces FID and trains faster, with the largest gains at small n , where the bias is most significant.
[LG-47] Remaining Useful Life Estimation for Turbofan Engines: A Comparative Study of Classical CNN and LSTM Approaches
链接: https://arxiv.org/abs/2604.27234
作者: Astitva Goel,Samarth Galchar,Sumit Kanu
类目: Machine Learning (cs.LG)
*备注: 7 pages, 5 algorithms, 7 figures, 4 tables
Abstract:Remaining Useful Life (RUL) estimation is a critical component of Prognostics and Health Management (PHM), enabling proactive maintenance scheduling and reducing unplanned failures in industrial equipment. This paper presents a comparative study of machine learning approaches for RUL estimation on the NASA C-MAPSS turbofan engine dataset: classical baselines (Ridge Regression, Polynomial Ridge, and XGBoost), a 1D Convolutional Neural Network (CNN), and a Long Short-Term Memory (LSTM) network. All models are evaluated on the FD001 and FD003 subsets under an identical preprocessing pipeline to ensure a fair comparison. Among raw-sequence models, the LSTM achieves RMSE of 14.93 and 14.20 on FD001 and FD003 respectively, outperforming the deep LSTM reported by Zheng et al.~\citepaper (RMSE 16.14 and 16.18) despite using a simpler single-layer architecture. The 1D CNN achieves RMSE of 16.97 on FD001 and 15.68 on FD003, demonstrating competitive performance on FD003 while producing more conservative RUL predictions on FD001. Ridge Regression is evaluated on raw and engineered features, while other classical models use only engineered inputs. XGBoost achieves an RMSE of 13.36 on FD003, highlighting the competitiveness of nonlinear modeling.
[LG-48] Context-Aware Graph Attention for Unsupervised Telco Anomaly Detection
链接: https://arxiv.org/abs/2604.27172
作者: Sara Malacarne,Eirik Hoel-Høiseth,Erlend Aune,David Zsolt Biro,Massimiliano Ruocco
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose C-MTAD-GAT, an \emphunsupervised, \emphcontext-aware graph-attention model for anomaly detection in multivariate time series from mobile networks. C-MTAD-GAT combines graph attention with lightweight context embeddings, and uses a deterministic reconstruction head and multi-step forecaster to produce anomaly scores. Detection thresholds are calibrated \emphwithout labels from validation residuals, keeping the pipeline fully unsupervised. On the public TELCO dataset, C-MTAD-GAT consistently outperforms MTAD-GAT and the Telco-specific DC-VAE, two state-of-the-art baselines, in both event-level and pointwise F1, while triggering substantially fewer alarms. C-MTAD-GAT is also deployed in the Core network of a national mobile operator, demonstrating its resilience in real industrial settings.
[LG-49] Distributional Alignment Games for Answer-Level Fine-Tuning
链接: https://arxiv.org/abs/2604.27166
作者: Mehryar Mohri,Jon Schneider,Yifan Wu
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:We focus on the problem of \emphAnswer-Level Fine-Tuning (ALFT), where the goal is to optimize a language model based on the correctness or properties of its final answers, rather than the specific reasoning traces used to produce them. Directly optimizing answer-level objectives is computationally intractable due to the need to marginalize over the vast space of latent reasoning paths. To overcome this, we propose a general game-theoretical framework that lifts the problem to a \emphDistributional Alignment Game. We formulate ALFT as a two-player game between a Policy (the generator) and a Target (an auxiliary distribution). We prove that the Nash Equilibrium of this game corresponds exactly to the solution of the original answer-level optimization problem. This variational perspective transforms the intractable marginalization problem into a tractable projection problem. We demonstrate that this framework unifies recent approaches to diversity and self-improvement (coherence) and provide efficient algorithms compatible with Group Relative Policy Optimization (GRPO), such as Coherence-GRPO, yielding significant complexity gains in mathematical reasoning tasks.
[LG-50] Generalizing the Geometry of Model Merging Through Frechet Averag es
链接: https://arxiv.org/abs/2604.27155
作者: Marvin F. da Silva,Mohammed Adnan,Felix Dangel,Sageev Oore
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model merging aims to combine multiple models into one without additional training. Naïve parameter-space averaging can be fragile under architectural symmetries, as their geometry does not take them into account. In this work we show that not only the geometry, but also the averaging procedure itself, must be symmetry-invariant to achieve symmetry-aware merges. Consequently, we propose a general solution: merging as Fréchet averaging, i.e., selecting parameters that minimize a sum of geodesic distances on an appropriate manifold. In this view, the key design choice is the overall geometry, i.e., the choice of metric, manifold, and distance approximation, that determines what it means for two models to be “close”. We show that Fréchet averaging, combined with simplifying assumptions, contains Fisher merging. Building on this, we examine the particular case of low-rank adapters (LoRA), whose symmetries induce a distinct geometry: that of a quotient manifold. We outline the limitations of current LoRA merging methods, propose a practical algorithm for this setting, and show how they compare with other commonly used approaches.
[LG-51] Better Models Faster Training: Sigmoid Attention for single-cell Foundation Models
链接: https://arxiv.org/abs/2604.27124
作者: Vijay Sadashivaiah,Georgios Dasoulas,Judith Mueller,Soumya Ghosh
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop in replacement for softmax attention a) produces better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss, b) faster training, models with sigmoid attention train up to 10% faster than their softmax counterparts, and c) more stable training by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives ( \leq 0.25 ) as opposed to softmax, and a diagonal Jacobian structure in contrast with softmax’s dense coupling, which together help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient clipping on 8K-token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open-source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention-2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at this https URL
[LG-52] AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
链接: https://arxiv.org/abs/2604.27089
作者: Ahan Gupta,Zhihao Wang,Neel Dani,Masahiro Tanaka,Olatunji Ruwase,Minjia Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注: 13 pages, 9 figures, 1 table
Abstract:Large-language-models (LLMs) demonstrate enormous utility in long-context tasks which require processing prompts that consist of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy to use abstractions to optimize for long-context training, instead focusing on optimizations for models with large parameter counts through ZeRO-3/FSDP, Tensor and Pipeline parallelism. This forces users to rewrite LLM training libraries to incorporate compositions of various complex long-context optimizations, such as sequence-parallelism, to training pipelines; a process that requires in-depth expertise, reducing developer productivity. To tackle these challenges, we introduce AutoSP: the first automated solution to automatically optimize LLM training for longer-contexts. AutoSP compiles models and applies a targeted set of optimizations: automated sequence parallelism, and long-context aware activation-checkpointing, to drastically enhance LLM trainability at negligible cost to throughput. Our evaluation demonstrates AutoSP’s capability on both NVIDIA and AMD hardware, increasing training contexts by upto 2.7 \times and 2.5 \times respectively over competitive hand-written baseline at negligible cost to runtime performance.
[LG-53] Co-Evolving Policy Distillation
链接: https://arxiv.org/abs/2604.27083
作者: Naibin Gu,Chenxu Yang,Qingyi Si,Chuanyu Qin,Dingyu Yao,Peng Fu,Zheng Lin,Weiping Wang,Nan Duan,Jiaqi Wang
类目: Machine Learning (cs.LG)
*备注: Work in progress
Abstract:RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mixed RLVR suffers from inter-capability divergence cost, while the pipeline of first training experts and then performing OPD, though avoiding divergence, fails to fully absorb teacher capabilities due to large behavioral pattern gaps between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which encourages parallel training of experts and introduces OPD during each expert’s ongoing RLVR training rather than after complete expert training, with experts serving as mutual teachers (making OPD bidirectional) to co-evolve. This enables more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge throughout. Experiments validate that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The model parallel training pattern offered by CoPD may inspire a novel training scaling paradigm.
[LG-54] Learning to Forget: Continual Learning with Adaptive Weight Decay
链接: https://arxiv.org/abs/2604.27063
作者: Aditya A. Ramesh,Alex Lewandowski,Jürgen Schmidhuber
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Preprint version
Abstract:Continual learning agents with finite capacity must balance acquiring new knowledge with retaining the old. This requires controlled forgetting of knowledge that is no longer needed, freeing up capacity to learn. Weight decay, viewed as a mechanism for forgetting, can serve this role by gradually discarding information stored in the weights. However, a fixed scalar weight decay drives this forgetting uniformly over time and uniformly across all parameters, even when some encode stable knowledge while others track rapidly changing targets. We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.
[LG-55] Cross-Subject Generalization for EEG Decoding: A Survey of Deep Learning Methods
链接: https://arxiv.org/abs/2604.27033
作者: Taida Li,Yujun Yan,Fei Dou,Wenzhan Song,Xiang Zhang
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted manuscript in Progress in Biomedical Engineering
Abstract:Deep learning for cross-subject EEG decoding is hindered by high inter-subject variability, which introduces a severe domain shift between training and unseen test subjects. This survey presents a comprehensive review of deep learning methodologies specifically engineered to address this cross-subject generalization challenge. To ground this analysis, we formalize the cross-subject setting as a multi-source domain problem and delineate the rigorous, subject-independent evaluation protocols required for valid assessment. Central to this survey is a systematic taxonomy of the current literature into discrete methodological families, including feature alignment, adversarial learning, feature disentanglement, and contrastive learning. We conclude by examining three critical elements for advancing robust, real-world decoding: the theoretical limitations of current methodologies, the structural value of subject identity, and the emergence of EEG foundation models.
[LG-56] LLM -Guided Runtime Parameter Optimization for Energy-Efficient Model Inference
链接: https://arxiv.org/abs/2604.27032
作者: Katelyn Crumpacker,Dimitrios Nikolopoulos
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 8 pages, 8 figures
Abstract:Large Language Models (LLMs) have become an integral part of many real-world workflows. However, LLMs consume a lot of energy, which becomes a large concern in the scale of the demand for these tools. As LLMs become integrated into different workflows, different applications have arisen to deal with the challenge of running inference for these tools. This raises another issue of choosing the runtime parameter values for these services in order to minimize the energy consumption. Oftentimes this requires deep knowledge of the application or traditional optimization methods that can take days to find optimal values. In this work, we created a human-in-the-loop flow with LLM-assisted runtime parameter optimization in order to solve this issue. With human-created, specific feedback prompting methods, chat-based LLMs can iteratively find energy-efficient inference parameters faster than traditional search methods. LLMs can also tailor their solutions to different hardware setups and easily take into account other system constraints. The enhanced prompt template was able to converge below the threshold at an average of 3.4 prompts compared to the baseline, which converged in an average of 5.2 prompts, and consistently achieved lower final energy per token. The enhanced prompt template also outperformed Sobol sampling in convergence speed.
[LG-57] Fidelity Diversity and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation
链接: https://arxiv.org/abs/2604.27014
作者: Guillermo Iglesias,Gema Bello-Orgaz,María Navas-Loro,Cristian Ramirez-Atencia,Mercè Salvador Robert,Enrique Baca-Garcia
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 9 pages, 1 figure, 1 table
Abstract:The scarcity of high-quality annotated medical data, particularly in mental health, poses a significant bottleneck for training robust machine learning models. Privacy regulations restrict data sharing, making synthetic data generation a promising alternative. The use of Large Language Models (LLMs) in a data augmentation pipeline could be leveraged as an alternative in this field. In the proposed methodology, DeepSeek-R1, OpenBioLLM-Llama3 and Qwen 3.5 are used to generate synthetic mental health evaluation reports conditioned on specific International Classification of Diseases, Tenth Revision (ICD-10) codes. Because naive text generation can lead to mode collapse or privacy breaches (memorization), a comprehensive evaluation framework is introduced. The generated diagnostic texts are assessed across three dimensions: semantic fidelity, lexical diversity, and privacy/plagiarism. The results demonstrate that all models can generate clinically coherent, diverse, and privacy-safe synthetic reports, significantly expanding the available training data for clinical natural language processing tasks without compromising patient confidentiality.
[LG-58] EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures
链接: https://arxiv.org/abs/2604.27004
作者: Gustav Olaf Yunus Laitinen-Fredriksson Lundstrom-Imanov,Taner Yilmaz
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 9 pages, 6 figures, 10 tables. Submitted to IEEE Internet of Things Journal
Abstract:We propose EdgeSpike, a co-designed spiking neural network (SNN) framework for autonomous low-power sensing in edge Internet of Things (IoT) architectures. EdgeSpike unifies (i) a hybrid surrogate-gradient and direct-encoding training pipeline, (ii) a hardware-aware neural architecture search (NAS) bounded by per-inference energy and memory budgets, (iii) an event-driven runtime targeting Intel Loihi 2, SpiNNaker 2, and commodity ARM Cortex-M microcontrollers with custom spike-sparse SIMD kernels, and (iv) a lightweight local plasticity rule enabling continual on-device adaptation without backpropagation. The framework is evaluated across five sensing tasks (keyword spotting, vibration-based machine fault detection, surface electromyography gesture recognition, 77 GHz radar human-activity classification, and structural-health acoustic-emission monitoring) on three hardware targets. EdgeSpike achieves a mean classification accuracy of 91.4%, within 1.2 percentage points (pp) of strong INT8 convolutional neural network (CNN) baselines (mean 92.6%), while reducing energy per inference by 18x to 47x on neuromorphic hardware (mean 31x) and by 4.6x to 7.9x on Cortex-M (mean 6.1x). End-to-end latency remains at or below 9.4 ms across all 15 task-hardware configurations. A seven-month, 64-node wireless field deployment confirms a 6.3x extension in projected battery lifetime (from 312 to 1978 days at 2 Wh per node) and bounded accuracy degradation under seasonal drift (0.7 pp with on-device adaptation versus 2.1 pp without). Hardware-aware NAS evaluates 8400 candidates and yields a 12-point Pareto front. EdgeSpike will be released as open source with reproducible training pipelines, hardware-portable runtimes, and benchmark suites.
[LG-59] State-Dependent Lyapunov Method for Rank-1 Matrix Factorization
链接: https://arxiv.org/abs/2604.26993
作者: Jaehong Moon
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We study gradient descent for rank-1 matrix factorization through a certificate-based viewpoint. The central object is a parameterized quadratic certificate I(\delta;,\cdot) whose level sets shrink along the dynamics, thereby inducing a monotone state parameter \delta_t . In the certified regime, this mechanism yields convergence to a global minimizer; in the post-critical regime, it forces trajectories toward a terminal balanced manifold. To explain the origin of these certificates, we formulate a state-dependent Lyapunov framework based on structural axioms. Within this framework, the scalar certificate is uniquely determined, and the same local Lagrange analysis constrains the signal and noise blocks of rank-1 extensions. Thus, the certificates arise from the monotonicity structure of the dynamics, rather than from ad hoc algebraic constructions. We also provide numerical evidence beyond the proved cases. For the 2-dimensional rank-1 approximation problem X=\mathrmdiag(1,\sigma) with \sigma\in(0,1) , the experiments are consistent with the existence of a C^1 admissible certificate branch. For the quartic-augmented scalar loss \frac12(ab-1)^2+\mu(ab-1)^4 , the same scalar certificate remains predictive for several values of \mu after choosing an empirical threshold. These experiments suggest that the state-dependent Lyapunov method may extend beyond the settings proved in this paper. Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2604.26993 [math.NA] (or arXiv:2604.26993v1 [math.NA] for this version) https://doi.org/10.48550/arXiv.2604.26993 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-60] Monitoring Neural Training with Topology: A Footprint-Predictable Collapse Index
链接: https://arxiv.org/abs/2604.26984
作者: Alexander Kalinowski
类目: Machine Learning (cs.LG)
*备注:
Abstract:Representational collapse, where embeddings become anisotropic and lose multi-scale structure, can erode downstream performance long before performance metrics react. We propose an online, topology-aware monitor for evolving neural representations that couples Modular Morse Homology Maintenance (MMHM) with a composite Collapse Index (CI). Instead of rebuilding complexes each epoch, we apply sparse edits at a fixed scale and maintain a discrete Morse matching, yielding fast, incremental updates. Across LLM fine-tuning and temporal KGE training, CI provides a low-latency early-warning signal suitable for in-training interventions. Code and experimental scripts will be released publicly
[LG-61] MAEO: Multiobjective Animorphic Ensemble Optimization for Scalable Large-scale Engineering Applications
链接: https://arxiv.org/abs/2604.26973
作者: Omer F. Erdem,Dean Price,Paul Seurin,Majdi I. Radaideh
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 33 pages, 9 figures, 5 tables, under peer review
Abstract:Multiobjective optimization remains challenging for many scientific and engineering problems due to the need to balance convergence, diversity, and computational efficiency across high-dimensional objective landscapes. This work presents the Multiobjective Animorphic Ensemble Optimization (MAEO) framework, a parallelizable ensemble strategy that unifies state-of-the-art evolutionary algorithms within an island-based architecture, overcoming the limitations of relying on a single optimizer, as implied by the No Free Lunch theorem. MAEO uses a parameter-free hypervolume indicator for island performance assessment and a strict Pareto-rank-based individual scoring formulation that incorporates crowding distance and nadir-point proximity to ensure consistent selection pressure within each front. The framework is initiated using four algorithms (NSGA-III, CTAEA, AGEMOEA2, SPEA2) and evaluated through extensive benchmarking on 12 DTLZ/ZDT functions under 36 dimensionality settings using Wilcoxon signed-rank tests with both hypervolume and inverse generational distance metrics. Results show that MAEO achieves balanced convergence-diversity performance, outperforming or matching some of the leading multiobjective optimization algorithms across different benchmark problems. To demonstrate practical applicability, MAEO is applied to the equilibrium-cycle optimization of a small modular nuclear reactor. Eight discrete design variables (and three objectives (levelized cost of electricity, peak soluble boron concentration, fuel cycle length) are optimized under two safety constraints. The algorithm carried out roughly 40000 evaluations using computer simulations. MAEO identifies core designs that lower both the levelized cost of electricity and the peak boron concentration, while preserving fuel cycle length and meeting all safety constraints.
[LG-62] How Hard Is Continuous Clustering? Lower Bounds from the Existential Theory of the Reals
链接: https://arxiv.org/abs/2604.26972
作者: Angshul Majumdar
类目: Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:
Abstract:This paper studies the computational difficulty of clustering problems that are defined directly on a continuous probability density. Rather than working with finite samples, we assume the density is given as a polynomial and ask whether it contains certain cluster structures. Four natural questions are examined. First, do there exist several points with high density that are far apart from each other. Second, do two high density points have a midpoint with low density, creating a valley between them. Third, does the region where the density is above a threshold have at least a given number of separate connected pieces. Fourth, does that same region contain a hole, meaning a loop that cannot be shrunk to a point. We prove that the first two problems, separated points and valley detection, are exactly as hard as the existential theory of the reals, a complexity class that contains NP and is believed to be strictly larger. In contrast, the topological problems of counting connected pieces and detecting holes are at least as hard as the existential theory of the reals, but their exact complexity remains open. Placing them inside that class would need a major advance in real algebraic geometry. These results give the first rigorous classification of exact continuous clustering inside the real polynomial hierarchy. They also show that even basic clustering criteria are not NP complete unless unexpected collapses occur.
[LG-63] Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders
链接: https://arxiv.org/abs/2604.28176
作者: Emma Andrews,Sahan Sanjaya,Prabhat Mishra
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning models can learn from data samples to carry out various tasks efficiently. When data samples are adversarially manipulated, such as by insertion of carefully crafted noise, it can cause the model to make mistakes. Quantum machine learning models are also vulnerable to such adversarial attacks, especially in image classification using variational quantum classifiers. While there are promising defenses against these adversarial perturbations, such as training with adversarial samples, they face practical limitations. For example, they are not applicable in scenarios where training with adversarial samples is either not possible or can overfit the models on one type of attack. In this paper, we propose an adversarial training-free defense framework that utilizes a quantum autoencoder to purify the adversarial samples through reconstruction. Moreover, our defense framework provides a confidence metric to identify potentially adversarial samples that cannot be purified the quantum autoencoder. Extensive evaluation demonstrates that our defense framework can significantly outperform state-of-the-art in prediction accuracy (up to 68%) under adversarial attacks.
[LG-64] Mapping the Phase Diagram of the Vicsek Model with Machine Learning
链接: https://arxiv.org/abs/2604.28167
作者: Grace T. Bai,Brandon B. Le
类目: oft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures
Abstract:In this study, we use machine learning to classify and interpolate the phase structure of the Vicsek flocking model across the three-dimensional parameter space (\eta,\rho,v_0) . We construct a dataset of simulated parameter points and characterize each point using long-time dynamical observables. These observables are then used as inputs to a K-Means clustering procedure, which assigns each point to a disorder, order, or coexistence phase. Using these clustered labels, we train a neural-network classifier to learn the mapping from model parameters to phase behavior, achieving a classification accuracy of 0.92. The resulting phase map resolves a narrow coexistence region separating the ordered and disordered phases and extends the inferred phase boundaries beyond the originally sampled simulation points. More broadly, this approach provides a systematic way to convert sparse simulation data into a global phase diagram for collective-motion models.
[LG-65] Sequential Inference for Gaussian Processes: A Signal Processing Perspective
链接: https://arxiv.org/abs/2604.28163
作者: Daniel Waxman,Fernando Llorente,Petar M. Djurić
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 53 pages, 7 figures. Accepted to IEEE Signal Processing Magazine
Abstract:The proliferation of capable and efficient machine learning (ML) models marks one of the strongest methodological shifts in signal processing (SP) in its nearly 100-year history. ML models support the development of SP systems that represent complex, nonlinear relationships with high predictive accuracy. Adapting these models often requires sequential inference, which differs both theoretically and methodologically from the usual paradigm of ML, where data are often assumed independent and identically distributed. Gaussian processes (GPs) are a flexible yet principled framework for modeling random functions, and they have become increasingly relevant to SP as statistical and ML methods assume a more prominent role. We provide a self-contained, tutorial-style overview of GPs, with a particular focus on recent methodological advances in sequential, incremental, or streaming inference. We introduce these techniques from a signal-processing perspective while bridging them to recent advances in ML. Many of the developments we survey have direct applications to state-space modeling, sequential regression and forecasting, anomaly detection in time series, sequential Bayesian optimization, adaptive and active sensing, and sequential detection and decision-making. By organizing these advances from a signal-processing perspective, we intend to equip practitioners with practical tools and a coherent roadmap for deploying sequential GP models in real-world systems.
[LG-66] Assessing the Role of Intersection Proximity in Pedestrian Crashes: Insights from Data Mining Approach
链接: https://arxiv.org/abs/2604.28065
作者: Ahmed Hossain,Xiaoduan Sun,Subasish Das
类目: Physics and Society (physics.soc-ph); Machine Learning (cs.LG)
*备注: 59 pages, 14 figures
Abstract:Although intersections are the most complex parts of the roadway network, pedestrian crashes at non-intersection locations are disproportionately frequent, highlighting a serious traffic safety concern. This study investigates non-intersection crashes involving pedestrians using a crash database (2017-2021) collected from Louisiana State. As the risk of pedestrian crashes tends to vary with distance from the intersection, the research team utilized a unique framework “distance to intersection” to capture the differences in crash patterns at non-intersection locations. The study identified that around 50% of non-intersection pedestrian crashes occurred within 198 ft. of the intersection. In the next step, the collected 3,135 pedestrian crashes at non-intersection locations during the study period were subdivided into three zones: D1 zone designates crashes occurring within 150 ft. of an intersection (1,277 crashes), D2 zone designates crashes occurring within 151 ft. to 435 ft. of an intersection (1,060 crashes) and D3 zone designates crashes occurring at 435 ft. or higher from an intersection (798 crashes). To explore the complex interaction of multiple factors, an intuitive data mining technique, Association Rules Mining was used. A total of the top 60 interesting association rules (20 for each zone) were identified by the algorithm (based on lift and support measures). In addition, a total of 124 rules were explored based on Lift Increase Criterion (LIC) measure. The findings of this research provide critical insights into pedestrian crash involvement at non-intersection locations and the variation in crash patterns according to the “distance to intersection”. Based on the findings, some of the targeted problem-specific countermeasures are also recommended to address the crash patterns at non-intersection locations.
[LG-67] Diffusion-OAMP for Joint Image Compression and Wireless Transmission
链接: https://arxiv.org/abs/2604.27952
作者: Wentao Hou,Yimin Bai,Zelei Luo,Jiadong Hong,Lei Liu
类目: Image and Video Processing (eess.IV); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, 2 tables, submitted for a possible publication
Abstract:Joint image compression and wireless transmission remain relatively underexplored compared to generic image restoration, despite its importance in practical communication systems. We formulate this problem under an equivalent linear model, and propose Diffusion-OAMP, a training-free reconstruction framework that embeds a pre-trained diffusion model into the OAMP algorithm. In Diffusion-OAMP, the OAMP linear estimator produces pseudo-AWGN observations, while the diffusion model serves as a nonlinear estimator under an SNR-matching rule. This framework offers a way to incorporate multiple generative priors into OAMP. Experiments with varying compression ratios and noise levels show that Diffusion-OAMP performs favorably against classic methods in the evaluated settings.
[LG-68] Prediction-powered Inference by Mixture of Experts
链接: https://arxiv.org/abs/2604.27892
作者: Yanwu Gu,Linglong Kong,Dong Xia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:The rapidly expanding artificial intelligence (AI) industry has produced diverse yet powerful prediction tools, each with its own network architecture, training strategy, data-processing pipeline, and domain-specific strengths. These tools create new opportunities for semi-supervised inference, in which labeled data are limited and expensive to obtain, whereas unlabeled data are abundant and widely available. Given a collection of predictors, we treat them as a mixture of experts (MOE) and introduce an MOE-powered semi-supervised inference framework built upon prediction-powered inference (PPI). Motivated by the variance reduction principle underlying PPI, the proposed framework seeks the mixture of experts that achieves the smallest possible variance. Compared with standard PPI, the MOE-powered inference framework adapts to the unknown performance of individual predictors, benefits from their collective predictive power, and enjoys a best-expert guarantee. The framework is flexible and applies to mean estimation, linear regression, quantile estimation, and general M-estimation. We develop non-asymptotic theory for the MOE-powered inference framework and establish upper bounds on the coverage error of the resulting confidence intervals. Numerical experiments demonstrate the practical effectiveness of MOE-powered inference and corroborate our theoretical findings.
[LG-69] Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing
链接: https://arxiv.org/abs/2604.27883
作者: Max Lovig
类目: atistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 43 Pages, 7 Figures
Abstract:In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact realization of training data; this drives the systematic ``generalization gap’', where the train error becomes an unreliable proxy for test error. Existing approaches either argue this gap is benign through complex analysis or sacrifice data to a validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity – enforcing the train error to asymptotically track the test error for stylized Gaussian mixture models. Within this specific regime, leveraging approximate message passing theory, DD iteratively cancels the biases due to data reuse, rigorously demonstrating the feasibility of zero-cost validation and 100% data utilization. Moreover, DD is governed by a low-dimensional state evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, yielding superior performance compared to GD; additionally, we implement noisy MNIST and non-linear probing of CIFAR-10, demonstrating that even when our stylized assumptions are relaxed, DD narrows the generalization gap compared to GD.
[LG-70] Data-Efficient Indentation Size Effect Correction in Steels Using Machine Learning and Physics-Guided Augmentation
链接: https://arxiv.org/abs/2604.27775
作者: Radmir Karamov,Tagir Karamov
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: Preprint, 19 pages, 8 figures, 4 tables
Abstract:Shallow nanoindentation enables mechanical characterization of thin films, individual phases and other volume-constrained materials, but measured hardness is often inflated by the indentation size effect (ISE), contact-area errors and tip-geometry artifacts. Classical ISE corrections such as the Nix-Gao require a deep linear regime and are unreliable when only shallow measurements are used. This study investigates how a small experimental dataset can be used to predict a reference hardness with physics-guided feature engineering and augmentation. Approximately 700 experimental indentations were collected from three steel reference specimens covering a hardness range of 2-6.5 GPa and augmented using physically motivated variations representing instrumental noise, session-level drift, and local multiphase boundary blending. The input space combined Oliver-Pharr values with mechanics descriptors, including indentation work partitioning, ( H\text/E_r ), and the area-invariant compliance proxy ( P_\max\text/S^2 ). Ridge Regression (RR), Random Forest, XGBoost, and Neural Networks (NN) were evaluated using a quarantined fourth steel specimen tested at staggered loads. The hardness mapping was nonlinear: RR failed, whereas nonlinear models achieved ( R^2 0.98 ) internally. A constrained (64-8-64) NN gave the best results, reaching RMSE = 0.470 GPa, MAPE = 5.4% on the quarantined steel. Unlike Nix-Gao analysis, the NN produced stable estimates in the shallow regime. SHAP and latent-space analysis showed reliance on area-invariant and energy-based descriptors. The results demonstrate the feasibility of a this workflow for ISE correction in steels using small datasets and suggest a pathway toward data-efficient characterization of any volume constrained materials.
[LG-71] Sampling two-dimensional spin systems with transformers
链接: https://arxiv.org/abs/2604.27738
作者: Piotr Białas,Piotr Korcyl,Tomasz Stebel,Adam Stefański,Dawid Zapolski
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
*备注: 15 pages, 7 figures
Abstract:Autoregressive Neural Networks based on dense or convolutional layers have recently been shown to be a viable strategy for generating classical spin systems. Unlike these methods, sampling with transformers is commonly considered to be computationally inefficient. In this work, we propose a novel approach to transformer-based neural samplers in which we generate not a single spin per step but groups of spins. As an additional improvement, we construct a model of approximated probabilities, further improving the efficiency of the algorithm. Despite our approach being computationally heavier than dense networks or CNN-based approaches, we were able to sample larger systems of up to 180 \times 180 spins in case of the Ising model. The Effective Sample Size of our sampler is \sim 20 times larger than that of the previous state-of-the-art neural sampler when trained for the 128 \times 128 Ising model at critical temperature. Finally, we also test our algorithm on the 2D Edwards-Anderson model, where we train 64\times 64 spin systems.
[LG-72] Bayesian X-Learner: Calibrated Posterior Inference for Heterogeneous Treatment Effects under Heavy-Tailed Outcomes
链接: https://arxiv.org/abs/2604.27394
作者: Eichi Uehara
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 47 pages, 7 figures, 25 tables. Code: this https URL . Prepared for submission to TMLR
Abstract:Conditional Average Treatment Effect (CATE) estimation in practice demands three properties simultaneously: heterogeneous effects \tau(x) , calibrated uncertainty over them, and robustness to the heavy tails that contaminate real outcome data. Meta-learners (Künzel et al., 2019) give (i); causal forests and BART give (i)-(ii) with Gaussian-tail assumptions; no widely used tool gives all three. We present Bayesian X-Learner, an X-Learner built on cross-fitted doubly robust pseudo-outcomes (Kennedy, 2020) with a full MCMC posterior over \tau(x) via a Welsch redescending pseudo-likelihood. On Hill’s IHDP benchmark the default configuration attains mean \sqrt\varepsilon_\mathrmPEHE = 0.56 on 5 replications (lowest mean; differences from S-/T-/X-learners, full-config Causal BART, and a causal forest baseline are not significant at \alpha=0.05 , and rank ordering is unstable at 10 replications – IHDP comparisons are competitive rather than dominant). On contaminated “whale” DGPs with up to 20-25% tail density, a one-flag extension (contamination_severity) that selects a Huber- \delta nuisance loss per Huber’s minimax- \delta relation recovers RMSE \approx 0.13 with tight credible intervals (single-cross-fit 30-seed coverage 83% [Wilson 66%, 93%] at 20% density; modular-Bayes pooling with Bayesian-bootstrap nuisance draws restores nominal 95% coverage).
[LG-73] A Novel Computational Framework for Causal Inference: Tree-Based Discretization with ILP-Based Matching
链接: https://arxiv.org/abs/2604.27307
作者: Tianyu Yang,Md. Noor-E-Alam
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Causal inference is essential for data-driven decision-making, as it aims to uncover causal relationships from observational data. However, identifying causality remains challenging due to the potential for confounding and the distinction between correlation and causation. While recent advances in causal machine learning and matching algorithms have improved estimation accuracy, these methods often face trade-offs between interpretability and computational efficiency. This paper proposes a novel approach that combines a tree-based discretization technique, tailored for causal inference, with an integer linear programming-based matching algorithm. The discretization ensures approximately linear relationships for control datasets within strata, enabling effective matching, while the optimization framework optimizes for global balance. The resulting algorithm yields computational efficiency and less biased ATT estimates compared to state-of-the-art algorithms. Empirical evaluations demonstrate the proposed method’s practical advantages over existing techniques in causal inference scenarios.
[LG-74] Linear Models Variable Selection Artificial Intelligence
链接: https://arxiv.org/abs/2604.27191
作者: By Riyadh Alrawkan,Edward Boone,Ryad Ghanam,Anton Westveld
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Variable selection in linear regression models has been a problem since hypothesis testing began. Which variables to include or exclude from a model is not an easy task. Techniques such as Forward, Back ward, Stepwise Regression sequentially add or delete variables from a model. Penalized likelihood methods such as AIC, BIC, etc. seek to choose variables that have a significant contribution to the likelihood. Penalized sum of square methods such as LASSO and Elastic Net have been used to penalize small coefficients to only allow variables with large coefficients in the model. This work introduces an Artificial Intelligence approach to model selection where an ANN is trained to determine the significance of the variables based on OLS estimates. A simulation study shows the accuracy across various sample sizes and variances. Furthermore, a simulation study is conducted to compare the performance of the approach against Forward, Backward, AIC, BIC and LASSO. The approach is illustrated using a dataset from the World Health Organization regarding Life Expectancy. A github link is provided to the pretrained ANN that can handle up to 100 predictor variables, the original WHO dataset and the subset used in this work.
[LG-75] Man Machine and Mathematics
链接: https://arxiv.org/abs/2604.27052
作者: Akshunna S. Dogra
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注: 31 pages, 8 figures, 3 appendices. Survey article/scientific manifesto covering learning and optimisation from a broad perspective, especially within computational contexts
Abstract:Nonlinear models and optimization methods have successfully tackled a rapidly growing set of problems in recent years. Indeed, a relatively small toolbox of such models and methods can provide sufficient performance across a large landscape of tasks: deep learning alone has made significant recent contributions in scientific modelling, natural language processing, visual analysis, etc. A similar relationship exists between physical theories and phenomena, where many applications and observations emerge neatly from remarkably minimal foundations. It is natural to wonder if sparse unified frameworks could be built to steer discussion and discovery in the fields concerned with learning, optimization, and modelling. In this work, we posit and examine a possible outline for such a unified theory, interpreting the notion of ‘‘learning’’ in a broad sense. In particular, we pursue our goals by viewing learning as an inter-connected process on multiple levels: problem setup, choosing methods, and the analysis of their interplay via imposed optimisation dynamics. We begin by proposing a precise yet versatile definition for ‘‘solvable’’ problems. We then define the ‘‘parametrised methods’’ by which their solution(s) may be ‘‘learned’’. Our goal is to sketch a ‘‘universal convergence theorem’’, specifying how and when solvable problems become amenable to the methods chosen for them. We find these constructions reduce the study of learning down to remarkably few ideas and tools - many of which are simply adapted from existing ones in dynamical systems theory, geometry, and fundamental physics.
[LG-76] SCOPE-FE: Structured Control of Operator and Pairwise Exploration for Feature Engineering
链接: https://arxiv.org/abs/2604.27025
作者: Minhee Park,Seongyeon Son,Yonghyun Lee,Eunchan Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Automatic feature engineering is an effective approach for improving predictive performance in tabular learning. However, expand-and-reduce methods, such as OpenFE, become increasingly computationally expensive as the input dimensionality grows. This limitation arises primarily from the combinatorial explosion of candidate features generated through operator-feature combinations. To address this issue, we propose SCOPE-FE, a structured search space control framework that improves efficiency by reducing the candidate space prior to feature generation. SCOPE-FE jointly regulates two major sources of combinatorial growth: the operator space and feature-pair space. First, OperatorProbing estimates the dataset-specific utility of candidate operators and eliminates low-contribution operators in advance. Second, FeatureClustering employs spectral embedding and fuzzy c-means clustering to group structurally related features, thereby restricting candidate generation to relevant within-cluster combinations. In addition, we introduce ReliabilityScoring, which incorporates variance across subsamples to stabilize pruning decisions. Experiments on ten benchmark datasets demonstrate that SCOPE-FE substantially reduces feature engineering time while maintaining competitive predictive performance relative to existing baselines. The efficiency gains are particularly pronounced for high-dimensional datasets. These results indicate that structured control of the search space is an effective strategy for scalable automatic feature engineering. The code will be made publicly available upon acceptance.
[LG-77] Validating the Clinical Utility of CineECG 3D Reconstructions through Cross-Modal Feature Attribution ALT
链接: https://arxiv.org/abs/2604.27017
作者: Karol Dobiczek,Maciej Mozolewski,Szymon Bobek,Michał Szafarczyk,Peter van Dam,Grzegorz J. Nalepa
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to the CompHealth workshop at the 26th International Conference on Computational Science
Abstract:Deep learning models for 12-lead electrocardiogram (ECG) analysis achieve high diagnostic performance but lack the intuitive interpretability required for clinical integration. Standard feature attribution methods are limited by the inherent difficulty in mapping abstract waveform fluctuations to physical anatomical pathologies. To resolve this, we propose a cross-modal method that projects feature attributions from high-performance 12-lead ECG models onto the CineECG 3D anatomical space. Our study reveals that while models trained directly on CineECG signals suffer from reduced accuracy and incoherent attributions, the proposed mapping mechanism effectively recovers clinically relevant feature rankings. Validated against a ground-truth dataset of 20 cases annotated by domain experts, the mapped explanations yield a Dice score of 0.56, significantly outperforming the 0.47 baseline of standard 12-lead attributions. These findings indicate that cross-modal averaging mapping effectively filters attribution instability and improves the localization of pathological features, combining the diagnostic expressiveness of standard ECG with the intuitive clarity of anatomical visualization.
附件下载


